*Part of the series Learn TensorFlow Now*

So far we’ve managed to avoid the mathematics of optimization and treated our optimizer as a “black box” that does its best to find good weights for our network. In our last post we saw that it doesn’t always succeed: We had three networks with identical structures but different initial weights and our optimizer failed to find good weights for two of them (when the initial weights were too large in magnitude and when they were too small in magnitude).

I’ve avoided the mathematics primarily because I believe one can become a machine learning practitioner (but probably not researcher) without a deep understanding of the mathematics underlying deep learning. We’ll continue that tradition and avoid the bulk of the mathematics behind the optimization algorithms. That said, I’ll provide links to resources where you can dive into these topics if you’re interested.

There are three optimization algorithms you should be aware of:

- **Stochastic Gradient Descent** – The default optimizer we’ve been using so far
- **Momentum Update** – An improved version of stochastic gradient descent
- **Adam Optimizer** – Typically the best performing optimizer

## Stochastic Gradient Descent

To keep things simple (and allow us to visualize what’s going on) let’s think about a network with just one weight. After we run our network on a batch of inputs we are given a `cost`. Our goal is to adjust the weight so as to minimize that `cost`. For example, the function could look something like the following (with our weight/cost highlighted):

We can obviously look at this function and be confident that we want to increase `weight_1`. Ideally we’d just increase `weight_1` to give us the cost at the bottom of the curve and be done after one step.

In reality, neither we nor the network have any idea of what the underlying function really looks like. We know three things:

- The value of `weight_1`
- The `cost` associated with our (one-weight) network
- A rough estimate of how much we should increase or decrease `weight_1` to get a smaller cost

(That third piece of information is where I’ve hidden most of the math and complexities of neural networks away. It’s the **gradient** of the network, and it is computed for all weights of the network via back-propagation.)
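To make the idea of a gradient concrete, here is a minimal sketch for a one-weight network. The quadratic `cost` function and the `estimate_gradient` helper are assumptions made up for illustration. Real networks get their gradients from back-propagation, not from the finite-difference trick used here.

```python
def cost(weight_1):
    # Hypothetical one-weight cost curve with its minimum at weight_1 = 3.0
    return (weight_1 - 3.0) ** 2


def estimate_gradient(cost_fn, weight, eps=1e-6):
    # Central finite difference: the slope of the cost curve around `weight`
    return (cost_fn(weight + eps) - cost_fn(weight - eps)) / (2 * eps)


weight_1 = 1.0
gradient_for_weight_1 = estimate_gradient(cost, weight_1)

# The three things we know: the weight, its cost, and the gradient
print(weight_1, cost(weight_1), gradient_for_weight_1)
```

The gradient comes out negative here, which says the cost falls as `weight_1` rises, so we want to increase `weight_1`.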

With these three things in mind, a better visualization might be:

So now we still know that we want to increase `weight_1`, but how much should we increase it? This is partially decided by `learning_rate`. Increasing `learning_rate` means that we adjust our weights by larger amounts.

The update step of stochastic gradient descent consists of:

- Find out which direction we should adjust the weights
- Adjust the weights by multiplying `learning_rate` by the gradient


```python
learning_rate = 0.01  # Some human-chosen learning rate
gradient_for_weight_1 = ...  # Compute gradient
# Technically, the gradient tells us how to INCREASE cost, so we go the opposite direction by negating it
weight_1 = weight_1 + (-gradient_for_weight_1 * learning_rate)
```

We have been using this approach whenever we have been using `tf.train.GradientDescentOptimizer`.
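As a rough illustration of the update step and of the role of `learning_rate`, the sketch below repeats the update on a hypothetical quadratic cost. The `cost`, `gradient` and `run_sgd` names, the starting weight and the step count are assumptions chosen for the example. In real training the gradient would come from back-propagation on a batch of inputs, not from a closed-form formula.

```python
def cost(weight_1):
    # Hypothetical cost curve with its minimum at weight_1 = 3.0
    return (weight_1 - 3.0) ** 2


def gradient(weight_1):
    # Exact gradient of the hypothetical cost above
    return 2.0 * (weight_1 - 3.0)


def run_sgd(learning_rate, steps=120, start=0.0):
    weight_1 = start
    for _ in range(steps):
        gradient_for_weight_1 = gradient(weight_1)
        # Step against the gradient, scaled by the learning rate
        weight_1 = weight_1 - learning_rate * gradient_for_weight_1
    return weight_1


small = run_sgd(learning_rate=0.001)
medium = run_sgd(learning_rate=0.01)
print("learning_rate=0.001:", small, cost(small))
print("learning_rate=0.01: ", medium, cost(medium))
```

With the same number of steps, the larger learning rate ends up much closer to the minimum at `3.0`, because each update moves the weight further.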

## Momentum Update

One problem with stochastic gradient descent is that it’s slow: the optimizer can take a long time to converge on a good set of weights. One solution to this problem is to use momentum. Momentum simply means: “If we’ve been moving in the same direction for a long time, we should probably move faster and faster in that direction”.

We can accomplish this by adding a momentum factor (typically `~0.9`) to our previous one-weight example:


```python
velocity = 0  # No initial velocity (defined outside of the optimization loop)
...
momentum = 0.9
learning_rate = 0.01  # Some human-chosen learning rate
gradient_for_weight_1 = ...  # Compute gradient
# Maintain a velocity that keeps increasing if we don't change direction
velocity = (momentum * velocity) - (gradient_for_weight_1 * learning_rate)
weight_1 = weight_1 + velocity
```

We use `velocity` to keep track of the speed and direction in which `weight_1` is increasing or decreasing. In general, momentum update works much better than stochastic gradient descent. For a math-focused look at why, see: Why Momentum Works.

The TensorFlow momentum update optimizer is available at `tf.train.MomentumOptimizer`.
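To see the effect of the velocity term, the sketch below runs plain gradient descent and momentum update side by side. It assumes a hypothetical one-weight cost `(weight_1 - 3.0) ** 2`. The helper names, the starting weight and the step counts are made up for illustration.

```python
def gradient(weight_1):
    # Gradient of the hypothetical cost (weight_1 - 3.0) ** 2
    return 2.0 * (weight_1 - 3.0)


def run_sgd(steps, learning_rate=0.01, start=0.0):
    weight_1 = start
    for _ in range(steps):
        weight_1 = weight_1 - learning_rate * gradient(weight_1)
    return weight_1


def run_momentum(steps, learning_rate=0.01, momentum=0.9, start=0.0):
    weight_1 = start
    velocity = 0.0  # No initial velocity
    for _ in range(steps):
        # Velocity builds up while the gradient keeps pointing the same way
        velocity = (momentum * velocity) - (gradient(weight_1) * learning_rate)
        weight_1 = weight_1 + velocity
    return weight_1


sgd_10, momentum_10 = run_sgd(10), run_momentum(10)
sgd_15, momentum_15 = run_sgd(15), run_momentum(15)
print("after 10 steps:", sgd_10, momentum_10)
print("after 15 steps:", sgd_15, momentum_15)
```

On this toy cost, momentum gets much closer to the minimum at `3.0` within 10 steps. By step 15 its built-up velocity has carried it slightly past `3.0`, while plain gradient descent is still well short of it.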

## Adam Optimizer

The Adam Optimizer is my personal favorite optimizer simply because it seems to work the best. It combines the approaches of multiple optimizers we haven’t looked at, so we’ll leave out the math and instead show a comparison of Adam, Momentum and SGD below:

Instead of using just one weight, this example uses two weights: `x` and `y`. Cost is represented on the `z` axis with blue colors representing smaller values and the star representing the global minimum.

Things to note:

- SGD is very slow. It doesn’t make it to the minimum in the 120 training steps
- Momentum sometimes overshoots its target
- Adam seems to offer a somewhat reasonable balance between the two

The Adam Optimizer is available at `tf.train.AdamOptimizer`.
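For the curious, here is a minimal one-weight sketch of the Adam update rule as published by Kingma and Ba. It is not necessarily what `tf.train.AdamOptimizer` does internally. The quadratic cost, the `run_adam` helper and the learning rate of `0.1` are assumptions chosen for this toy example. The `beta1`, `beta2` and `epsilon` values are the defaults suggested in the paper.

```python
import math


def gradient(weight_1):
    # Gradient of the hypothetical cost (weight_1 - 3.0) ** 2
    return 2.0 * (weight_1 - 3.0)


def run_adam(steps, learning_rate=0.1, beta1=0.9, beta2=0.999,
             epsilon=1e-8, start=0.0):
    weight_1 = start
    m = 0.0  # Running average of gradients (similar in spirit to momentum)
    v = 0.0  # Running average of squared gradients
    for t in range(1, steps + 1):
        g = gradient(weight_1)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # Bias correction: both averages start at zero, so rescale early steps
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        # Step size is normalized by the typical gradient magnitude
        weight_1 = weight_1 - learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return weight_1


after_1 = run_adam(steps=1)
after_10 = run_adam(steps=10)
print(after_1, after_10)
```

Because the update divides by the typical gradient size, the very first step moves the weight by roughly `learning_rate` regardless of how large the gradient is.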