LTFN 7: A Quick Look at TensorFlow Optimizers

Part of the series Learn TensorFlow Now

So far we’ve managed to avoid the mathematics of optimization and treated our optimizer as a “black box” that does its best to find good weights for our network. In our last post we saw that it doesn’t always succeed: we had three networks with identical structures but different initial weights, and our optimizer failed to find good weights for two of them (one whose initial weights were too large in magnitude and one whose initial weights were too small).

I’ve avoided the mathematics primarily because I believe one can become a machine learning practitioner (but probably not researcher) without a deep understanding of the mathematics underlying deep learning. We’ll continue that tradition and avoid the bulk of the mathematics behind the optimization algorithms. That said, I’ll provide links to resources where you can dive into these topics if you’re interested.

There are three optimization algorithms you should be aware of:

  1. Stochastic Gradient Descent – The default optimizer we’ve been using so far
  2. Momentum Update – An improved version of stochastic gradient descent
  3. Adam Optimizer – Often the best-performing optimizer in practice

 

Stochastic Gradient Descent

To keep things simple (and allow us to visualize what’s going on) let’s think about a network with just one weight. After we run our network on a batch of inputs we are given a cost. Our goal is to adjust the weight so as to minimize that cost. For example, the function could look something like the following (with our weight/cost highlighted):

A (completely made up) cost function, with a (completely made up) weight and corresponding cost

 

Looking at this function, it’s obvious that we want to increase weight_1. Ideally we’d just increase weight_1 until we reach the cost at the bottom of the curve and be done after one step.

In reality, neither we nor the network have any idea of what the underlying function really looks like. We know three things:

  1. The value of weight_1
  2. The cost associated with our (one-weight) network
  3. A rough estimate of how much we should increase or decrease weight_1 to get a smaller cost

(That third piece of information is where I’ve hidden away most of the math and complexity of neural networks. It’s the gradient of the cost with respect to the weights, and it is computed for all weights of the network via back-propagation.)

With these three things in mind, a better visualization might be:

It’s a lot harder to tell how far we should increase weight_1 now, isn’t it?

 

So now we still know that we want to increase weight_1, but how much should we increase it? This is partially decided by learning_rate. Increasing learning_rate means that we adjust our weights by larger amounts.

The update step of stochastic gradient descent consists of:

  1. Compute the gradient to find out in which direction we should adjust the weights
  2. Adjust the weights by subtracting the gradient multiplied by learning_rate, which moves them in the direction that lowers the cost

We have been using this approach whenever we have been using tf.train.GradientDescentOptimizer.
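
As a rough illustration of the update step described above, here is a minimal sketch in plain Python (not TensorFlow). The cost function `(weight_1 - 3) ** 2`, its gradient, the starting weight and the learning rates are all made up for this example; in a real network the gradient would come from back-propagation.

```python
def made_up_gradient(weight_1):
    """Gradient of the made-up cost (weight_1 - 3) ** 2 (an assumption for this sketch)."""
    return 2.0 * (weight_1 - 3.0)


def sgd_step(weight_1, gradient, learning_rate):
    """One update: move the weight against the gradient, scaled by learning_rate."""
    return weight_1 - learning_rate * gradient


def run_sgd(weight_1, learning_rate, steps):
    """Apply the update step repeatedly and return the final weight."""
    for _ in range(steps):
        weight_1 = sgd_step(weight_1, made_up_gradient(weight_1), learning_rate)
    return weight_1


# A small learning_rate creeps toward the minimum of the made-up cost (weight_1 = 3).
print(run_sgd(weight_1=0.0, learning_rate=0.1, steps=50))

# A learning_rate that is too large overshoots more and more on every step.
print(run_sgd(weight_1=0.0, learning_rate=1.1, steps=20))
```

With these made-up numbers, any learning_rate above 1.0 overshoots by more than it corrects, so the weight ends up further from the minimum after every step. The exact threshold depends entirely on the cost function, which is why learning_rate usually has to be found by experiment.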

 

Momentum Update

One problem with stochastic gradient descent is that it can be slow: the optimizer may take a long time to converge on a good set of weights. One solution to this problem is to use momentum. Momentum simply means: “If we’ve been moving in the same direction for a long time, we should probably move faster and faster in that direction”.

We can accomplish this by adding a momentum factor (typically ~0.9) to our previous one-weight example:

We use velocity to keep track of the speed and direction in which weight_1 is increasing or decreasing. In general, momentum update works much better than stochastic gradient descent. For a math-focused look at why, see: Why Momentum Works.
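
To make the idea of velocity concrete, here is a minimal sketch of one common formulation of the momentum update in plain Python. It is not necessarily the exact form TensorFlow uses internally, and the cost function `(weight_1 - 3) ** 2`, the starting weight and the learning rate are made up for illustration.

```python
def made_up_gradient(weight_1):
    """Gradient of the made-up cost (weight_1 - 3) ** 2 (an assumption for this sketch)."""
    return 2.0 * (weight_1 - 3.0)


def momentum_step(weight_1, velocity, gradient, learning_rate, momentum=0.9):
    """One momentum update.

    velocity keeps a decaying memory of previous steps, so repeated moves
    in the same direction build up speed.
    """
    velocity = momentum * velocity - learning_rate * gradient
    weight_1 = weight_1 + velocity
    return weight_1, velocity


weight_1, velocity = 0.0, 0.0
for step in range(1, 4):
    weight_1, velocity = momentum_step(
        weight_1, velocity, made_up_gradient(weight_1), learning_rate=0.1
    )
    print(step, weight_1, velocity)
```

On the first step velocity starts at zero, so the update is the same as plain gradient descent. From the second step onward the accumulated velocity is added to the new gradient step, so the weight moves further than it would have without momentum.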

The TensorFlow momentum update optimizer is available at tf.train.MomentumOptimizer.

 

Adam Optimizer

The Adam Optimizer is my personal favorite optimizer simply because, in my experience, it seems to work the best. It combines the approaches of multiple optimizers we haven’t looked at, so we’ll leave out the math and instead show a comparison of Adam, Momentum and SGD below:

 

Instead of using just one weight, this example uses two weights: x and y. Cost is represented on the z axis, with blue colors representing smaller values and the star representing the global minimum.

Things to note:

  1. SGD is very slow. It doesn’t make it to the minimum in the 120 training steps
  2. Momentum sometimes overshoots its target
  3. Adam seems to offer a somewhat reasonable balance between the two
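
If you would like to experiment with this kind of comparison yourself, here is a small sketch in plain Python that runs all three update rules for 120 steps. Everything in it is an assumption made for illustration: the cost function `x**2 + 10 * y**2` is not the surface from the comparison above, the starting point and learning rates are made up, and the Adam update is the commonly published form with its usual default beta values rather than anything taken from this series.

```python
import math


def cost(x, y):
    """Made-up two-weight cost: a bowl that is much steeper along y than x."""
    return x ** 2 + 10.0 * y ** 2


def gradient(x, y):
    """Gradient of the made-up cost with respect to x and y."""
    return 2.0 * x, 20.0 * y


def run_sgd(x, y, steps, learning_rate=0.01):
    for _ in range(steps):
        gx, gy = gradient(x, y)
        x -= learning_rate * gx
        y -= learning_rate * gy
    return x, y


def run_momentum(x, y, steps, learning_rate=0.01, momentum=0.9):
    vx = vy = 0.0
    for _ in range(steps):
        gx, gy = gradient(x, y)
        vx = momentum * vx - learning_rate * gx
        vy = momentum * vy - learning_rate * gy
        x += vx
        y += vy
    return x, y


def run_adam(x, y, steps, learning_rate=0.1, beta1=0.9, beta2=0.999, epsilon=1e-8):
    weights = [x, y]
    m = [0.0, 0.0]  # running average of gradients
    v = [0.0, 0.0]  # running average of squared gradients
    for t in range(1, steps + 1):
        grads = gradient(weights[0], weights[1])
        for i, g in enumerate(grads):
            m[i] = beta1 * m[i] + (1 - beta1) * g
            v[i] = beta2 * v[i] + (1 - beta2) * g * g
            m_hat = m[i] / (1 - beta1 ** t)  # correct the bias toward zero
            v_hat = v[i] / (1 - beta2 ** t)
            weights[i] -= learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return weights[0], weights[1]


start = (-4.0, 2.0)  # made-up starting weights
for name, run in [("SGD", run_sgd), ("Momentum", run_momentum), ("Adam", run_adam)]:
    x, y = run(start[0], start[1], steps=120)
    print(name, cost(x, y))
```

On this made-up bowl, SGD is still noticeably short of the minimum after 120 steps while momentum gets much closer. How Adam compares depends heavily on the learning rate you pick, so treat the numbers as something to play with rather than as a benchmark.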

The Adam Optimizer is available at tf.train.AdamOptimizer.

 

