Part of the series Learn TensorFlow Now
At the conclusion of the previous post, we realized that our first convolutional net wasn’t performing very well. It had a comparatively high cost (something we hadn’t seen before) and was performing slightly worse than a fully-connected network with the same number of layers.
Test results from 4-layer fully connected network:
Test Cost: 107.98408660641905 Test accuracy: 85.74999994039536 %
Test results from 4-layer Conv Net:
Test Cost: 15083.0833307 Test accuracy: 81.8799999356 %
As a refresher, here’s a visualization of the 4-layer ConvNet we built in the last post:
So how do we figure out what’s broken?
When writing any typical program we might fire up a debugger or even just use something like printf() to figure out what’s going on. Unfortunately, neural networks make this very difficult for us. We can’t really step through thousands of multiplication, addition and ReLU operations and expect to glean much insight. One common debugging technique is to visualize all of the intermediate outputs and try to see if there are any obvious problems.
Let’s take a look at a histogram of the outputs of each layer before they’re passed through the ReLU non-linearity. (Remember, the ReLU operation simply sets all negative values to zero.)
If you look closely at the above plots you’ll notice that the variance increases substantially at each layer (TensorBoard doesn’t let me adjust the scales of each plot so it’s not immediately obvious). The majority of outputs at layer1_conv are within the range [-1,1], but by the time we get to layer4_conv the outputs vary between [-20,000, 20,000]. If we continue adding layers to our network this trend will continue and eventually our network will run into problems with overflow. In general we’d prefer our intermediate outputs to remain within some fixed range.
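To get a feel for why the outputs grow like this, here’s a minimal sketch in plain Python (it is not the TensorFlow code from this series). It pushes random inputs through four ReLU layers whose weights are drawn from a standard normal distribution and records the spread of each layer’s outputs before the ReLU. The layer width of 64, the helper names and the use of fully-connected layers instead of convolutions are all assumptions made for illustration; only the trend matters.

```python
import math
import random

def std(values):
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def preactivation_stds(weight_std, width=64, depth=4, seed=0):
    """Standard deviation of each layer's outputs before ReLU (toy dense stack)."""
    rng = random.Random(seed)
    activations = [rng.gauss(0.0, 1.0) for _ in range(width)]
    stds = []
    for _ in range(depth):
        # One row of weights per output unit, drawn from N(0, weight_std^2).
        weights = [[rng.gauss(0.0, weight_std) for _ in range(width)]
                   for _ in range(width)]
        pre = [sum(w * a for w, a in zip(row, activations)) for row in weights]
        stds.append(std(pre))
        activations = [max(0.0, v) for v in pre]  # ReLU
    return stds

stds = preactivation_stds(weight_std=1.0)
for layer, value in enumerate(stds, start=1):
    print(f"layer {layer}: std of outputs before ReLU = {value:.1f}")
```

Every output sums 64 weighted inputs, so with unit-variance weights the spread is multiplied at each layer instead of staying put, which is the same trend the histograms show.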
How does this relate to our high cost? Let’s take a look at the values of our predictions. Recall that predictions is computed by applying softmax to logits.
The first thing to notice is that, like the previous layers, the values of logits have a large variance, with some values in the hundreds of thousands. The second thing to notice is that once we take the softmax of logits to create predictions, all of our values are reduced to either 0 or 1. The softmax function normalizes logits so that the ten values add up to 1 and each value represents the probability that a given image shows the corresponding digit. When some of our logits are tens of thousands of times bigger than the others, these values end up dominating the probabilities.
The visualization of predictions tells us that our network is super confident about the predictions it’s making. Essentially our network is claiming that it is 99% sure of its predictions. Whenever our network makes a mistake it is making a huge mistake and receives a large cost penalty for it.
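Here’s a small sketch of that effect in plain Python. The ten logits are made up (they are not taken from our network), and I’m assuming the cost is the usual softmax cross-entropy computed from the logits. The helper names are mine.

```python
import math

def softmax(logits):
    # Subtract the largest logit first so math.exp() never overflows.
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_logits(logits, true_class):
    # Same quantity as -log(softmax(logits)[true_class]), but computed with
    # log-sum-exp so a probability that underflows to 0 never reaches log().
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(v - shift) for v in logits))
    return log_sum_exp - logits[true_class]

# Made-up logits for a single image, with very large magnitudes.
logits = [-9000.0, 2500.0, -400.0, 21000.0, 7000.0,
          -15000.0, 300.0, 6000.0, -1200.0, 50.0]

predictions = softmax(logits)
cost_if_right = cross_entropy_from_logits(logits, true_class=3)
cost_if_wrong = cross_entropy_from_logits(logits, true_class=4)

print("predictions:", predictions)
print("cost when the label is 3:", cost_if_right)
print("cost when the label is 4:", cost_if_wrong)
```

The largest logit takes all of the probability, so predictions contains a single 1 and nine 0s. If the label agrees, the cost is 0. If it doesn’t, the cost equals the gap between the two logits (14000 here), so a handful of confident mistakes is enough to produce an enormous average cost.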
The problem of ever-larger intermediate outputs translates directly into an increased cost. So how do we fix this? We want to restrict the magnitude of the intermediate outputs of our network so they don’t increase so drastically at each layer.
Smaller Initial Weights
Recall that each convolution operation takes the dot product of our weights with a portion of the input. Basically, we’re multiplying and adding up a bunch of numbers similar to the following:
w0*i0 + w1*i1 + w2*i2 + … + wn*in = output
- wx – Represents a single weight
- ix – Represents a single input (eg. pixel)
- n – The number of weights
One way to reduce the magnitude of this expression is to reduce the magnitude of all of our weights by some factor:
0.01*w0*i0 + 0.01*w1*i1 + 0.01*w2*i2 + … + 0.01*wn*in = 0.01*output
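A quick sanity check of this identity in plain Python. The nine weights and nine inputs stand in for a hypothetical 3x3 filter and image patch; the sizes and values are made up for illustration.

```python
import math
import random

rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(9)]  # hypothetical 3x3 filter
inputs = [rng.random() for _ in range(9)]          # hypothetical 3x3 patch of pixels

def dot(ws, xs):
    return sum(w * x for w, x in zip(ws, xs))

output = dot(weights, inputs)
scaled_output = dot([0.01 * w for w in weights], inputs)

print("output:           ", output)
print("scaled output:    ", scaled_output)
print("matches 0.01*output:", math.isclose(scaled_output, 0.01 * output))
```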
Let’s try it and see if it works! We’ll modify the creation of our weights by multiplying them all by 0.01, so layer1_weights would now be defined as its original random initial values times 0.01.
After changing all five sets of weights (don’t forget about the fully-connected layer at the end), we can run our network and see the following test cost and accuracy:
Test Cost: 2.3025865221 Test accuracy: 5.01999998465 %
Yikes! The cost has decreased quite a bit, but that accuracy is abysmal… What’s going on this time? Let’s take a look at the intermediate outputs of the network:
If you look closely at the scales, you’ll see that this time the intermediate outputs are decreasing! The first layer’s outputs lie largely within the interval [-0.02, 0.02] while the fourth layer generates outputs that lie within [-0.0002, 0.0002]. This is essentially the opposite of the problem we saw before.
Let’s also examine the predictions as we did before:
This time the logits vary over a very small interval [-0.003, 0.003] and predictions are completely uniform. The predictions appear to be centered around 0.10 which seems to indicate that our network is simply predicting each of the ten digits with 10% probability. In other words, our network is learning nothing at all and we’re in an even worse state than before!
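That test cost of about 2.3026 is no accident. Assuming the cost is the average softmax cross-entropy (an assumption on my part, since the cost code isn’t shown in this post), a network that assigns 10% to every digit pays -ln(0.1) = ln(10) ≈ 2.3026 per image no matter what the label is. A quick sketch in plain Python, with made-up logits in the range reported above:

```python
import math

def softmax(logits):
    # Subtract the largest logit first for numerical stability.
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for one image, all within [-0.003, 0.003].
logits = [0.003, -0.001, 0.002, -0.003, 0.0,
          0.001, -0.002, 0.003, -0.001, 0.002]
predictions = softmax(logits)

true_class = 4  # arbitrary label; any other choice gives nearly the same cost
cost = -math.log(predictions[true_class])

print("predictions:", [round(p, 4) for p in predictions])
print("cost:", cost)
print("ln(10):", math.log(10))
```

Because every logit is so close to zero, softmax has nothing to work with and the predictions are all roughly 0.1, which is why the accuracy is no better than guessing.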
Choosing the Perfect Initial Weights
What we’ve learned so far:
- Large initial weights lead to very large output in intermediate layers and an over-confident network.
- Small initial weights lead to very small output in intermediate layers and a network that doesn’t learn anything.
So how do we choose initial weights that are not too small and not too large? In 2010, Xavier Glorot and Yoshua Bengio published Understanding the difficulty of training deep feedforward neural networks, in which they proposed initializing a set of weights based on how many input and output neurons are present for a given weight. This initialization scheme is called Xavier Initialization. For more on it, see An Explanation of Xavier Initialization.
It turns out that Xavier Initialization does not work well for layers using the asymmetric ReLU activation function. So while we can use it on our fully connected layer, we shouldn’t use it for our intermediate layers. However, in 2015 Microsoft Research (Kaiming He et al.) published Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In this paper they introduced a modified version of Xavier Initialization, often called He Initialization, which TensorFlow provides as Variance Scaling Initialization.
The math behind these initialization schemes is out of scope for this post, but TensorFlow makes them easy to use. I recommend simply remembering:
- Use Xavier Initialization in the fully-connected layers of your network (or in layers that use softmax/tanh activation functions).
- Use Variance Scaling Initialization in the intermediate layers of your network that use ReLU activation functions.
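If you’re curious what these two schemes boil down to, here’s a rough sketch in plain Python. It uses the weight variances from the two papers: 2 / (fan_in + fan_out) for Xavier and 2 / fan_in for He. It then pushes random inputs through a toy stack of four ReLU layers. The width of 128, the fully-connected layers and the helper names are assumptions for illustration, and TensorFlow’s initializers sample from slightly different distributions, so treat the numbers as illustrative only.

```python
import math
import random

def he_std(fan_in):
    # He et al. (2015): weight variance of 2 / fan_in for ReLU layers.
    return math.sqrt(2.0 / fan_in)

def xavier_std(fan_in, fan_out):
    # Glorot & Bengio (2010): weight variance of 2 / (fan_in + fan_out).
    return math.sqrt(2.0 / (fan_in + fan_out))

def rms(values):
    return math.sqrt(sum(v * v for v in values) / len(values))

def preactivation_rms(weight_std, width=128, depth=4, seed=0):
    """RMS of each layer's outputs before ReLU, for a toy dense ReLU stack."""
    rng = random.Random(seed)
    activations = [rng.gauss(0.0, 1.0) for _ in range(width)]
    history = []
    for _ in range(depth):
        # Each output unit gets its own freshly drawn row of weights.
        pre = [sum(rng.gauss(0.0, weight_std) * a for a in activations)
               for _ in range(width)]
        history.append(rms(pre))
        activations = [max(0.0, v) for v in pre]  # ReLU
    return history

width = 128
schemes = {
    "Xavier": xavier_std(width, width),
    "He": he_std(width),
}
results = {name: preactivation_rms(std) for name, std in schemes.items()}
for name, history in results.items():
    print(f"{name:>6}: " + "  ".join(f"{value:.3f}" for value in history))
```

In this toy setup the He-scaled outputs stay at roughly the same magnitude from layer to layer, while the Xavier-scaled outputs shrink a little at every ReLU layer. ReLU zeroes out about half of the values, and the extra factor of 2 in the He variance is there to compensate for that.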
We can modify the initialization of our convolutional weights, replacing tf.random.normal with tf.contrib.layers.variance_scaling_initializer(). We can also modify the fully connected layer’s weights to use tf.contrib.layers.xavier_initializer().
There are a few small changes to note here. First, we use tf.get_variable instead of calling tf.Variable directly. This allows us to pass in a custom initializer for our weights. Second, we have to provide a unique name for our variable. Typically I just use the same name as my variable name.
If we continue changing all the weights in our network and run it, we can see the following output:
Cost: 2.49579 Accuracy: 9.00000035763 %
Cost: 1.05762 Accuracy: 77.999997139 %
...
Cost: 0.110656 Accuracy: 94.9999988079 %
Test Cost: 0.0945288215741 Test accuracy: 97.2900004387 %
Much better! This is a big improvement over our previous results and we can see that both cost and accuracy have improved substantially. For the sake of curiosity, let’s look at the intermediate outputs of our network:
This looks much better. The variance of the intermediate values appears to increase only slightly as we move through the layers and all values are within about an order of magnitude of one another. While we can’t make any claims about the intermediate outputs being “perfect” or even “good”, we can at least rest assured that there are no glaringly obvious problems with them. (Sidenote: This seems to be a common theme in deep learning: we usually can’t prove we’ve done things correctly, we can only look for signs that we’ve done them incorrectly.)
Thoughts on Weights
Hopefully I’ve managed to convince you of the importance of choosing good initial weights for a neural network. Fortunately when it comes to image recognition, there are well-known initialization schemes that pretty much solve this problem for us.
The problems with weight initialization should highlight the fragility of deep neural networks. After all, we would hope that even if we choose poor initial weights, after enough time our gradient descent optimizer would manage to correct them and settle on good values for our weights. Unfortunately that doesn’t seem to be the case, and our optimizer instead settles into a relatively poor local minimum.