In the last post, we saw our network achieve about 60% accuracy. One common way to improve a neural network’s performance is to make it deeper. Before we start adding layers to our network, it’s worth taking a moment to explore one of the key advantages of deep neural networks.
Historically, a lot of effort was invested in crafting hand-engineered features that could be fed to shallow networks (or other learning algorithms). In image detection we might modify the input to highlight horizontal or vertical edges. In voice recognition we might filter out noise or various frequencies not typically found in human speech. Unfortunately, hand-engineering features often required years of expertise and lots of time.
Below is a network created with TensorFlow Playground that demonstrates this point. By feeding modified versions of the input to a shallow network, we are able to train it to recognize a non-linear spiral pattern.
A shallow network is capable of learning complex patterns only when fed modified versions of the input.
A key idea behind deep learning is to do away with hand-engineered features whenever possible. Instead, by making the network deeper, we can convince the network to learn the features it really needs to solve the problem. In image recognition, the first few layers of the network learn to recognize simple features (e.g. edges), while deeper layers respond to more complex features (e.g. human faces). Below, we’ve made the network deeper and removed all dependencies on additional features.
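To make the idea of a hand-engineered feature concrete, here is a minimal sketch in plain Python. It uses a much simpler pattern than the spiral (four points labeled XOR-style by quadrant), and every name in it is made up for illustration. No single linear unit can separate this pattern from the raw inputs, but the same kind of unit fed the product x1 * x2 gets every point right.

```python
# Four points, one per quadrant. The label is 1 when x1 and x2 share a sign.
points = [(1.0, 1.0), (-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0)]
labels = [1, 1, 0, 0]

def linear_unit(features, weights, bias):
    """Predict 1 if the weighted sum of the features is positive, else 0."""
    total = sum(w * f for w, f in zip(weights, features)) + bias
    return 1 if total > 0 else 0

def accuracy(feature_fn, weights, bias):
    """Fraction of points classified correctly using the given features."""
    correct = sum(
        linear_unit(feature_fn(x1, x2), weights, bias) == label
        for (x1, x2), label in zip(points, labels)
    )
    return correct / len(points)

def raw_features(x1, x2):
    # The unmodified input.
    return [x1, x2]

def engineered_features(x1, x2):
    # Hand-engineered feature: the product of the two inputs.
    return [x1 * x2]

# The raw weights are one arbitrary choice; no choice reaches 100% here.
raw_acc = accuracy(raw_features, weights=[1.0, 1.0], bias=0.0)
eng_acc = accuracy(engineered_features, weights=[1.0], bias=0.0)
print("raw inputs:", raw_acc, "engineered feature:", eng_acc)
```

The catch is that someone had to know the product was the right feature to add, which is exactly the expertise deep networks try to make unnecessary.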
Making our network deeper
Let’s try making our network deeper by adding two more layers. We’ll replace layer1_bias with the following:
Note: When discussing the network’s shapes, I ignore the batch dimension. For example, where a shape is [None, 784] I will refer to it as a vector with 784 elements. I find it helps to imagine a batch size of 1 to avoid having to think about more complex shapes.
The first thing to notice is the change in shape. layer1 now accepts an input of 784 values and produces an intermediate vector layer1_output with 500 elements. We then pass these 500 values through layer2, which also produces an intermediate vector layer2_output with 500 elements. Finally, we pass these 500 values through layer3 and produce our logit vector with 10 elements.
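The flow of shapes can be traced with a small stand-in. This is a minimal sketch in NumPy rather than TensorFlow, so it is not the code from this post: the weight names, the random initialization and the `forward` helper are assumptions, and only the sizes 784, 500, 500 and 10 come from the text. The ReLU calls are explained further down.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shapes follow the sizes described in the text; the values are arbitrary.
layer1_weights = rng.normal(scale=0.1, size=(784, 500))
layer1_bias = np.zeros(500)
layer2_weights = rng.normal(scale=0.1, size=(500, 500))
layer2_bias = np.zeros(500)
layer3_weights = rng.normal(scale=0.1, size=(500, 10))
layer3_bias = np.zeros(10)

def relu(x):
    """Replace negative values with zero, element by element."""
    return np.maximum(x, 0.0)

def forward(batch):
    """Run a batch of shape [batch_size, 784] through all three layers."""
    layer1_output = relu(batch @ layer1_weights + layer1_bias)
    layer2_output = relu(layer1_output @ layer2_weights + layer2_bias)
    logits = layer2_output @ layer3_weights + layer3_bias
    return layer1_output, layer2_output, logits

# A batch of one flattened 28x28 image, i.e. 784 values.
batch = rng.random((1, 784))
layer1_output, layer2_output, logits = forward(batch)
print(layer1_output.shape, layer2_output.shape, logits.shape)
```

Passing a larger batch, for example `rng.random((32, 784))`, only changes the first dimension of each shape, which is why it is safe to ignore the batch dimension when reasoning about the layers.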
Why did I choose 500 elements? No particular reason; it was just an arbitrary value that seemed to work. If you’re following along at home, you could try adding more layers or making them wider (i.e. use a size larger than 500).
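If you do experiment with depth and width, it helps to know how quickly the number of trainable values grows. The helper below is a hypothetical one written for this illustration; it assumes plain fully connected layers with one bias per output, as described above.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the width of every layer, input first, output last.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # weight matrix
        total += n_out         # bias vector
    return total

print(count_parameters([784, 500, 500, 10]))       # the network in this post
print(count_parameters([784, 1000, 1000, 10]))     # same depth, twice as wide
print(count_parameters([784, 500, 500, 500, 10]))  # same width, one more layer
```

The network in this post has 648,010 parameters. Doubling the width nearly triples that to 1,796,010, while adding one more 500-wide layer brings it to 898,510.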
Another important change is the addition of a ReLU to the hidden layers, including layer2. Note that it is applied to the result of the matrix multiplication of the previous layer’s output with the current layer’s weights.
So what is a ReLU? ReLU stands for “Rectified Linear Unit” and is an activation function. An activation function is applied to the output of each layer of a neural network. It turns out that if we don’t include activation functions, it can be mathematically shown (by people much smarter than me) that our three-layer network is equivalent to a single-layer network. This is obviously a BadThing™ as it means we lose all the advantages of building a deep neural network.
I’m (very obviously) glossing over the details here, so if you’re new to neural networks and want to learn more, see: Why do you need non-linear activation functions?
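The claim that stacked layers without activation functions collapse into one can also be checked numerically. The sketch below uses NumPy with small, arbitrary layer sizes and random weights (all names are illustrative, not from this post). It multiplies the three weight matrices into a single one and compares the outputs, then repeats the comparison with a ReLU between the layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small sizes keep the example quick; the argument is the same at 784/500/10.
w1, b1 = rng.normal(size=(6, 4)), rng.normal(size=4)
w2, b2 = rng.normal(size=(4, 4)), rng.normal(size=4)
w3, b3 = rng.normal(size=(4, 3)), rng.normal(size=3)

def three_linear_layers(x):
    """Three layers with no activation function between them."""
    return ((x @ w1 + b1) @ w2 + b2) @ w3 + b3

# Multiplying everything out gives one equivalent weight matrix and bias.
w_single = w1 @ w2 @ w3
b_single = (b1 @ w2 + b2) @ w3 + b3

def single_layer(x):
    """One layer built from the collapsed weights and bias."""
    return x @ w_single + b_single

def three_relu_layers(x):
    """The same weights, with a ReLU after the first two layers."""
    h1 = np.maximum(x @ w1 + b1, 0.0)
    h2 = np.maximum(h1 @ w2 + b2, 0.0)
    return h2 @ w3 + b3

x = rng.normal(size=(5, 6))
linear_matches = np.allclose(three_linear_layers(x), single_layer(x))
relu_matches = np.allclose(three_relu_layers(x), single_layer(x))
print("without activations:", linear_matches)
print("with ReLU:", relu_matches)
```

The first comparison matches because the composition of linear maps is itself a linear map. With these random weights the second should not match, since the ReLU zeroes out some intermediate values and the collapsed single layer can no longer reproduce the result.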
Other historical activation functions include sigmoid and tanh. These days, ReLU is almost always the right choice of activation function and we’ll be using it exclusively for our networks.
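For reference, the three activation functions mentioned here are short enough to write out by hand. This is a plain Python sketch on single numbers, with function names chosen for this example; a real network applies them element-wise to whole tensors.

```python
import math

def relu(x):
    """Rectified Linear Unit: pass positive values through, clamp the rest to 0."""
    return max(0.0, x)

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Squash any real number into the range (-1, 1)."""
    return math.tanh(x)

for x in (-2.0, 0.0, 2.0):
    print(x, relu(x), round(sigmoid(x), 4), round(tanh(x), 4))
```

Sigmoid and tanh flatten out for large positive or negative inputs, while ReLU keeps growing for positive inputs.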
Finally, one other small change needs to be made: the learning rate needs to be changed to 0.0001. The learning rate is one of the most important, but also most finicky, hyperparameters to choose when training your network. Too small and the network takes a very long time to train; too large and your network doesn’t converge. In later posts we’ll look at methods that can help with this, but for now I’ve just used the ol’ fashioned “Guess and Check” method until I found a learning rate that worked well.
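The trade-off between too small and too large can be seen on a toy problem. This is a minimal sketch in plain Python that minimizes f(x) = x² by gradient descent; it is not the MNIST network from this post. The function name and the three learning rates are arbitrary choices for illustration, and what counts as too small or too large depends entirely on the problem.

```python
def gradient_descent(learning_rate, steps=50, start=1.0):
    """Minimize f(x) = x**2 from `start` and return the final x."""
    x = start
    for _ in range(steps):
        gradient = 2.0 * x  # derivative of x**2
        x -= learning_rate * gradient
    return x

for learning_rate in (0.0001, 0.1, 1.1):
    final_x = gradient_descent(learning_rate)
    print(f"learning_rate={learning_rate}: |x| after 50 steps = {abs(final_x):.4g}")
```

The smallest rate barely moves x from its starting point, the middle one ends up very close to the minimum at 0, and the largest one overshoots further on every step and never converges.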
Alchemy of Hyperparameters
We’ve started to see a few hyperparameters that we must choose when building a neural network:
- Number of layers
- Width of layers
- Learning rate
It’s an uncomfortable reality that we have no good way to choose values for these hyperparameters. What’s worse is that we typically can’t explain why a certain hyperparameter value works well and others do not. The only reassurance I can offer is:
- Other people think this is a problem
- As you build more networks, you’ll develop a rough intuition for choosing hyperparameter values
Putting it all together
Now that we’ve chosen a learning rate and created more intermediate layers, let’s put it all together and see how our network performs.
After running this code you should see output similar to:
Cost: 4596.864 Accuracy: 7.999999821186066 %
Cost: 882.4881 Accuracy: 30.000001192092896 %
Cost: 609.4177 Accuracy: 51.99999809265137 %
Cost: 494.5303 Accuracy: 56.00000023841858 %
...
Cost: 57.793114 Accuracy: 89.99999761581421 %
Cost: 148.92995 Accuracy: 81.00000023841858 %
Cost: 67.42319 Accuracy: 89.99999761581421 %
Test Cost: 107.98408660641905 Test accuracy: 85.74999994039536 %
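The long tails of digits in the accuracy values look like 32-bit floating point rounding. The sketch below shows one plausible way such numbers arise. It is an assumption, not the code from this post: it uses NumPy, integer class labels, and a hypothetical batch of 100 examples, computes the accuracy as a float32 mean, and then scales it to a percentage.

```python
import numpy as np

def accuracy_percent(logits, labels):
    """Percentage of rows whose largest logit matches the integer label."""
    predictions = np.argmax(logits, axis=1)
    correct = (predictions == labels).astype(np.float32)
    # The mean stays in float32; converting to a Python float afterwards
    # exposes the float32 rounding error as extra digits.
    return float(np.mean(correct)) * 100

# Hypothetical batch of 100 examples where exactly 8 predictions are right.
labels = np.zeros(100, dtype=np.int64)
logits = np.zeros((100, 10), dtype=np.float32)
logits[:8, 0] = 1.0  # first 8 rows predict class 0 (correct)
logits[8:, 1] = 1.0  # remaining rows predict class 1 (wrong)

print("Accuracy:", accuracy_percent(logits, labels), "%")
```

This should print a value a hair under 8 rather than exactly 8, because 0.08 cannot be represented exactly in float32.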
Our network has improved from 60% accuracy to 85% accuracy. This is great progress; clearly things are moving in the right direction! Next week we’ll look at a more complicated neural network structure called a “Convolutional Neural Network”, which is one of the basic building blocks of today’s top image classifiers.
For the sake of completeness, I’ve included a TensorBoard visualization of the network we’ve created below: