LTFN 3: Deeper Networks

Part of the series Learn TensorFlow Now

In the last post, we saw our network achieve about 60% accuracy. One common way to improve a neural network’s performance is to make it deeper. Before we start adding layers to our network, it’s worth taking a moment to explore one of the key advantages of deep neural networks.

Historically, a lot of effort was invested in crafting hand-engineered features that could be fed to shallow networks (or other learning algorithms). In image detection we might modify the input to highlight horizontal or vertical edges. In voice recognition we might filter out noise or various frequencies not typically found in human speech. Unfortunately, hand-engineering features often required years of expertise and lots of time.

Below is a network created with TensorFlow Playground that demonstrates this point. By feeding modified versions of the input to a shallow network, we are able to train it to recognize a non-linear spiral pattern.

A shallow network requires various modifications to the input features to classify the “Swiss Roll” problem.

A shallow network is capable of learning complex patterns only when fed modified versions of the input. A key idea behind deep learning is to do away with hand-engineered features whenever possible. Instead, by making the network deeper, we can convince the network to learn the features it really needs to solve the problem. In image recognition, the first few layers of the network learn to recognize simple features (eg. edge detection), while deeper layers respond to more complex features (eg. human faces). Below, we’ve made the network deeper and removed all dependencies on additional features.

A deep network is capable of classifying the points in a “Swiss Roll” using only the original input.

 

Making our network deeper

Let’s try making our network deeper by adding two more layers. We’ll replace layer1_weights and layer1_bias with the following:

Note: When discussing the network’s shapes, I ignore the batch dimension. For example, where a shape is [None, 784] I will refer to it as a vector with 784 elements. I find it helps to imagine a batch size of 1 to avoid having to think about more complex shapes.

The first thing to notice is the change in shape. layer1 now accepts an input of 784 values and produces an intermediate vector layer1_output with 500 elements. We then take these 500 values through layer2 which also produces an intermediate vector layer2_output with 500 elements. Finally, we take these 500 values through layer3 and produce our logit vector with 10 elements.

Why did I choose 500 elements? No reason, it was just an arbitrary value that seemed to work. If you’re following along at home, you could try adding more layers or making them wider (ie. use a size larger than 500).

ReLU

Another important change is the addition of tf.nn.relu() in layer1 and layer2. Note that it is applied to the result of the matrix multiplication of the previous layer’s output with the current layer’s weights.

So what is a ReLU? ReLU stands for “Rectified Linear Unit” and is an activation function. An activation function is applied to the output of each layer of a neural network. It turns out that if we don’t include activation functions, it can be mathematically shown (by people much smarter than me) that our three layer network is equivalent to a single layer network. This is obviously a BadThing™ as it means we lose all the advantages of building a deep neural network.

I’m (very obviously) glossing over the details here, so if you’re new to neural networks and want to learn more see: Why do you need non-linear activation functions?

Other historical activation functions include sigmoid and tanh. These days, ReLU is almost always the right choice of activation function and we’ll be using it exclusively for our networks.

Graphs for ReLU, sigmoid and tanh functions

 

Learning Rate

Finally, one other small change needs to be made: The learning rate needs to be changed from 0.01 to 0.0001. Learning rate is one of the most important, but most finicky hyperparameters to choose when training your network. Too small and the network takes a very long time to train, too large and your network doesn’t converge. In later posts we’ll look at methods that can help with this, but for now I’ve just used the ol’ fashioned “Guess and Check” method until I found a learning rate that worked well.

 

Alchemy of Hyperparameters

We’ve started to see a few hyperparameters that we must choose when building a neural network:

  • Number of layers
  • Width of layers
  • Learning rate

It’s an uncomfortable reality that we have no good way to choose values for these hyperparameters. What’s worse is that we typically can’t explain why a certain hyperparameter value works well and others do not. The only reassurance I can offer is:

  1. Other people think this is a problem
  2. As you build more networks, you’ll develop a rough intuition for choosing hyperparameter values

 

Putting it all together

Now that we’ve chosen a learning rate and created more intermediate layers, let’s put it all together and see how our network performs.

 

 

 

After running this code you should see output similar to:

Cost:  4596.864
Accuracy:  7.999999821186066 %
Cost:  882.4881
Accuracy:  30.000001192092896 %
Cost:  609.4177
Accuracy:  51.99999809265137 %
Cost:  494.5303
Accuracy:  56.00000023841858 %

...

Cost:  57.793114
Accuracy:  89.99999761581421 %
Cost:  148.92995
Accuracy:  81.00000023841858 %
Cost:  67.42319
Accuracy:  89.99999761581421 %
Test Cost:  107.98408660641905
Test accuracy:  85.74999994039536 %

 

Our network has improved from 60% accuracy to 85% accuracy. This is great progress, clearly things are moving in the right direction! Next week we’ll look at a more complicated neural network structure called a “Convolutional Neural Network” which is one of the basic building blocks of today’s top image classifiers.

For the sake of completeness, I’ve included a TensorBoard visualization of the network we’ve created below:

Visualization of our three-layer network with `layer1` expanded. Notice the addition of `layer1_output` following the addition with `layer1_bias`. This represents the ReLU activation function.

 

 

 

 

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s