LTFN 4: Intro to Convolutional Neural Networks

Part of the series Learn TensorFlow Now

The neural networks we’ve built so far have had a relatively simple structure. The input to each layer is fully connected to the output of the previous layer. For this reason, these layers are commonly called fully connected layers.

Two fully connected layers in a neural network.

This has been mathematically convenient because we’ve been able to represent each layer’s output as a matrix multiplication of the previous layer’s output (a vector) with the current layer’s weights.

However, as we build more complex networks for image recognition, there are certain properties we want that are difficult to get from fully connected layers. Some of these properties include:

  1. Translational Invariance – A fancy phrase for “A network trained to recognize cats should recognize cats equally well if they’re in the top left of the picture or the bottom right of the picture”. If we move the cat around the image, we should still expect to recognize it.

    Translational invariance suggests we should recognize objects regardless of where they’re located in the image.
  2. Local Connectivity – This means that we should take advantage of features within a certain area of the image. Remember that in previous posts we treated the input as a single row of pixels. This meant that local features (e.g. edges, curves, loops) are very hard for our networks to identify and pick out. Ideally our network should try to identify patterns than occur within local regions of the image and use these patterns to influence its predictions.


Today we’re going look at one of the most successful classes of neural networks: Convolutional Neural Networks. Convolutional Neural Networks have been shown to give us both translational invariance and local connectivity.

The building block of a convolutional neural network is a convolutional filter. It is a square (typically 3x3) set of weights. The convolutional filter looks at pieces of the input of the same shape. As it does, it takes the dot product of the weights with the input and saves the result in the output. The convolutional filter is dragged along the entire input until the entire input has been covered. Below is a simple example with a (random) 5x5 input and a (random) 3x3 filter.


So why is this useful? Consider the following examples with a vertical line in the input and a 3×3 filter with weights chosen specifically to detect vertical edges.

Vertical edge detection from light-to-dark.
Vertical edge detection from dark-to-light.

We can see that with hand-picked weights, we’re able to generate patterns in the output. In this example, light-to-dark transitions produce large positive values while dark-to-light transitions produce large negative values. Where there is no change at all, the filter will simply produce zeroes.

While we’ve chosen the above filter’s weights manually, it turns out that training our network via gradient descent ends up selecting very good weights for these filters. As we add more convolutional layers to our network they begin to be able to recognize more abstract concepts such as faces, whiskers, wheels etc.



You may have noticed that the output above has a smaller width and height than the original input. If we pass this output to another convolutional layer it will continue to shrink. Without dealing with this shrinkage, we’ll find that this puts an upper bound on how many convolutional layers we can have in our network.

SAME Padding

The most common way to deal with this shrinkage is to pad the entire image with enough zeroes such that the output shape will have the same width and height as the input. This is called SAME padding and allows us to continue passing the output to more and more convolutional layers without worrying about shrinking width and height dimensions. Below we take our first example (5×5 input) and pad it with zeroes to make sure the output is still 5×5.


A 5x5 input padded with zeroes to generate a 5x5 output.

VALID Padding

VALID padding does not pad the input with anything. It probably would have made more sense to call it NO padding or NONE padding.

VALID padding results in shrinkage in width and height.



So far we’ve been moving the convolutional filter across the input one pixel at a time. In other words, we’ve been using a stride=1. Stride refers to the number of pixels we move the filter in the width and height dimension every time we compute a dot-product.  The most common stride value is stride=1, but certain algorithms require larger stride values. Below is an example using stride=2.

Notice that larger stride values result in larger decreases in output height and width. Occasionally this is desirable near the start of a network when working with larger images. Smaller input width and height can make the calculations more manageable in deeper layers of the network.


Input Depth

In our previous examples we’ve been working with inputs that have variable height and width dimensions, but no depth dimension. However, some images (e.g. RGB) have depth, and we need some way to account for it. The key is to extend our filter’s depth dimension to match the depth dimension of the input.

Unfortunately, I lack the animation skills to properly show an animated example of this, but the following image may help:

Convolution over an input with a depth of 2 using a single filter with a depth of 2.

Above we have an input of size 5x5x2 and a single filter of size 3x3x2. The filter is dragged across the input and once again the dot product is taken at each point. The difference here is that there are 18 values being added up at each point (9 from each depth of the input image). The result is an output with a single depth dimension.


Output Depth

We can also control the output depth by stacking up multiple convolutional filters. Each filter acts independently of one another while computing its results and then all of the results are stacked together to create the ouptut. This means we can control output depth simply by adding or removing convolutional filters.

Two convolutional filters result in a output depth of two.

It’s very important to note that there are two distinct convolutional filters above. The weights of each convolutional filter are distinct from the weights of the other convolutional filter. Each of these two filters has a shape of 3x3x2. If we wanted to get a deeper output, we could continue stacking more of these 3x3x2 filters on top of one another.

Imagine for a moment that we stacked four convolutional filters on top of one another, each with a set of weights trained to recognize different patterns. One might recognize horizontal edges, one might recognize vertical edges, one might recognize diagonal edges from top-left to bottom-right and one might recognize diagonal edges from bottom-left to top-right. Each of these filters would produce one depth layer of the output with values where their respective edges were detected. Later layers of our network would be able to act on this information and build up even more complex representations of the input.


Next up

There is a lot to process in this post. We’ve seen a brand new building block for our neural networks called the convolutional filter and a myriad of ways to customize it. In the next post we’ll implement our first convolutional neural network in TensorFlow and try to better understand practical ways to use this building block to build a better digit recognizer.

LTFN 3: Deeper Networks

Part of the series Learn TensorFlow Now

In the last post, we saw our network achieve about 60% accuracy. One common way to improve a neural network’s performance is to make it deeper. Before we start adding layers to our network, it’s worth taking a moment to explore one of the key advantages of deep neural networks.

Historically, a lot of effort was invested in crafting hand-engineered features that could be fed to shallow networks (or other learning algorithms). In image detection we might modify the input to highlight horizontal or vertical edges. In voice recognition we might filter out noise or various frequencies not typically found in human speech. Unfortunately, hand-engineering features often required years of expertise and lots of time.

Below is a network created with TensorFlow Playground that demonstrates this point. By feeding modified versions of the input to a shallow network, we are able to train it to recognize a non-linear spiral pattern.

A shallow network requires various modifications to the input features to classify the “Swiss Roll” problem.

A shallow network is capable of learning complex patterns only when fed modified versions of the input. A key idea behind deep learning is to do away with hand-engineered features whenever possible. Instead, by making the network deeper, we can convince the network to learn the features it really needs to solve the problem. In image recognition, the first few layers of the network learn to recognize simple features (eg. edge detection), while deeper layers respond to more complex features (eg. human faces). Below, we’ve made the network deeper and removed all dependencies on additional features.

A deep network is capable of classifying the points in a “Swiss Roll” using only the original input.


Making our network deeper

Let’s try making our network deeper by adding two more layers. We’ll replace layer1_weights and layer1_bias with the following:

Note: When discussing the network’s shapes, I ignore the batch dimension. For example, where a shape is [None, 784] I will refer to it as a vector with 784 elements. I find it helps to imagine a batch size of 1 to avoid having to think about more complex shapes.

The first thing to notice is the change in shape. layer1 now accepts an input of 784 values and produces an intermediate vector layer1_output with 500 elements. We then take these 500 values through layer2 which also produces an intermediate vector layer2_output with 500 elements. Finally, we take these 500 values through layer3 and produce our logit vector with 10 elements.

Why did I choose 500 elements? No reason, it was just an arbitrary value that seemed to work. If you’re following along at home, you could try adding more layers or making them wider (ie. use a size larger than 500).


Another important change is the addition of tf.nn.relu() in layer1 and layer2. Note that it is applied to the result of the matrix multiplication of the previous layer’s output with the current layer’s weights.

So what is a ReLU? ReLU stands for “Rectified Linear Unit” and is an activation function. An activation function is applied to the output of each layer of a neural network. It turns out that if we don’t include activation functions, it can be mathematically shown (by people much smarter than me) that our three layer network is equivalent to a single layer network. This is obviously a BadThing™ as it means we lose all the advantages of building a deep neural network.

I’m (very obviously) glossing over the details here, so if you’re new to neural networks and want to learn more see: Why do you need non-linear activation functions?

Other historical activation functions include sigmoid and tanh. These days, ReLU is almost always the right choice of activation function and we’ll be using it exclusively for our networks.

Graphs for ReLU, sigmoid and tanh functions


Learning Rate

Finally, one other small change needs to be made: The learning rate needs to be changed from 0.01 to 0.0001. Learning rate is one of the most important, but most finicky hyperparameters to choose when training your network. Too small and the network takes a very long time to train, too large and your network doesn’t converge. In later posts we’ll look at methods that can help with this, but for now I’ve just used the ol’ fashioned “Guess and Check” method until I found a learning rate that worked well.


Alchemy of Hyperparameters

We’ve started to see a few hyperparameters that we must choose when building a neural network:

  • Number of layers
  • Width of layers
  • Learning rate

It’s an uncomfortable reality that we have no good way to choose values for these hyperparameters. What’s worse is that we typically can’t explain why a certain hyperparameter value works well and others do not. The only reassurance I can offer is:

  1. Other people think this is a problem
  2. As you build more networks, you’ll develop a rough intuition for choosing hyperparameter values


Putting it all together

Now that we’ve chosen a learning rate and created more intermediate layers, let’s put it all together and see how our network performs.




After running this code you should see output similar to:

Cost:  4596.864
Accuracy:  7.999999821186066 %
Cost:  882.4881
Accuracy:  30.000001192092896 %
Cost:  609.4177
Accuracy:  51.99999809265137 %
Cost:  494.5303
Accuracy:  56.00000023841858 %


Cost:  57.793114
Accuracy:  89.99999761581421 %
Cost:  148.92995
Accuracy:  81.00000023841858 %
Cost:  67.42319
Accuracy:  89.99999761581421 %
Test Cost:  107.98408660641905
Test accuracy:  85.74999994039536 %


Our network has improved from 60% accuracy to 85% accuracy. This is great progress, clearly things are moving in the right direction! Next week we’ll look at a more complicated neural network structure called a “Convolutional Neural Network” which is one of the basic building blocks of today’s top image classifiers.

For the sake of completeness, I’ve included a TensorBoard visualization of the network we’ve created below:

Visualization of our three-layer network with `layer1` expanded. Notice the addition of `layer1_output` following the addition with `layer1_bias`. This represents the ReLU activation function.





LTFN 2: Graphs and Shapes

Part of the series Learn TensorFlow Now

TensorFlow Graphs

Before we improve our network, we have to take a moment to chat about TensorFlow graphs. As we saw in the previous post, we follow two steps when using TensorFlow:

  1. Create a computational graph
  2. Run data through the graph using

Let’s take a look at what’s actually happening when we call Consider our graph and session code from last time:

When we pass optimizer and cost to, TensorFlow looks at the dependencies for these two nodes. For example, we can see above that optimizer depends on:

  • cost
  • layer1_weights
  • layer1_bias
  • input

We can also see that cost depends on:

  • logits
  • labels

When we wish to evaluate optimizer and cost, TensorFlow first runs all the operations defined by the previous nodes, then calculates the required results and returns them. Since every node ends up being a dependency of optimizer and cost, this means that every operation in our TensorFlow graph is executed with every call to

But what if we don’t want to run every operation? If we want to pass test data to our network, we don’t want to run the operations defined by optimizer. (After all, we don’t want to train our network on our test set!) Instead, we’d just want to extract predictions from logits. In that case, we could instead run our network as follows:

This would execute only the subset of nodes required to compute the values of logits, highlighted below:

Our computational graph with only dependencies of logits highlighted in orange.

Note: As labels is not one of the dependencies of logits we don’t need to provide it.

Understanding the dependencies of the computational graphs we create is important. We should always try to be aware of exactly what operations will be running when we call to avoid accidentally running the wrong operations.



Another important topic to understand is how TensorFlow shapes work. In our previous post all our shapes were completely defined. Consider the following tf.Placeholders for input and labels:

We have defined these tensors to have a 2-D shape of precisely (100, 784) and (100, 10). This restricts us to a computational graph that always expects 100 images at a time. What if we have a training set that isn’t divisible by 100? What if we want to test on single images?

The answer is to use dynamic shapes. In places where we’re not sure what shape we would like to support, we just substitute in None. For example, if we want to allow variable batch sizes, we simply write:

Now we can pass in batch sizes of 1, 10, 283 or any other size we’d like. From this point on, we’ll be defining all of our tf.Placeholders in this fashion.



One important question remains: “How well is our network doing?“. In the previous post, we saw cost decreasing, but we had no concrete metric against which we could compare our network. We’ll keep things simple and use accuracy as our metric. We just want to measure the average number of correction predictions:

In the first line, we convert logits to a set of predictions using tf.nn.softmax. Remember that our labels are 1-hot encoded, meaning each one contains 10 numbers, one of which is 1. logits is the same shape, but the values in logits can be almost anything. (eg. values in logits could be -4, 234, 0.5 and so on). We want our predictions to have a few qualities that logits does not possess:

  1. The sum of the values in predictions for a given image should be 1
  2. No values in predictions should be greater than 1
  3. No values in predictions should be negative
  4. The highest value in predictions will be our prediction for a given image. (We can use argmax to find this)

Applying tf.nn.softmax() to logits gives us these desired properties. For more details on softmax, watch this video by Andrew Ng.

The second line takes the argmax of our predictions and of our labels. Then tf.equal creates a vector that contains either True (when the values match) and False when the values don’t match.

Finally, we use tf.reduce_mean to calculate the average number of times we get the prediction correct for this batch. We store this result in accuracy.

Putting it all together

Now that we better understand TensorFlow graphs, shape and have a metric with which to judge our algorithm, let’s put it all together to evaluate our performance on the test set, after training has finished.

Note that almost all of the new code relates to running the test set.

One question you might ask is: Why not just predict all the test images at once, in one big batch of 10,000? The problem is that when we train larger networks on our GPU, we won’t be able to fit all 10,000 images and the required operations in our GPU’s memory. Instead we have to process the test set in batches similar to how we train the network.

Finally, let’s run it and look at the output. When I run it on my local machine I receive the following:

Cost:  20.207457
Accuracy:  7.999999821186066 %
Cost:  10.040323
Accuracy:  14.000000059604645 %
Cost:  8.528659
Accuracy:  14.000000059604645 %
Cost:  6.8867884
Accuracy:  23.999999463558197 %
Cost:  7.1556334
Accuracy:  21.99999988079071 %
Cost:  6.312024
Accuracy:  28.00000011920929 %
Cost:  4.679361
Accuracy:  34.00000035762787 %
Cost:  5.220028
Accuracy:  34.00000035762787 %
Cost:  5.167577
Accuracy:  23.999999463558197 %
Cost:  3.5488296
Accuracy:  40.99999964237213 %
Cost:  3.2974648
Accuracy:  43.00000071525574 %
Cost:  3.532155
Accuracy:  46.99999988079071 %
Cost:  2.9645846
Accuracy:  56.00000023841858 %
Cost:  3.0816755
Accuracy:  46.99999988079071 %
Cost:  3.0201495
Accuracy:  50.999999046325684 %
Cost:  2.7738256
Accuracy:  60.00000238418579 %
Cost:  2.4169116
Accuracy:  55.000001192092896 %
Cost:  1.944017
Accuracy:  60.00000238418579 %
Cost:  3.5998762
Accuracy:  50.0 %
Cost:  2.8526196
Accuracy:  55.000001192092896 %
Test Cost:  2.392377197146416
Test accuracy:  59.48999986052513 %
Press any key to continue . . .

So we’re getting a test accuracy of ~60%. This is better than chance, but it’s not as good as we’d like it to be. In the next post, we’ll look at different ways of improving the network.

LTFN 1: Intro to TensorFlow

Part of the series Learn TensorFlow Now

Over the next few posts, we’ll build a neural network that accurately reads handwritten digits. We’ll go step-by-step, starting with the basics of TensorFlow and ending up with one of the best networks in the ILSCRC 2013 image recognition competition.

MNIST Dataset

The MNIST dataset is one of the simplest image datasets and makes for a perfect starting point. It consists of 70,000 images of handwritten digits. Our goal is to build a neural network that can identify the digit in a given image.

  • 60,000 images in the training set
  • 10,000 images in the test set
  • Size: 28×28 (784 pixels)
  • 1 Channel (ie. not RGB)
Sample images from MNIST

To start, we’ll import TensorFlow and our dataset:

TensorFlow makes it easy for us to download the MNIST dataset and save it locally. Our data has been split into a training set on which our network will learn and a test set against which we’ll check how well we’ve done.

Note: The labels are represented using one-hot encoding which means:

0 is represented by 1 0 0 0 0 0 0 0 0 0
1 is represented by 0 1 0 0 0 0 0 0 0 0

9 is represented by 0 0 0 0 0 0 0 0 0 1

Note: By default, the images are represented as arrays of 784 values. Below is a sample of what this might look like for a given image:

TensorFlow Graphs

There are two steps to follow when training our own neural networks with TensorFlow:

  1. Create a computational graph
  2. Run data through the graph so our network can learn or make predictions

Creating a Computational Graph

We’ll start by creating the simplest possible computational graph. Notice in the following code that there is nothing that touches the actual MNIST data. We are simply creating a computational graph so that we may later feed our data to it.

For first-time TensorFlow users there’s a lot to unpack in the next few lines, so we’ll take it slow.

Before explaining anything, let’s take a quick look at the network we’ve created. Below are two different visualizations of this network at different granularities that tell slightly different stories about what we’ve created.

Left: A functional visualization of our single layer network. The 784 input values are each multiplied by a weight which feeds into our ten logits.
Right: The graph created by TensorFlow, including nodes that represent our optimizer and cost.The first two lines of our code simply define a TensorFlow graph and tell TensorFlow that all the following operations we define should be included in this graph.

Next, we use tf.Placeholder to create two “Placeholder” nodes in our graph. These are nodes for which we’ll provide values every time we run our network. Our placeholders are:

  • input which will contain batches of 100 images, each with 784 values
  • labels which will contain batches of 100 labels, each with 10 values

Next we use tf.Variable to create two new nodes, layer1_weights and layer1_biases. These represent parameters that the network will adjust as we show it more and more examples. To start, we’ve made layer1_weights completely random, and layer1_biases all zero. As we learn more about neural networks, we’ll see that these aren’t the greatest choice, but they’ll work for now.

After creating our weights, we’ll combine them using tf.matmul to matrix multiply them against our input and + to add this result to our bias. You should note that + is simply a convenient shorthand for tf.add.  We store the result of this operation in logits and will consider the output node with the highest value to be our network’s prediction for a given example.

Now that we’ve got our predictions, we want to compare them to the labels and determine how far off we were. We’ll do this by taking the softmax of our output and then use cross entropy as our measure of “loss” or cost. We can perform both of these steps using tf.nn.softmax_cross_entropy_with_logits. Now we’ve got a measure of loss for all the examples in our batch, so we’ll just take the mean of these as our final cost.

The final step is to define an optimizer. This creates a node that is responsible for automatically updating the tf.Variables (weights and biases) of our network in an effort to minimize cost. We’re going to use the vanilla of optimizers: tf.train.GradientDescentOptimizer. Note that we have to provide a learning_rate to our optimizer. Choosing an appropriate learning rate is one of the difficult parts of training any new network. For now we’ll arbitrarily use 0.01 because it seems to work reasonably well.


Running our Neural Network

Now that we’ve created the network it’s time to actually run it. We’ll pass 100 images and labels to our network and watch as the cost decreases.

The first line creates a TensorFlow Session for our graph. The session is used to actually run the operations defined in our graph and produce results for us.

The second line initializes all of our tf.Variables. In our example, this means choosing random values for layer1_weights and setting layer1_bias to all zeros.

Next, we create a loop that will run for 1,000 training steps with a batch_size of 100. The first three lines of the loop simply select out 100 images and labels at a time. We store batch_images and batch_labels in feed_dict. Note that the keys of this dictionary input and labels correspond to the tf.Placeholder nodes we defined when creating our graph. These names must match, and all placeholders must have a corresponding entry in feed_dict.

Finally, we run the network using where we pass in feed_dict. Notice that we also pass in optimizer and cost. This tells TensorFlow to evaluate these nodes and to store the results from the current run in o and c. In the next post, we’ll touch more on this method, and how TensorFlow executes operations based on the nodes we supply to it here.


Now that we’ve put it all together, let’s look at the (truncated) output:

 Cost: 12.673884
 Cost: 11.534428
 Cost: 8.510129
 Cost: 9.842179
 Cost: 11.445622
 Cost: 8.554568
 Cost: 9.342157
 Cost: 4.811098
 Cost: 4.2431364
 Cost: 3.4888883
 Cost: 3.8150232
 Cost: 4.206609
 Cost: 3.2540445

Clearly the cost is going down, but we still have many unanswered questions:

  • What is the accuracy of our trained network?
  • How do we know when to stop training? Was 1,000 steps enough?
  • How can we improve our network?
  • How can we see what its predictions actually were?

We’ll explore these questions in the next few posts as we seek to improve our performance.

Complete source