# LTFN 4: Intro to Convolutional Neural Networks

Part of the series Learn TensorFlow Now

The neural networks we’ve built so far have had a relatively simple structure. The input to each layer is fully connected to the output of the previous layer. For this reason, these layers are commonly called fully connected layers.

This has been mathematically convenient because we’ve been able to represent each layer’s output as a matrix multiplication of the previous layer’s output (a vector) with the current layer’s weights.

However, as we build more complex networks for image recognition, there are certain properties we want that are difficult to get from fully connected layers. Some of these properties include:

1. Translational Invariance – A fancy phrase for “A network trained to recognize cats should recognize cats equally well if they’re in the top left of the picture or the bottom right of the picture”. If we move the cat around the image, we should still expect to recognize it. Translational invariance suggests we should recognize objects regardless of where they’re located in the image.
2. Local Connectivity – This means that we should take advantage of features within a certain area of the image. Remember that in previous posts we treated the input as a single row of pixels. This meant that local features (e.g. edges, curves, loops) were very hard for our networks to identify and pick out. Ideally our network should try to identify patterns that occur within local regions of the image and use these patterns to influence its predictions.

Today we’re going to look at one of the most successful classes of neural networks: Convolutional Neural Networks. Convolutional Neural Networks have been shown to give us both translational invariance and local connectivity.

The building block of a convolutional neural network is a convolutional filter. It is a square (typically `3x3`) set of weights. The convolutional filter looks at pieces of the input of the same shape. As it does, it takes the dot product of the weights with the input and saves the result in the output. The convolutional filter is dragged along the entire input until the entire input has been covered. Below is a simple example with a (random) `5x5` input and a (random) `3x3` filter.
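The sliding dot product can be written out in a few lines of plain Python. This is just an illustrative sketch with made-up input and filter values, not the TensorFlow implementation:

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image`, taking the dot product at each position."""
    k = len(kernel)
    out_size = len(image) - k + 1  # the output shrinks by k - 1
    output = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            total = 0
            for di in range(k):
                for dj in range(k):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output

# A (made-up) 5x5 input and a (made-up) 3x3 filter produce a 3x3 output
image = [[1, 2, 0, 1, 3],
         [0, 1, 1, 2, 0],
         [2, 0, 1, 0, 1],
         [1, 1, 0, 2, 2],
         [0, 2, 1, 1, 0]]
kernel = [[1, 0, 1],
          [0, 1, 0],
          [1, 0, 1]]
result = conv2d_valid(image, kernel)
print(len(result), len(result[0]))  # 3 3
```

Note how the filter visits every position where it fits entirely inside the input, which is why the output is smaller than the input.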

So why is this useful? Consider the following examples with a vertical line in the input and a 3×3 filter with weights chosen specifically to detect vertical edges.

We can see that with hand-picked weights, we’re able to generate patterns in the output. In this example, light-to-dark transitions produce large positive values while dark-to-light transitions produce large negative values. Where there is no change at all, the filter will simply produce zeroes.
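We can sketch this effect in plain Python with a hypothetical image that is light on the left and dark on the right (the pixel values and filter weights here are chosen for illustration):

```python
def conv2d_valid(image, kernel):
    k = len(kernel)
    n = len(image) - k + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(k) for dj in range(k))
             for j in range(n)] for i in range(n)]

# Light pixels (1) on the left, dark pixels (0) on the right
image = [[1, 1, 0, 0, 0]] * 5

# Hand-picked weights that respond to vertical light-to-dark transitions
kernel = [[1, 0, -1],
          [1, 0, -1],
          [1, 0, -1]]

edges = conv2d_valid(image, kernel)
for row in edges:
    print(row)  # every row is [3, 3, 0]: large values at the edge, zero where flat
```

A dark-to-light image (the mirror of this one) would produce large negative values instead, matching the behaviour described above.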

While we’ve chosen the above filter’s weights manually, it turns out that training our network via gradient descent ends up selecting very good weights for these filters. As we add more convolutional layers to our network they begin to be able to recognize more abstract concepts such as faces, whiskers, wheels etc.

You may have noticed that the output above has a smaller width and height than the original input. If we pass this output to another convolutional layer, it will continue to shrink. Unless we deal with this shrinkage, it puts an upper bound on how many convolutional layers we can have in our network.

The most common way to deal with this shrinkage is to pad the entire image with enough zeroes such that the output shape will have the same width and height as the input. This is called `SAME` padding and allows us to continue passing the output to more and more convolutional layers without worrying about shrinking width and height dimensions. Below we take our first example (5×5 input) and pad it with zeroes to make sure the output is still 5×5.

`VALID` padding does not pad the input with anything. It probably would have made more sense to call it `NO` padding or `NONE` padding.
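The two padding modes can be summarized by TensorFlow's output-size formulas, sketched here in plain Python:

```python
import math

def output_size(input_size, filter_size, stride=1, padding="VALID"):
    """Output width/height of a convolution, following TensorFlow's conventions."""
    if padding == "SAME":
        # Pad with enough zeroes that only the stride shrinks the output
        return math.ceil(input_size / stride)
    # VALID: count only positions where the filter fits entirely inside the input
    return math.ceil((input_size - filter_size + 1) / stride)

print(output_size(5, 3, padding="VALID"))  # 3: the output shrinks
print(output_size(5, 3, padding="SAME"))   # 5: zero-padding preserves the size
```

With `SAME` padding and `stride=1`, the output size never depends on the filter size, which is exactly what lets us stack convolutional layers freely.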

## Stride

So far we’ve been moving the convolutional filter across the input one pixel at a time. In other words, we’ve been using a `stride=1`. Stride refers to the number of pixels we move the filter in the width and height dimension every time we compute a dot-product.  The most common stride value is `stride=1`, but certain algorithms require larger stride values. Below is an example using `stride=2`.
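One way to see the effect of stride is to list the top-left positions the filter actually visits along one dimension. A quick sketch for our 5x5 input and 3x3 filter:

```python
input_size, filter_size = 5, 3

# Top-left positions a 3x3 filter visits along one dimension of a 5x5 input
positions = {stride: list(range(0, input_size - filter_size + 1, stride))
             for stride in (1, 2)}
print(positions[1])  # [0, 1, 2] -> a 3x3 output
print(positions[2])  # [0, 2]    -> a 2x2 output
```

Doubling the stride roughly halves the output width and height, which is the shrinkage discussed below.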

Notice that larger stride values result in larger decreases in output height and width. Occasionally this is desirable near the start of a network when working with larger images. Smaller input width and height can make the calculations more manageable in deeper layers of the network.

## Input Depth

In our previous examples, we’ve been working with inputs that have height and width dimensions, but no depth dimension. However, some images (e.g. RGB) have depth, and we need some way to account for it. The key is to extend our filter’s depth dimension to match the depth dimension of the input.

Unfortunately, I lack the animation skills to properly show an animated example of this, but the following image may help: Convolution over an input with a depth of 2 using a single filter with a depth of 2.

Above we have an input of size `5x5x2` and a single filter of size `3x3x2`. The filter is dragged across the input and once again the dot product is taken at each point. The difference here is that there are `18` values being added up at each point (`9` from each depth of the input image). The result is an output with a single depth dimension.
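Here is a plain-Python sketch of the dot product at a single position, using an all-ones input and filter so the `18` summed values are easy to count:

```python
def conv3d_point(volume, kernel, i, j):
    """Dot product of a 3x3x2 filter with the 3x3x2 patch of input at (i, j)."""
    total = 0
    for di in range(3):
        for dj in range(3):
            for d in range(2):  # sum across the depth dimension too
                total += volume[i + di][j + dj][d] * kernel[di][dj][d]
    return total

# A 5x5x2 input of all ones and a 3x3x2 filter of all ones
volume = [[[1, 1] for _ in range(5)] for _ in range(5)]
kernel = [[[1, 1] for _ in range(3)] for _ in range(3)]

# 9 values from each depth slice, 18 in total, each equal to 1
print(conv3d_point(volume, kernel, 0, 0))  # 18
```

Because the depth dimension is summed away at every position, a single filter always produces an output with a depth of one, regardless of the input depth.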

## Output Depth

We can also control the output depth by stacking up multiple convolutional filters. Each filter computes its result independently of the others, and then all of the results are stacked together to create the output. This means we can control output depth simply by adding or removing convolutional filters. Two convolutional filters result in an output depth of two.

It’s very important to note that there are two distinct convolutional filters above. The weights of each convolutional filter are distinct from the weights of the other convolutional filter. Each of these two filters has a shape of `3x3x2`. If we wanted to get a deeper output, we could continue stacking more of these `3x3x2` filters on top of one another.
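A sketch with two hypothetical `3x3x2` filters whose weights are chosen so that each responds to a different depth slice of the input:

```python
# Two hypothetical 3x3x2 filters with distinct weights
filters = [
    [[[1, 0]] * 3] * 3,   # filter A only "sees" depth slice 0
    [[[0, 1]] * 3] * 3,   # filter B only "sees" depth slice 1
]

# A 5x5x2 input: slice 0 is all 2s, slice 1 is all 5s
volume = [[[2, 5] for _ in range(5)] for _ in range(5)]

def apply_filter(kernel, i, j):
    return sum(volume[i + di][j + dj][d] * kernel[di][dj][d]
               for di in range(3) for dj in range(3) for d in range(2))

# Each filter contributes one depth slice of the output
output_depths = [apply_filter(k, 0, 0) for k in filters]
print(output_depths)  # [18, 45]: two filters -> an output depth of two
```

In practice the weights would be learned rather than hand-picked like this, but the stacking mechanics are the same.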

Imagine for a moment that we stacked four convolutional filters on top of one another, each with a set of weights trained to recognize different patterns. One might recognize horizontal edges, one might recognize vertical edges, one might recognize diagonal edges from top-left to bottom-right and one might recognize diagonal edges from bottom-left to top-right. Each of these filters would produce one depth layer of the output with values where their respective edges were detected. Later layers of our network would be able to act on this information and build up even more complex representations of the input.

## Next up

There is a lot to process in this post. We’ve seen a brand new building block for our neural networks called the convolutional filter and a myriad of ways to customize it. In the next post we’ll implement our first convolutional neural network in TensorFlow and try to better understand practical ways to use this building block to build a better digit recognizer.

# LTFN 3: Deeper Networks

Part of the series Learn TensorFlow Now

In the last post, we saw our network achieve about 60% accuracy. One common way to improve a neural network’s performance is to make it deeper. Before we start adding layers to our network, it’s worth taking a moment to explore one of the key advantages of deep neural networks.

Historically, a lot of effort was invested in crafting hand-engineered features that could be fed to shallow networks (or other learning algorithms). In image detection we might modify the input to highlight horizontal or vertical edges. In voice recognition we might filter out noise or various frequencies not typically found in human speech. Unfortunately, hand-engineering features often required years of expertise and lots of time.

Below is a network created with TensorFlow Playground that demonstrates this point. By feeding modified versions of the input to a shallow network, we are able to train it to recognize a non-linear spiral pattern. A shallow network requires various modifications to the input features to classify the “Swiss Roll” problem.

A shallow network is capable of learning complex patterns only when fed modified versions of the input. A key idea behind deep learning is to do away with hand-engineered features whenever possible. Instead, by making the network deeper, we can convince the network to learn the features it really needs to solve the problem. In image recognition, the first few layers of the network learn to recognize simple features (eg. edge detection), while deeper layers respond to more complex features (eg. human faces). Below, we’ve made the network deeper and removed all dependencies on additional features. A deep network is capable of classifying the points in a “Swiss Roll” using only the original input.

## Making our network deeper

Let’s try making our network deeper by adding two more layers. We’ll replace `layer1_weights` and `layer1_bias` with the following:

Note: When discussing the network’s shapes, I ignore the batch dimension. For example, where a shape is `[None, 784]` I will refer to it as a vector with 784 elements. I find it helps to imagine a batch size of 1 to avoid having to think about more complex shapes.

The first thing to notice is the change in shape. `layer1` now accepts an input of 784 values and produces an intermediate vector `layer1_output` with 500 elements. We then take these 500 values through `layer2`, which also produces an intermediate vector `layer2_output` with 500 elements. Finally, we take these 500 values through `layer3` and produce our `logits` vector with 10 elements.

Why did I choose 500 elements? No reason; it was just an arbitrary value that seemed to work. If you’re following along at home, you could try adding more layers or making them wider (i.e. using a size larger than 500).
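To make the shapes concrete, here is a plain-Python sketch of the same forward pass with tiny stand-in sizes (3 inputs, 4 hidden units and 2 outputs instead of 784, 500 and 10; biases are omitted for brevity):

```python
import random

def matmul(vec, weights):
    """Multiply a row vector by a weight matrix (one matrix column per output)."""
    return [sum(v * w for v, w in zip(vec, col)) for col in zip(*weights)]

def relu(vec):
    return [max(0.0, v) for v in vec]

random.seed(0)
n_in, n_hidden, n_out = 3, 4, 2  # stand-ins for 784, 500 and 10
w1 = [[random.gauss(0, 0.1) for _ in range(n_hidden)] for _ in range(n_in)]
w2 = [[random.gauss(0, 0.1) for _ in range(n_hidden)] for _ in range(n_hidden)]
w3 = [[random.gauss(0, 0.1) for _ in range(n_out)] for _ in range(n_hidden)]

x = [0.5, 0.2, 0.9]                              # one input example
layer1_output = relu(matmul(x, w1))              # 3 values -> 4 values
layer2_output = relu(matmul(layer1_output, w2))  # 4 values -> 4 values
logits = matmul(layer2_output, w3)               # 4 values -> 2 values
print(len(layer1_output), len(layer2_output), len(logits))  # 4 4 2
```

The real network does exactly this, just with `tf.matmul` on much larger matrices and a batch dimension out front.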

## ReLU

Another important change is the addition of `tf.nn.relu()` in `layer1` and `layer2`. Note that it is applied to the result of the matrix multiplication of the previous layer’s output with the current layer’s weights.

So what is a ReLU? ReLU stands for “Rectified Linear Unit” and is an activation function. An activation function is applied to the output of each layer of a neural network. It turns out that if we don’t include activation functions, it can be mathematically shown (by people much smarter than me) that our three-layer network is equivalent to a single-layer network. This is obviously a BadThing™ as it means we lose all the advantages of building a deep neural network.
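ReLU itself is tiny: it passes positive values through unchanged and clamps negative values to zero.

```python
def relu(x):
    """Rectified Linear Unit: max(0, x)."""
    return max(0.0, x)

print([relu(v) for v in [-2.0, -0.5, 0.0, 1.5, 3.0]])  # [0.0, 0.0, 0.0, 1.5, 3.0]
```

That single kink at zero is what makes the function non-linear, and it is all the network needs to avoid collapsing into a single layer.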

I’m (very obviously) glossing over the details here, so if you’re new to neural networks and want to learn more see: Why do you need non-linear activation functions?

Other historical activation functions include sigmoid and tanh. These days, ReLU is almost always the right choice of activation function and we’ll be using it exclusively for our networks.

## Learning Rate

Finally, one other small change needs to be made: the learning rate needs to be changed from `0.01` to `0.0001`. The learning rate is one of the most important, but most finicky, hyperparameters to choose when training your network. Too small, and the network takes a very long time to train; too large, and your network doesn’t converge. In later posts we’ll look at methods that can help with this, but for now I’ve just used the ol’ fashioned “Guess and Check” method until I found a learning rate that worked well.

## Alchemy of Hyperparameters

We’ve started to see a few hyperparameters that we must choose when building a neural network:

• Number of layers
• Width of layers
• Learning rate

It’s an uncomfortable reality that we have no good way to choose values for these hyperparameters. What’s worse is that we typically can’t explain why a certain hyperparameter value works well and others do not. The only reassurance I can offer is:

1. Other people think this is a problem
2. As you build more networks, you’ll develop a rough intuition for choosing hyperparameter values

## Putting it all together

Now that we’ve chosen a learning rate and created more intermediate layers, let’s put it all together and see how our network performs.

After running this code you should see output similar to:

```
Cost:  4596.864
Accuracy:  7.999999821186066 %
Cost:  882.4881
Accuracy:  30.000001192092896 %
Cost:  609.4177
Accuracy:  51.99999809265137 %
Cost:  494.5303
Accuracy:  56.00000023841858 %

...

Cost:  57.793114
Accuracy:  89.99999761581421 %
Cost:  148.92995
Accuracy:  81.00000023841858 %
Cost:  67.42319
Accuracy:  89.99999761581421 %
Test Cost:  107.98408660641905
Test accuracy:  85.74999994039536 %
```

Our network has improved from 60% accuracy to 85% accuracy. This is great progress, clearly things are moving in the right direction! Next week we’ll look at a more complicated neural network structure called a “Convolutional Neural Network” which is one of the basic building blocks of today’s top image classifiers.

For the sake of completeness, I’ve included a TensorBoard visualization of the network we’ve created below: Visualization of our three-layer network with `layer1` expanded. Notice the `layer1_output` node that follows the addition of `layer1_bias`; this node represents the ReLU activation function.

# LTFN 2: Graphs and Shapes

Part of the series Learn TensorFlow Now

## TensorFlow Graphs

Before we improve our network, we have to take a moment to chat about TensorFlow graphs. As we saw in the previous post, we follow two steps when using TensorFlow:

1. Create a computational graph
2. Run data through the graph using `tf.Session.run()`

Let’s take a look at what’s actually happening when we call `tf.Session.run()`. Consider our graph and session code from last time:

When we pass `optimizer` and `cost` to `session.run()`, TensorFlow looks at the dependencies of these two nodes. For example, we can see above that `optimizer` depends on:

• `cost`
• `layer1_weights`
• `layer1_bias`
• `input`

We can also see that `cost` depends on:

• `logits`
• `labels`

When we wish to evaluate `optimizer` and `cost`, TensorFlow first runs all the operations defined by the previous nodes, then calculates the required results and returns them. Since every node ends up being a dependency of `optimizer` and `cost`, this means that every operation in our TensorFlow graph is executed with every call to `session.run()`.

But what if we don’t want to run every operation? If we want to pass test data to our network, we don’t want to run the operations defined by `optimizer`. (After all, we don’t want to train our network on our test set!) Instead, we’d just want to extract predictions from `logits`. In that case, we could instead run our network as follows:

This would execute only the subset of nodes required to compute the values of `logits`, highlighted below: Our computational graph with only dependencies of logits highlighted in orange.

Note: As `labels` is not one of the dependencies of `logits` we don’t need to provide it.

Understanding the dependencies of the computational graphs we create is important. We should always try to be aware of exactly what operations will be running when we call `session.run()` to avoid accidentally running the wrong operations.

## Shapes

Another important topic to understand is how TensorFlow shapes work. In our previous post, all of our shapes were completely defined. Consider the following `tf.placeholder` definitions for `input` and `labels`:

We have defined these tensors to have a 2-D shape of precisely `(100, 784)` and `(100, 10)`. This restricts us to a computational graph that always expects 100 images at a time. What if we have a training set that isn’t divisible by 100? What if we want to test on single images?

The answer is to use dynamic shapes. In places where we’re not sure what shape we would like to support, we just substitute in `None`. For example, if we want to allow variable batch sizes, we simply write:

Now we can pass in batch sizes of 1, 10, 283 or any other size we’d like. From this point on, we’ll be defining all of our `tf.placeholder` nodes in this fashion.

## Accuracy

One important question remains: “How well is our network doing?”. In the previous post, we saw `cost` decreasing, but we had no concrete metric against which we could compare our network. We’ll keep things simple and use accuracy as our metric. We just want to measure the average number of correct predictions:

In the first line, we convert `logits` to a set of predictions using `tf.nn.softmax`. Remember that our `labels` are 1-hot encoded, meaning each one contains 10 numbers, one of which is 1. `logits` is the same shape, but the values in `logits` can be almost anything. (eg. values in `logits` could be -4, 234, 0.5 and so on). We want our `predictions` to have a few qualities that `logits` does not possess:

1. The sum of the values in `predictions` for a given image should be 1
2. No values in `predictions` should be greater than 1
3. No values in `predictions` should be negative
4. The highest value in `predictions` will be our prediction for a given image. (We can use `argmax` to find this)

Applying `tf.nn.softmax()` to `logits` gives us these desired properties. For more details on softmax, watch this video by Andrew Ng.
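A plain-Python sketch of softmax (using the common subtract-the-max trick so the exponentials don't overflow; values here are the made-up logits from above):

```python
import math

def softmax(logits):
    """Exponentiate then normalize, so outputs are positive and sum to 1."""
    shifted = [v - max(logits) for v in logits]  # subtract max for numerical stability
    exps = [math.exp(v) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]

predictions = softmax([-4.0, 234.0, 0.5])
print(sum(predictions))                     # 1.0
print(predictions.index(max(predictions)))  # 1: the largest logit still wins
```

Notice that softmax never changes which entry is largest, so `argmax` of `predictions` always matches `argmax` of `logits`.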

The second line takes the `argmax` of our predictions and of our labels. Then `tf.equal` creates a vector that contains either `True` (when the values match) or `False` (when they don’t).

Finally, we use `tf.reduce_mean` to calculate the average number of times we get the prediction correct for this batch. We store this result in `accuracy`.
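The whole accuracy calculation can be written out in plain Python, here with a hypothetical batch of four predictions:

```python
def argmax(vec):
    return vec.index(max(vec))

# One-hot labels and softmax predictions for a made-up batch of 4 images
labels = [[0, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
predictions = [[0.1, 0.8, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7], [0.3, 0.6, 0.1]]

# tf.equal produces a vector of booleans; tf.reduce_mean averages them
correct = [argmax(p) == argmax(l) for p, l in zip(predictions, labels)]
accuracy = sum(correct) / len(correct)
print(correct)   # [True, False, True, True]
print(accuracy)  # 0.75
```

The averaging works because Python (like TensorFlow after a cast) treats `True` as 1 and `False` as 0.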

## Putting it all together

Now that we better understand TensorFlow graphs and shapes, and have a metric with which to judge our algorithm, let’s put it all together to evaluate our performance on the test set after training has finished.

Note that almost all of the new code relates to running the test set.

One question you might ask is: Why not just predict all the test images at once, in one big batch of 10,000? The problem is that when we train larger networks on our GPU, we won’t be able to fit all 10,000 images and the required operations in our GPU’s memory. Instead we have to process the test set in batches similar to how we train the network.

Finally, let’s run it and look at the output. When I run it on my local machine I receive the following:

```
Cost:  20.207457
Accuracy:  7.999999821186066 %
Cost:  10.040323
Accuracy:  14.000000059604645 %
Cost:  8.528659
Accuracy:  14.000000059604645 %
Cost:  6.8867884
Accuracy:  23.999999463558197 %
Cost:  7.1556334
Accuracy:  21.99999988079071 %
Cost:  6.312024
Accuracy:  28.00000011920929 %
Cost:  4.679361
Accuracy:  34.00000035762787 %
Cost:  5.220028
Accuracy:  34.00000035762787 %
Cost:  5.167577
Accuracy:  23.999999463558197 %
Cost:  3.5488296
Accuracy:  40.99999964237213 %
Cost:  3.2974648
Accuracy:  43.00000071525574 %
Cost:  3.532155
Accuracy:  46.99999988079071 %
Cost:  2.9645846
Accuracy:  56.00000023841858 %
Cost:  3.0816755
Accuracy:  46.99999988079071 %
Cost:  3.0201495
Accuracy:  50.999999046325684 %
Cost:  2.7738256
Accuracy:  60.00000238418579 %
Cost:  2.4169116
Accuracy:  55.000001192092896 %
Cost:  1.944017
Accuracy:  60.00000238418579 %
Cost:  3.5998762
Accuracy:  50.0 %
Cost:  2.8526196
Accuracy:  55.000001192092896 %
Test Cost:  2.392377197146416
Test accuracy:  59.48999986052513 %
```

So we’re getting a test accuracy of ~60%. This is better than chance, but it’s not as good as we’d like it to be. In the next post, we’ll look at different ways of improving the network.

# LTFN 1: Intro to TensorFlow

Part of the series Learn TensorFlow Now

Over the next few posts, we’ll build a neural network that accurately reads handwritten digits. We’ll go step-by-step, starting with the basics of TensorFlow and ending up with one of the best networks from the ILSVRC 2013 image recognition competition.

## MNIST Dataset

The MNIST dataset is one of the simplest image datasets and makes for a perfect starting point. It consists of 70,000 images of handwritten digits. Our goal is to build a neural network that can identify the digit in a given image.

• 60,000 images in the training set
• 10,000 images in the test set
• Size: 28×28 (784 pixels)
• 1 Channel (ie. not RGB)

To start, we’ll import TensorFlow and our dataset:

TensorFlow makes it easy for us to download the MNIST dataset and save it locally. Our data has been split into a training set on which our network will learn and a test set against which we’ll check how well we’ve done.

Note: The labels are represented using one-hot encoding which means:

`0` is represented by `1 0 0 0 0 0 0 0 0 0`
`1` is represented by `0 1 0 0 0 0 0 0 0 0`

`9` is represented by `0 0 0 0 0 0 0 0 0 1`
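The one-hot scheme above is simple to write out:

```python
def one_hot(digit, num_classes=10):
    """Encode a digit as a vector with a single 1 at the digit's index."""
    encoding = [0] * num_classes
    encoding[digit] = 1
    return encoding

print(one_hot(0))  # [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(one_hot(9))  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
```

This representation lines up one label entry with each of the ten output nodes of the network we're about to build.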

Note: By default, the images are represented as arrays of 784 values. Below is a sample of what this might look like for a given image:

## TensorFlow Graphs

There are two steps to follow when training our own neural networks with TensorFlow:

1. Create a computational graph
2. Run data through the graph so our network can learn or make predictions

### Creating a Computational Graph

We’ll start by creating the simplest possible computational graph. Notice in the following code that there is nothing that touches the actual MNIST data. We are simply creating a computational graph so that we may later feed our data to it.

For first-time TensorFlow users there’s a lot to unpack in the next few lines, so we’ll take it slow.

Before explaining anything, let’s take a quick look at the network we’ve created. Below are two different visualizations of this network at different granularities that tell slightly different stories about what we’ve created. Left: A functional visualization of our single-layer network. The 784 `input` values are each multiplied by a weight which feeds into our ten `logits`. Right: The graph created by TensorFlow, including nodes that represent our `optimizer` and `cost`.

The first two lines of our code simply define a TensorFlow graph and tell TensorFlow that all the following operations we define should be included in this graph.

Next, we use `tf.placeholder` to create two “Placeholder” nodes in our graph. These are nodes for which we’ll provide values every time we run our network. Our placeholders are:

• `input` which will contain batches of 100 images, each with 784 values
• `labels` which will contain batches of 100 labels, each with 10 values

Next we use `tf.Variable` to create two new nodes, `layer1_weights` and `layer1_biases`. These represent parameters that the network will adjust as we show it more and more examples. To start, we’ve made `layer1_weights` completely random, and `layer1_biases` all zero. As we learn more about neural networks, we’ll see that these aren’t the greatest choice, but they’ll work for now.

After creating our weights, we’ll combine them using `tf.matmul` to matrix multiply them against our input and `+` to add this result to our bias. You should note that `+` is simply a convenient shorthand for `tf.add`.  We store the result of this operation in `logits` and will consider the output node with the highest value to be our network’s prediction for a given example.

Now that we’ve got our predictions, we want to compare them to the labels and determine how far off we were. We’ll do this by taking the softmax of our output and then use cross entropy as our measure of “loss” or `cost`. We can perform both of these steps using `tf.nn.softmax_cross_entropy_with_logits`. Now we’ve got a measure of loss for all the examples in our batch, so we’ll just take the mean of these as our final `cost`.

The final step is to define an `optimizer`. This creates a node that is responsible for automatically updating the `tf.Variable` nodes (weights and biases) of our network in an effort to minimize `cost`. We’re going to use the most vanilla of optimizers: `tf.train.GradientDescentOptimizer`. Note that we have to provide a `learning_rate` to our optimizer. Choosing an appropriate learning rate is one of the difficult parts of training any new network; for now we’ll arbitrarily use 0.01 because it seems to work reasonably well.
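Gradient descent itself is easy to sketch on a toy cost function, and the role of the learning rate (including the danger of choosing it too large) shows up even in this tiny example:

```python
# Minimize cost(w) = (w - 3)^2 by hand; its gradient is 2 * (w - 3)
def step(w, learning_rate):
    gradient = 2 * (w - 3)
    return w - learning_rate * gradient

w = 0.0
for _ in range(1000):
    w = step(w, learning_rate=0.01)
print(round(w, 3))  # 3.0: w approaches the minimum of the cost

w_bad = 0.0
for _ in range(10):
    w_bad = step(w_bad, learning_rate=1.5)
print(abs(w_bad - 3) > 100)  # True: the updates overshoot and diverge
```

`tf.train.GradientDescentOptimizer` does exactly this kind of update, except the gradients are computed automatically for every weight and bias in the graph.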

### Running our Neural Network

Now that we’ve created the network it’s time to actually run it. We’ll pass 100 images and labels to our network and watch as the cost decreases.

The first line creates a TensorFlow Session for our `graph`. The `session` is used to actually run the operations defined in our graph and produce results for us.

The second line initializes all of our `tf.Variables`. In our example, this means choosing random values for `layer1_weights` and setting `layer1_bias` to all zeros.

Next, we create a loop that will run for 1,000 training steps with a `batch_size` of 100. The first three lines of the loop simply select out 100 images and labels at a time. We store `batch_images` and `batch_labels` in `feed_dict`. Note that the keys of this dictionary, `input` and `labels`, correspond to the `tf.placeholder` nodes we defined when creating our graph. These names must match, and all placeholders must have a corresponding entry in `feed_dict`.

Finally, we run the network using `session.run` where we pass in `feed_dict`. Notice that we also pass in `optimizer` and `cost`. This tells TensorFlow to evaluate these nodes and to store the results from the current run in `o` and `c`. In the next post, we’ll touch more on this method, and how TensorFlow executes operations based on the nodes we supply to it here.

## Results

Now that we’ve put it all together, let’s look at the (truncated) output:

```
Cost: 12.673884
Cost: 11.534428
Cost: 8.510129
Cost: 9.842179
Cost: 11.445622
Cost: 8.554568
Cost: 9.342157
...
Cost: 4.811098
Cost: 4.2431364
Cost: 3.4888883
Cost: 3.8150232
Cost: 4.206609
Cost: 3.2540445
```

Clearly the cost is going down, but we still have many unanswered questions:

• What is the accuracy of our trained network?
• How do we know when to stop training? Was 1,000 steps enough?
• How can we improve our network?
• How can we see what its predictions actually were?

We’ll explore these questions in the next few posts as we seek to improve our performance.