Part of the series Learn TensorFlow Now
In the last post we looked at the building blocks of a convolutional neural net. The convolution operation works by sliding a filter along the input and taking the dot product at each location to generate an output volume.
The parameters we need to consider when building a convolutional layer are:
1. Padding – Should we pad the input with zeroes?
2. Stride – Should we move the filter more than one pixel at a time?
3. Input depth – Each convolutional filter must have a depth that matches the input depth.
4. Number of filters – We can stack multiple filters to increase the depth of the output.
With this knowledge we can construct our first convolutional neural network. We’ll start by creating a single convolutional layer that operates on a batch of input images of size 28x28x1.
We start by creating a 4-D Tensor for layer1_weights. This Tensor represents the weights of the various filters that will be used in our convolution and then trained via gradient descent. By default, TensorFlow uses the format [filter_height, filter_width, in_depth, out_depth] for convolutional filters. In this example, we’re defining 64 filters, each of which has a height of 3, a width of 3, and an input depth of 1.
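To make those shapes concrete, here is a minimal NumPy sketch of the operation (the images variable, the random values, and the conv2d_same helper are illustrative stand-ins of my own; layer1_weights and layer1_conv follow the text):

```python
import numpy as np

# A stand-in batch of 2 grayscale 28x28 images, and 64 filters of shape 3x3x1
# stored in TensorFlow's [filter_height, filter_width, in_depth, out_depth] layout.
images = np.random.rand(2, 28, 28, 1)
layer1_weights = np.random.rand(3, 3, 1, 64)

def conv2d_same(x, w):
    """Stride-1 convolution with SAME (zero) padding, NumPy only."""
    batch, in_h, in_w, in_d = x.shape
    f_h, f_w, _, out_d = w.shape
    pad_h, pad_w = (f_h - 1) // 2, (f_w - 1) // 2
    x = np.pad(x, [(0, 0), (pad_h, pad_h), (pad_w, pad_w), (0, 0)])
    out = np.zeros((batch, in_h, in_w, out_d))
    for i in range(in_h):
        for j in range(in_w):
            # Dot product of each 3x3x1 patch against all 64 filters at once.
            patch = x[:, i:i + f_h, j:j + f_w, :]
            out[:, i, j, :] = np.tensordot(patch, w, axes=([1, 2, 3], [0, 1, 2]))
    return out

layer1_conv = conv2d_same(images, layer1_weights)
print(layer1_conv.shape)  # (2, 28, 28, 64): same height/width, depth = number of filters
```

Note how the output depth comes entirely from the number of filters, while height and width are preserved by the zero padding.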
It’s important to remember that in_depth must always match the depth of the input we’re convolving. If our images were RGB, we would have had to create filters with a depth of 3. On the other hand, we can increase or decrease the output depth simply by changing the value we specify for out_depth. This represents how many independent filters we’ll create and therefore the depth of the output. In our example, we’ve specified 64 filters and we can see that layer1_conv has a corresponding depth of 64.
Stride represents how fast we move the filter along each dimension. By default, TensorFlow expects stride to be defined in terms of [batch_stride, height_stride, width_stride, depth_stride]. Typically, batch_stride and depth_stride are always 1, as we don’t want to skip over examples in a batch or entire slices of a volume. In the above example, we’re using strides=[1,1,1,1] to specify that we’ll be moving the filters across the image one pixel at a time.
TensorFlow allows us to specify either VALID or SAME padding. Specifying VALID does not pad the image with zeroes. Specifying SAME pads the image with enough zeroes such that the output will have the same height and width dimensions as the input, assuming we’re using a stride of 1. Most of the time we use SAME padding so as not to have the output shrink at each layer of our network. To dig into the specifics of how padding is calculated, see TensorFlow’s documentation on convolutions.
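Those padding rules reduce to a simple shape formula. Here is a small sketch based on TensorFlow’s documented output-shape rules (the function name is my own):

```python
import math

def conv_output_shape(in_h, in_w, filter_h, filter_w, stride_h, stride_w, padding):
    """Output height/width of a 2-D convolution, following TensorFlow's rules."""
    if padding == "VALID":
        # No zero padding: the filter must fit entirely inside the input.
        out_h = math.ceil((in_h - filter_h + 1) / stride_h)
        out_w = math.ceil((in_w - filter_w + 1) / stride_w)
    elif padding == "SAME":
        # Enough zero padding that output size depends only on the stride.
        out_h = math.ceil(in_h / stride_h)
        out_w = math.ceil(in_w / stride_w)
    return out_h, out_w

# A 3x3 filter over a 28x28 input:
print(conv_output_shape(28, 28, 3, 3, 1, 1, "SAME"))   # (28, 28)
print(conv_output_shape(28, 28, 3, 3, 1, 1, "VALID"))  # (26, 26)
print(conv_output_shape(28, 28, 3, 3, 2, 2, "SAME"))   # (14, 14)
```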
Finally, we have to remember to include a bias term for each filter. Since we’ve created 64 filters, we’ll have to create a bias term of size 64. We apply bias after performing the convolution operation, but before passing the result to our ReLU non-linearity.
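As a quick illustrative sketch (the random values and variable names here are stand-ins modeled on the text):

```python
import numpy as np

# A hypothetical convolution output for a batch of 2 images: 28x28 spatial, 64 filters.
layer1_conv = np.random.randn(2, 28, 28, 64)
layer1_bias = np.zeros(64)  # one bias term per filter

# The bias broadcasts across batch, height, and width; ReLU clips negatives to zero.
layer1_out = np.maximum(layer1_conv + layer1_bias, 0.0)
print(layer1_out.shape)  # (2, 28, 28, 64)
```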
As the above shows, as the input flows through our network, intermediate representations (e.g. layer1_out) keep the same width and height while increasing in depth. However, if we continue making deeper and deeper representations, we’ll find that the number of operations we need to perform will explode. Each of the filters has to be dragged across a 28x28 input, taking the dot product at each location. As our filters get deeper, this results in larger and larger groups of multiplications and additions.
Periodically we would like to downsample and compress our intermediate representations to have smaller height and width dimensions. The most common way to do this is by using a max pooling operation.
Max pooling is relatively simple. We slide a window (also called a kernel) along the input and simply take the max value at each point. As with convolutions, we can control the size of the sliding window, the stride of the window and choose whether or not to pad the input with zeroes.
Below is a simple example demonstrating max pooling on an unpadded input of 4x4 with a kernel size of 2x2 and a stride of 2.
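A minimal NumPy sketch of that example (the input values are my own illustration, since the original figure isn’t reproduced here):

```python
import numpy as np

# A 4x4 input, pooled with a 2x2 kernel and a stride of 2 (no padding).
x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 0],
              [3, 1, 1, 2],
              [7, 2, 8, 9]])

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on an unpadded 2-D input."""
    h, w = x.shape
    # Group the input into non-overlapping 2x2 blocks, then take each block's max.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

print(max_pool_2x2(x))
# [[6 5]
#  [7 9]]
```

Each output value is simply the largest value in the corresponding 2x2 window, so the 4x4 input shrinks to 2x2.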
Max pooling is the most popular way to downsample, but it’s certainly not the only way. Alternatives include average pooling, which takes the average value at each point, or vanilla convolutions with a stride of 2. For more on this approach see: The All Convolutional Net.
The most common form of max pooling uses a 2x2 kernel (ksize=[1,2,2,1]) and a stride of 2 in the width and height dimensions (strides=[1,2,2,1]).
Putting it all together
Finally we have all the pieces to build our first convolutional neural network. Below is a network with four convolutional layers and two max pooling layers (You can find the complete code at the end of this post).
Before diving into the code, let’s take a look at a visualization of our network up to pool2 to get a sense of what’s going on:
There are a few things worth noticing here. First, notice that the in_depth of each set of convolutional filters matches the depth of the previous layer. Also note that the depth of each intermediate layer is determined by the number of filters (out_depth) at each layer.
We should also notice that every pooling layer we’ve used is a 2x2 max pooling operation using stride=[1,2,2,1]. Recall the default format for stride is [batch_stride, height_stride, width_stride, depth_stride]. This means that we slide through the height and width dimensions twice as fast as depth. This results in a shrinkage of height and width by a factor of 2. As data moves through our network, the representations become deeper with smaller width and height dimensions.
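To see how the shapes evolve, here is a bit of shape bookkeeping (no real math; only the first layer’s 64 filters and pool2’s depth of 128 appear in the text, so the middle filter counts are my assumptions):

```python
# Trace a 28x28x1 input through a hypothetical stack of four SAME-padded
# conv layers and two 2x2 max pools.
def conv(shape, out_depth):
    # SAME padding, stride 1: only the depth changes.
    h, w, _ = shape
    return (h, w, out_depth)

def pool(shape):
    # 2x2 max pool, stride 2: halves height and width.
    h, w, d = shape
    return (h // 2, w // 2, d)

shape = (28, 28, 1)
shape = conv(shape, 64)    # (28, 28, 64)
shape = conv(shape, 64)    # (28, 28, 64)  <- assumed filter count
shape = pool(shape)        # (14, 14, 64)
shape = conv(shape, 128)   # (14, 14, 128) <- assumed filter count
shape = conv(shape, 128)   # (14, 14, 128)
shape = pool(shape)        # (7, 7, 128)   <- pool2
print(shape)
```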
Finally, the last six lines are a little bit tricky. At the conclusion of our network we need to make predictions about which number we’re seeing. The way we do that is by adding a fully connected layer at the very end of our network. We reshape pool2 from a 7x7x128 3-D volume to a single vector with 6,272 values. Then we connect this vector to 10 output logits from which we can extract our predictions.
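A sketch of that final step (random stand-in values; fc_weights and fc_bias are hypothetical names for the fully connected layer’s parameters):

```python
import numpy as np

# Hypothetical pool2 activations for a batch of 2 examples.
pool2 = np.random.randn(2, 7, 7, 128)

flat = pool2.reshape(-1, 7 * 7 * 128)   # each example becomes a 6,272-value vector
fc_weights = np.random.randn(6272, 10)  # fully connected layer: 6,272 -> 10 logits
fc_bias = np.zeros(10)

logits = flat @ fc_weights + fc_bias
predictions = logits.argmax(axis=1)     # predicted digit for each example
print(flat.shape, logits.shape)  # (2, 6272) (2, 10)
```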
With everything in place, we can run our network and take a look at how well it performs:
Cost: 979579.0 Accuracy: 7.0000000298 %
Cost: 174063.0 Accuracy: 23.9999994636 %
Cost: 95255.1 Accuracy: 47.9999989271 %
...
Cost: 10001.9 Accuracy: 87.9999995232 %
Cost: 16117.2 Accuracy: 77.999997139 %
Test Cost: 15083.0833307
Test accuracy: 81.8799999356 %
Yikes. There are two things that jump out at me when I look at these numbers:
- The cost seems very high despite achieving a reasonable result.
- The test accuracy has decreased when compared to our fully-connected network, which achieved an accuracy of ~89%.
So are convolutional nets broken? Was all this effort for nothing? Not quite. Next time we’ll look at an underlying problem with how we’re choosing our initial random weight values and an improved strategy that should improve our results beyond that of our fully-connected network.