LTFN 5: Building a ConvNet

In the last post we looked at the building blocks of a convolutional neural net. The convolution operation works by sliding a filter along the input and taking the dot product at each location to generate an output volume.

The parameters we need to consider when building a convolutional layer are:

1. Padding – Should we pad the input with zeroes?
2. Stride – Should we move the filter more than one pixel at a time?
3. Input depth – Each convolutional filter must have a depth that matches the input depth.
4. Number of filters – We can stack multiple filters to increase the depth of the output.

With this knowledge we can construct our first convolutional neural network. We’ll start by creating a single convolutional layer that operates on a batch of input images of size 28x28x1.

Visualization of layer1 with the corresponding dimensions marked.

We start by creating a 4-D Tensor for layer1_weights. This Tensor represents the weights of the various filters that will be used in our convolution and then trained via gradient descent. By default, TensorFlow uses the format [filter_height, filter_width, in_depth, out_depth] for convolutional filters. In this example, we’re defining 64 filters each of which has a height of 3, width of 3, and an input depth of 1.


It’s important to remember that in_depth must always match the depth of the input we’re convolving. If our images were RGB, we would have had to create filters with a depth of 3.

On the other hand, we can increase or decrease output depth simply by changing the value we specify for out_depth. This represents how many independent filters we’ll create and therefore the depth of the output. In our example, we’ve specified 64 filters and we can see layer1_conv has a corresponding depth of 64.
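To make the relationship between in_depth, out_depth and the output shape concrete, here is a minimal pure-Python sketch of a convolution. It is not TensorFlow's implementation: the function name conv2d_valid is hypothetical, it uses no padding and a stride of 1, and it uses 4 filters on a 5x5 input instead of 64 filters on a 28x28 input to keep the numbers small.

```python
def conv2d_valid(image, filters):
    """Naive convolution with no padding and a stride of 1.

    image:   nested lists shaped [height][width][in_depth]
    filters: nested lists shaped [filter_height][filter_width][in_depth][out_depth]
    returns: nested lists shaped [out_height][out_width][out_depth]
    """
    height, width, depth = len(image), len(image[0]), len(image[0][0])
    f_height, f_width = len(filters), len(filters[0])
    in_depth, out_depth = len(filters[0][0]), len(filters[0][0][0])
    if in_depth != depth:
        raise ValueError("filter in_depth %d != input depth %d" % (in_depth, depth))

    output = []
    for i in range(height - f_height + 1):
        row = []
        for j in range(width - f_width + 1):
            pixel = []
            for f in range(out_depth):
                # Dot product between filter f and the patch under it.
                total = 0.0
                for di in range(f_height):
                    for dj in range(f_width):
                        for c in range(depth):
                            total += image[i + di][j + dj][c] * filters[di][dj][c][f]
                pixel.append(total)
            row.append(pixel)
        output.append(row)
    return output


# A 5x5 grayscale image (depth 1) filled with ones.
image = [[[1.0] for _ in range(5)] for _ in range(5)]
# Four 3x3 filters with in_depth 1; filter f has every weight equal to f + 1.
filters = [[[[float(f + 1) for f in range(4)]] for _ in range(3)] for _ in range(3)]

out = conv2d_valid(image, filters)
print(len(out), len(out[0]), len(out[0][0]))  # 3 3 4
print(out[0][0])  # [9.0, 18.0, 27.0, 36.0]
```

The output depth is 4 because we supplied 4 filters. Passing an RGB image (depth 3) to these depth-1 filters raises a ValueError, which mirrors the rule that in_depth must match the input depth.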


Stride represents how fast we move the filter along each dimension. By default, TensorFlow expects stride to be defined in terms of [batch_stride, height_stride, width_stride, depth_stride]. batch_stride and depth_stride are almost always 1, as we don’t want to skip over examples in a batch or entire slices of the volume. In this example, we’re using strides=[1,1,1,1] to specify that we’ll be moving the filters across the image one pixel at a time.
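As a rough illustration of what the height and width strides do, the sketch below lists where a filter's leading edge lands along one unpadded dimension. The helper name window_starts is hypothetical and only models a single dimension.

```python
def window_starts(in_size, kernel_size, stride):
    """Start positions of a filter of width kernel_size along one unpadded dimension."""
    return list(range(0, in_size - kernel_size + 1, stride))


print(window_starts(7, 3, 1))  # [0, 1, 2, 3, 4]
print(window_starts(7, 3, 2))  # [0, 2, 4]
```

With a stride of 1 the filter visits every position; with a stride of 2 it skips every other one, so the output along that dimension is roughly half as long.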


TensorFlow allows us to specify either SAME or VALID padding. VALID padding does not pad the image with zeroes. Specifying SAME pads the image with enough zeroes that the output has the same height and width as the input, assuming we’re using a stride of 1. Most of the time we use SAME padding so that the output doesn’t shrink at each layer of our network. To dig into the specifics of how padding is calculated, see TensorFlow’s documentation on convolutions.
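The output sizes for the two padding modes can be sketched with a couple of small helpers. The formulas follow the ones given in TensorFlow's documentation on convolution padding; the helper names conv_output_size and same_padding are made up for this sketch and are not TensorFlow functions.

```python
import math


def conv_output_size(in_size, kernel_size, stride, padding):
    """Output length along one spatial dimension for SAME or VALID padding."""
    if padding == "SAME":
        return math.ceil(in_size / stride)
    if padding == "VALID":
        return math.ceil((in_size - kernel_size + 1) / stride)
    raise ValueError("padding must be 'SAME' or 'VALID'")


def same_padding(in_size, kernel_size, stride):
    """Zeroes added (before, after) along one dimension under SAME padding."""
    out_size = math.ceil(in_size / stride)
    total = max((out_size - 1) * stride + kernel_size - in_size, 0)
    before = total // 2
    return before, total - before


print(conv_output_size(28, 3, 1, "SAME"))   # 28
print(conv_output_size(28, 3, 1, "VALID"))  # 26
print(same_padding(28, 3, 1))               # (1, 1)
```

For our 28x28 input and 3x3 filters, SAME padding adds one row or column of zeroes on each side and keeps the output at 28x28, while VALID padding shrinks it to 26x26.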


Finally, we have to remember to include a bias term for each filter. Since we’ve created 64 filters, we’ll have to create a bias term of size 64. We apply bias after performing the convolution operation, but before passing the result to our ReLU non-linearity.
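The order of operations (convolve, add bias, then ReLU) can be sketched in plain Python. This is an illustration only: the function name add_bias_and_relu and the tiny 1x2x3 input are made up, standing in for a real convolution output with depth 64.

```python
def add_bias_and_relu(conv_out, biases):
    """Add one bias per filter (depth channel), then apply ReLU element-wise.

    conv_out: nested lists shaped [height][width][out_depth]
    biases:   list of length out_depth
    """
    if len(conv_out[0][0]) != len(biases):
        raise ValueError("need exactly one bias per filter")
    return [
        [[max(0.0, value + bias) for value, bias in zip(pixel, biases)] for pixel in row]
        for row in conv_out
    ]


# A 1x2 "image" with 3 channels, i.e. the output of 3 filters.
conv_out = [[[1.0, -2.0, 0.5], [-1.0, 3.0, -0.5]]]
biases = [0.5, 1.0, -1.0]

print(add_bias_and_relu(conv_out, biases))
# [[[1.5, 0.0, 0.0], [0.0, 4.0, 0.0]]]
```

Each bias is shared across every spatial position of its filter's output, which is why 64 filters need only 64 bias values.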


Max Pooling

As we’ve seen, as the input flows through our network, intermediate representations (e.g. layer1_out) keep the same width and height while increasing in depth. However, if we keep making deeper and deeper representations, the number of operations we need to perform will explode. Each filter has to be dragged across a 28x28 input, taking a dot product at every location. As our filters get deeper, each of those dot products involves more and more multiplications and additions.
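To get a feel for how quickly the work grows, here is a back-of-the-envelope count of multiply-adds for one convolutional layer. It assumes SAME padding with a stride of 1 (so the output has the same height and width as the input) and ignores the bias additions; the function name conv_multiply_adds is hypothetical.

```python
def conv_multiply_adds(height, width, filter_h, filter_w, in_depth, out_depth):
    """Multiply-adds for one conv layer whose output is height x width x out_depth."""
    per_output_value = filter_h * filter_w * in_depth  # size of one dot product
    return height * width * out_depth * per_output_value


# 3x3 filters on a 28x28 input.
print(conv_multiply_adds(28, 28, 3, 3, 1, 64))   # 451584
print(conv_multiply_adds(28, 28, 3, 3, 64, 64))  # 28901376
# The same 64 -> 64 layer on a 14x14 input costs a quarter as much.
print(conv_multiply_adds(14, 14, 3, 3, 64, 64))  # 7225344
```

Going from an input depth of 1 to an input depth of 64 multiplies the cost by 64, while halving the height and width cuts it by a factor of 4.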

Periodically we would like to downsample and compress our intermediate representations to have smaller height and width dimensions. The most common way to do this is by using a max pooling operation.

Max pooling is relatively simple. We slide a window (also called a kernel) along the input and simply take the max value at each point. As with convolutions, we can control the size of the sliding window, the stride of the window and choose whether or not to pad the input with zeroes.

Below is a simple example demonstrating max pooling on an unpadded input of 4x4 with a kernel size of 2x2 and a stride of 2:
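The same operation can be sketched in plain Python. The input values are made up for illustration, and max_pool_2d is a hypothetical helper that handles a single unpadded 2-D slice rather than a full batch.

```python
def max_pool_2d(grid, ksize=2, stride=2):
    """Max pooling over an unpadded 2-D grid (list of rows)."""
    rows, cols = len(grid), len(grid[0])
    return [
        [
            max(grid[i + di][j + dj] for di in range(ksize) for dj in range(ksize))
            for j in range(0, cols - ksize + 1, stride)
        ]
        for i in range(0, rows - ksize + 1, stride)
    ]


grid = [
    [1, 3, 2, 1],
    [4, 6, 5, 0],
    [7, 2, 9, 8],
    [1, 0, 3, 4],
]

print(max_pool_2d(grid))  # [[6, 5], [7, 9]]
```

Each 2x2 block of the 4x4 input collapses to its largest value, leaving a 2x2 output.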

Max pooling is the most popular way to downsample, but it’s certainly not the only way. Alternatives include average pooling, which takes the average value in each window, and plain convolutions with a stride of 2. For more on the latter approach, see The All Convolutional Net.
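For comparison, average pooling only changes the reduction applied to each window. A minimal sketch, again on a single unpadded 2-D slice with made-up values and a hypothetical helper name avg_pool_2d:

```python
def avg_pool_2d(grid, ksize=2, stride=2):
    """Average pooling over an unpadded 2-D grid (list of rows)."""
    rows, cols = len(grid), len(grid[0])
    window_area = ksize * ksize
    return [
        [
            sum(grid[i + di][j + dj] for di in range(ksize) for dj in range(ksize))
            / window_area
            for j in range(0, cols - ksize + 1, stride)
        ]
        for i in range(0, rows - ksize + 1, stride)
    ]


grid = [
    [1, 3, 2, 1],
    [4, 6, 5, 0],
    [7, 2, 9, 8],
    [1, 0, 3, 4],
]

print(avg_pool_2d(grid))  # [[3.5, 2.0], [2.5, 6.0]]
```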

The most common form of max pooling uses a 2x2 kernel (ksize=[1,2,2,1]) and a stride of 2 in the width and height dimensions (strides=[1,2,2,1]).


Putting it all together

Finally, we have all the pieces to build our first convolutional neural network. Below is a network with four convolutional layers and two max pooling layers (you can find the complete code at the end of this post).


Before diving into the code, let’s take a look at a visualization of our network from input through pool2 to get a sense of what’s going on:

Visualization of layers from input through pool2.


There are a few things worth noticing here. First, notice that the in_depth of each set of convolutional filters matches the depth of the previous layer. Also note that the depth of each intermediate layer is determined by the number of filters (out_depth) at that layer.

We should also notice that every pooling layer we’ve used is a 2x2 max pooling operation using strides=[1,2,2,1]. Recall that the default format for strides is [batch_stride, height_stride, width_stride, depth_stride]. This means the window moves two positions at a time along the height and width dimensions, but only one at a time along batch and depth. As a result, height and width each shrink by a factor of 2. As data moves through our network, the representations become deeper with smaller width and height dimensions.

Finally, the last six lines are a little bit tricky. At the conclusion of our network we need to make predictions about which number we’re seeing. The way we do that is by adding a fully connected layer at the very end of our network. We reshape pool2 from a 7x7x128 3-D volume to a single vector with 7 x 7 x 128 = 6,272 values. Finally, we connect this vector to 10 output logits from which we can extract our predictions.
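The way the shapes evolve can be traced with a short sketch. The text fixes the 28x28x1 input, the 64 filters in layer1 and the 7x7x128 shape of pool2. The filter counts for the other convolutional layers and the position of the pooling layers are assumptions made for this illustration, and the names LAYERS and trace_shapes are hypothetical.

```python
import math


def trace_shapes(input_shape, layers):
    """Track (height, width, depth) through SAME-padded conv and pooling layers."""
    height, width, depth = input_shape
    shapes = [("input", (height, width, depth))]
    for name, kind, arg in layers:
        if kind == "conv":
            # Stride-1 SAME convolution: height and width unchanged,
            # depth becomes the number of filters (arg).
            depth = arg
        elif kind == "pool":
            # SAME-padded pooling with stride arg shrinks height and width.
            height, width = math.ceil(height / arg), math.ceil(width / arg)
        else:
            raise ValueError("unknown layer kind: %r" % kind)
        shapes.append((name, (height, width, depth)))
    return shapes


# Assumed layout: two 64-filter convs, pool, two 128-filter convs, pool.
LAYERS = [
    ("layer1", "conv", 64),
    ("layer2", "conv", 64),
    ("pool1", "pool", 2),
    ("layer3", "conv", 128),
    ("layer4", "conv", 128),
    ("pool2", "pool", 2),
]

shapes = trace_shapes((28, 28, 1), LAYERS)
for name, shape in shapes:
    print(name, shape)

final_h, final_w, final_d = shapes[-1][1]
flat_size = final_h * final_w * final_d
print(flat_size)  # 6272
```

The flattened vector of 6,272 values is what feeds the fully connected layer, so its weight matrix would have shape [6272, 10] to produce the 10 logits.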

With everything in place, we can run our network and take a look at how well it performs:

Cost: 979579.0
Accuracy: 7.0000000298 %
Cost: 174063.0
Accuracy: 23.9999994636 %
Cost: 95255.1
Accuracy: 47.9999989271 %


Cost: 10001.9
Accuracy: 87.9999995232 %
Cost: 16117.2
Accuracy: 77.999997139 %
Test Cost: 15083.0833307
Test accuracy: 81.8799999356 %


Yikes. There are two things that jump out at me when I look at these numbers:

  1. The cost seems very high despite achieving a reasonable result.
  2. The test accuracy has decreased when compared to our fully-connected network, which achieved an accuracy of ~89%.


So are convolutional nets broken? Was all this effort for nothing? Not quite. Next time we’ll look at an underlying problem with how we’re choosing our initial random weight values, and at a better strategy that should lift our results beyond those of our fully-connected network.


Complete Code

