LTFN 8: Deeper ConvNets

Part of the series Learn TensorFlow Now

Now that we’ve got a handle on convolutions, max pooling and weight initialization, the obvious question is: what’s next? How should we set up our network to achieve maximum accuracy on image recognition tasks? For years this has been a focus of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). Since 2010, researchers have battled various architectures against one another in an attempt to categorize millions of images into 1,000 categories. When tackling any image recognition task, it’s usually a good idea to pick one of the top-performing architectures instead of trying to craft your own from scratch.

VGGNet

VGGNet is a nice starting point, as it’s simply a deeper version of the network we’ve been building. Its debut in the 2014 ILSVRC competition was novel due to its exclusive use of 3x3 convolutional filters. Previous architectures had used a variety of filter sizes, including 11x11, 7x7 and 5x5. Each of these filter sizes was a hyper-parameter that had to be tuned, so it was a relief to see high performance with a single, small filter size. Stacking small filters is also efficient: two 3x3 convolutions cover the same receptive field as one 5x5 convolution while using fewer parameters.

As with our previous network, VGG staggers max-pooling layers between groups of convolutional layers. Below is a table listing the 16 layers of VGG alongside the intermediate shape at each layer of the network and the number of trainable parameters (i.e. weights, excluding biases) in each layer.

Original VGGNet

Layer                  Intermediate Shape  Parameters
Input                  224 x 224 x 3
64 3×3 Conv Filters    224 x 224 x 64      64 * 3 * 3 * 3 = 1,728
64 3×3 Conv Filters    224 x 224 x 64      64 * 3 * 3 * 64 = 36,864
maxpool 2×2            112 x 112 x 64
128 3×3 Conv Filters   112 x 112 x 128     128 * 3 * 3 * 64 = 73,728
128 3×3 Conv Filters   112 x 112 x 128     128 * 3 * 3 * 128 = 147,456
maxpool 2×2            56 x 56 x 128
256 3×3 Conv Filters   56 x 56 x 256       256 * 3 * 3 * 128 = 294,912
256 3×3 Conv Filters   56 x 56 x 256       256 * 3 * 3 * 256 = 589,824
256 3×3 Conv Filters   56 x 56 x 256       256 * 3 * 3 * 256 = 589,824
maxpool 2×2            28 x 28 x 256
512 3×3 Conv Filters   28 x 28 x 512       512 * 3 * 3 * 256 = 1,179,648
512 3×3 Conv Filters   28 x 28 x 512       512 * 3 * 3 * 512 = 2,359,296
512 3×3 Conv Filters   28 x 28 x 512       512 * 3 * 3 * 512 = 2,359,296
maxpool 2×2            14 x 14 x 512
512 3×3 Conv Filters   14 x 14 x 512       512 * 3 * 3 * 512 = 2,359,296
512 3×3 Conv Filters   14 x 14 x 512       512 * 3 * 3 * 512 = 2,359,296
512 3×3 Conv Filters   14 x 14 x 512       512 * 3 * 3 * 512 = 2,359,296
maxpool 2×2            7 x 7 x 512
FC 4096                1 x 1 x 4096        7 * 7 * 512 * 4096 = 102,760,448
FC 4096                1 x 1 x 4096        4096 * 4096 = 16,777,216
FC 1000                1 x 1 x 1000        4096 * 1000 = 4,096,000
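
To sanity-check the arithmetic above: a convolutional layer’s weight count is simply filters_out * filter_height * filter_width * filters_in. A quick check of a few entries (a throwaway snippet, not part of the network code):

# Weight count of a conv layer, excluding biases
def conv_params(filters_out, filter_size, filters_in):
    return filters_out * filter_size * filter_size * filters_in

print(conv_params(64, 3, 3))    # 1,728: first conv layer (3 input channels)
print(conv_params(128, 3, 64))  # 73,728: first 128-filter layer
print(7 * 7 * 512 * 4096)       # 102,760,448: first fully connected layer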


A few things to note about the VGG architecture:

  • It was originally built for images of size 224x224x3 and 1,000 output classes.
  • The number of parameters grows rapidly as we move through the network; the first fully connected layer alone accounts for over 100 million weights.
  • There are so many trainable parameters that we can only reasonably run such a network on a computer with a GPU.

We’ll make a few modifications to the VGG network in order to use it on our MNIST digits of shape 28x28x1. Notice that each max_pooling layer halves the width and height dimensions. Our images simply aren’t big enough to pass through so many max_pooling layers, so we’ll omit the final max_pooling layer and the final three 512 3x3 convolutional layers. We’ll also pad our 28x28 images to size 32x32 so the width and height divide by two cleanly at each pooling step.

Modified VGGNet

Layer                  Intermediate Shape  Parameters
Input                  28 x 28 x 1
Pad Image              32 x 32 x 1
64 3×3 Conv Filters    32 x 32 x 64        64 * 3 * 3 * 1 = 576
64 3×3 Conv Filters    32 x 32 x 64        64 * 3 * 3 * 64 = 36,864
maxpool 2×2            16 x 16 x 64
128 3×3 Conv Filters   16 x 16 x 128       128 * 3 * 3 * 64 = 73,728
128 3×3 Conv Filters   16 x 16 x 128       128 * 3 * 3 * 128 = 147,456
maxpool 2×2            8 x 8 x 128
256 3×3 Conv Filters   8 x 8 x 256         256 * 3 * 3 * 128 = 294,912
256 3×3 Conv Filters   8 x 8 x 256         256 * 3 * 3 * 256 = 589,824
256 3×3 Conv Filters   8 x 8 x 256         256 * 3 * 3 * 256 = 589,824
maxpool 2×2            4 x 4 x 256
512 3×3 Conv Filters   4 x 4 x 512         512 * 3 * 3 * 256 = 1,179,648
512 3×3 Conv Filters   4 x 4 x 512         512 * 3 * 3 * 512 = 2,359,296
512 3×3 Conv Filters   4 x 4 x 512         512 * 3 * 3 * 512 = 2,359,296
maxpool 2×2            2 x 2 x 512
FC 4096                1 x 1 x 4096        2 * 2 * 512 * 4096 = 8,388,608
FC 10                  1 x 1 x 10          4096 * 10 = 40,960


In previous posts we’ve encountered fully connected layers, convolutional layers and max pooling operations. The only portion of this network we’ve not seen before is the initial padding step. TensorFlow makes this easy to accomplish via tf.image.resize_image_with_crop_or_pad.
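
For example, a minimal sketch of the padding step (the placeholder name input is illustrative):

input = tf.placeholder(tf.float32, shape=(None, 28, 28, 1))
# Zero-pad each 28x28 image out to 32x32; the original image stays centered
padded_input = tf.image.resize_image_with_crop_or_pad(input, 32, 32)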

We’ll also make use of the tf.train.AdamOptimizer discussed in the previous post:
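
# A minimal sketch: 'cost' stands for our network's loss node, and 0.001 is
# AdamOptimizer's default learning rate
optimizer = tf.train.AdamOptimizer(learning_rate=0.001).minimize(cost)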

With these two changes, we can create our modified version of VGGNet, presented in full at the end of this post.

Running our network gives us the following output:

Cost: 3.19188
Accuracy: 10.9999999404 %
Cost: 0.140771
Accuracy: 94.9999988079 %
Cost: 0.120058
Accuracy: 95.9999978542 %
Cost: 0.128447
Accuracy: 97.000002861 %
Cost: 0.0849798
Accuracy: 95.9999978542 %
Cost: 0.0180758
Accuracy: 99.0000009537 %
Cost: 0.0622907
Accuracy: 99.0000009537 %
Cost: 0.147945
Accuracy: 95.9999978542 %
Cost: 0.0502743
Accuracy: 99.0000009537 %
Cost: 0.149534
Accuracy: 99.0000009537 %
Test Cost: 0.0713789960416
Test accuracy: 97.8600007892 %


Running this network gives us a test accuracy of ~97.9% compared to our previous best of 97.3%. This is an improvement, but we’re starting to see fairly marginal improvements. In fact, I wouldn’t necessarily be convinced that our VGG network truly outperforms our previous best without running each network multiple times and comparing the average accuracies achieved. There’s a very real possibility that our small improvement may have just been due to chance. We won’t run this comparison here, but it’s something to consider when you’re starting to see very marginal improvements in your own networks.
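
If you did want to run such a comparison, here’s a minimal sketch, where train_and_evaluate() is a hypothetical helper that builds, trains and tests a fresh network and returns its test accuracy:

import numpy as np

# Train each candidate network several times and compare the mean accuracies
accuracies = [train_and_evaluate() for _ in range(10)]
print('Mean: %.2f%%  Std: %.2f%%' % (np.mean(accuracies), np.std(accuracies)))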

Next week we’ll look at saving and restoring our model, and we’ll examine some of the images our network misclassifies in order to build a better intuition for what might be going on.


Complete Code
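
Below is a minimal sketch of the modified network described in this post. The helper names, He-style initialization and training boilerplate are illustrative assumptions rather than a definitive implementation:

import numpy as np
import tensorflow as tf

def conv_layer(input, filters_out):
    # 3x3 convolution (stride 1, SAME padding) followed by a ReLU
    filters_in = int(input.shape[-1])
    weights = tf.Variable(tf.truncated_normal(
        [3, 3, filters_in, filters_out],
        stddev=np.sqrt(2.0 / (3 * 3 * filters_in))))  # He-style initialization (assumed)
    bias = tf.Variable(tf.zeros([filters_out]))
    conv = tf.nn.conv2d(input, weights, strides=[1, 1, 1, 1], padding='SAME')
    return tf.nn.relu(conv + bias)

def max_pool(input):
    # 2x2 max pooling, halving width and height
    return tf.nn.max_pool(input, ksize=[1, 2, 2, 1],
                          strides=[1, 2, 2, 1], padding='SAME')

def fc_layer(input, size_out):
    # Fully connected layer; activation is applied at the call site
    size_in = int(input.shape[-1])
    weights = tf.Variable(tf.truncated_normal(
        [size_in, size_out], stddev=np.sqrt(2.0 / size_in)))
    bias = tf.Variable(tf.zeros([size_out]))
    return tf.matmul(input, weights) + bias

input = tf.placeholder(tf.float32, shape=(None, 28, 28, 1))
labels = tf.placeholder(tf.float32, shape=(None, 10))

# Zero-pad 28x28 images out to 32x32 so each maxpool halves the dimensions cleanly
layer = tf.image.resize_image_with_crop_or_pad(input, 32, 32)

layer = conv_layer(layer, 64)   # 32x32x64
layer = conv_layer(layer, 64)   # 32x32x64
layer = max_pool(layer)         # 16x16x64

layer = conv_layer(layer, 128)  # 16x16x128
layer = conv_layer(layer, 128)  # 16x16x128
layer = max_pool(layer)         # 8x8x128

layer = conv_layer(layer, 256)  # 8x8x256
layer = conv_layer(layer, 256)  # 8x8x256
layer = conv_layer(layer, 256)  # 8x8x256
layer = max_pool(layer)         # 4x4x256

layer = conv_layer(layer, 512)  # 4x4x512
layer = conv_layer(layer, 512)  # 4x4x512
layer = conv_layer(layer, 512)  # 4x4x512
layer = max_pool(layer)         # 2x2x512

flat = tf.reshape(layer, [-1, 2 * 2 * 512])
fc1 = tf.nn.relu(fc_layer(flat, 4096))
logits = fc_layer(fc1, 10)

cost = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits))
optimizer = tf.train.AdamOptimizer().minimize(cost)

predictions = tf.nn.softmax(logits)
correct = tf.equal(tf.argmax(predictions, 1), tf.argmax(labels, 1))
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32)) * 100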
