Part of the series Learn TensorFlow Now
Now that we’ve got a handle on convolutions, max pooling and weight initialization, the obvious question is: What’s next? How should we set up our network to achieve maximum accuracy on image recognition tasks? For years this has been a focus of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) competitions. Since 2010, researchers have pitted various architectures against one another in an attempt to categorize millions of images into 1,000 categories. When tackling any image recognition task, it’s usually a good idea to pick one of the top-performing architectures instead of trying to craft your own from scratch.
VGGNet
VGGNet is a nice starting point, as it’s simply a deeper version of the network we’ve been building. Its debut in the 2014 ILSVRC competition was novel due to its exclusive use of 3x3 convolutional filters. Previous architectures had used a variety of filter sizes, including 11x11, 7x7 and 5x5. Each of these filter sizes was a hyperparameter that had to be tuned, so it was a relief to see high performance from a single, consistently small filter size.
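To see why small filters are appealing, note that two stacked 3x3 convolutions cover the same 5x5 receptive field as a single 5x5 convolution, while using fewer weights and adding an extra non-linearity between them. A quick back-of-the-envelope comparison, assuming the same number of input and output channels at each layer:

```python
# Weights (excluding biases) in a conv layer with k x k filters
# mapping `channels` input channels to `channels` output channels.
def conv_weights(k, channels):
    return channels * k * k * channels

channels = 64
one_5x5 = conv_weights(5, channels)      # 25 * 64 * 64 = 102,400 weights
two_3x3 = 2 * conv_weights(3, channels)  # 18 * 64 * 64 =  73,728 weights
print(one_5x5, two_3x3)  # same 5x5 receptive field, ~28% fewer weights
```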
As with our previous network, VGG operates by staggering max-pooling layers between groups of convolutional layers. Below is a table listing the 16 layers of VGG alongside the intermediate shape at each layer of the network and the number of trainable parameters (i.e. weights, excluding biases) at each layer.
Original VGGNet
| Layer | Intermediate Shape | Parameters |
| --- | --- | --- |
| Input | 224 x 224 x 3 | |
| 64 3×3 Conv Filters | 224 x 224 x 64 | 64 * 3 * 3 * 3 = 1,728 |
| 64 3×3 Conv Filters | 224 x 224 x 64 | 64 * 3 * 3 * 64 = 36,864 |
| maxpool 2×2 | 112 x 112 x 64 | |
| 128 3×3 Conv Filters | 112 x 112 x 128 | 128 * 3 * 3 * 64 = 73,728 |
| 128 3×3 Conv Filters | 112 x 112 x 128 | 128 * 3 * 3 * 128 = 147,456 |
| maxpool 2×2 | 56 x 56 x 128 | |
| 256 3×3 Conv Filters | 56 x 56 x 256 | 256 * 3 * 3 * 128 = 294,912 |
| 256 3×3 Conv Filters | 56 x 56 x 256 | 256 * 3 * 3 * 256 = 589,824 |
| 256 3×3 Conv Filters | 56 x 56 x 256 | 256 * 3 * 3 * 256 = 589,824 |
| maxpool 2×2 | 28 x 28 x 256 | |
| 512 3×3 Conv Filters | 28 x 28 x 512 | 512 * 3 * 3 * 256 = 1,179,648 |
| 512 3×3 Conv Filters | 28 x 28 x 512 | 512 * 3 * 3 * 512 = 2,359,296 |
| 512 3×3 Conv Filters | 28 x 28 x 512 | 512 * 3 * 3 * 512 = 2,359,296 |
| maxpool 2×2 | 14 x 14 x 512 | |
| 512 3×3 Conv Filters | 14 x 14 x 512 | 512 * 3 * 3 * 512 = 2,359,296 |
| 512 3×3 Conv Filters | 14 x 14 x 512 | 512 * 3 * 3 * 512 = 2,359,296 |
| 512 3×3 Conv Filters | 14 x 14 x 512 | 512 * 3 * 3 * 512 = 2,359,296 |
| maxpool 2×2 | 7 x 7 x 512 | |
| FC 4096 | 1 x 1 x 4096 | 7 * 7 * 512 * 4096 = 102,760,448 |
| FC 4096 | 1 x 1 x 4096 | 4096 * 4096 = 16,777,216 |
| FC 1000 | 1 x 1 x 1000 | 4096 * 1000 = 4,096,000 |
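Every parameter count in the table comes from the same formula: output channels * filter height * filter width * input channels. A quick sanity check against a few rows of the table:

```python
# Weights (excluding biases) in a convolutional layer.
def conv_params(out_channels, filter_size, in_channels):
    return out_channels * filter_size * filter_size * in_channels

print(conv_params(64, 3, 3))     # first layer:    1,728
print(conv_params(64, 3, 64))    # second layer:   36,864
print(conv_params(512, 3, 512))  # deepest layers: 2,359,296
```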
A few things to note about the VGG architecture:

- It was originally built for images of size 224x224x3 and 1,000 output classes.
- The number of parameters grows rapidly as we move through the network, and the fully connected layers account for the vast majority of the weights (summed in the breakdown after this list).
- There are so many trainable parameters that we can only reasonably run such a network on a computer with a GPU.
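Summing the table’s per-layer counts makes those last two points concrete: the thirteen convolutional layers contribute roughly 14.7 million weights, while the three fully connected layers contribute roughly 123.6 million, for a total of about 138 million trainable weights:

```python
# Per-layer weight counts taken directly from the table above.
conv_params = [1728, 36864, 73728, 147456, 294912, 589824, 589824,
               1179648, 2359296, 2359296, 2359296, 2359296, 2359296]
fc_params = [102760448, 16777216, 4096000]

print(sum(conv_params))                   # 14,710,464 convolutional weights
print(sum(fc_params))                     # 123,633,664 fully connected weights
print(sum(conv_params) + sum(fc_params))  # 138,344,128 weights in total
```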
There are a couple of modifications we’ll make to the VGG network in order to use it on our MNIST digits of shape 28x28x1. Notice that after each max_pooling layer we halve the width and height dimensions. Unfortunately, our images just aren’t big enough to go through so many max_pooling layers. For this reason, we’ll omit the final max_pooling layer and the final three 512 3x3 convolutional layers. We’ll also pad our 28x28 images to be of size 32x32 so the widths and heights divide by two cleanly at each pooling step.
Modified VGGNet
| Layer | Intermediate Shape | Parameters |
| --- | --- | --- |
| Input | 28 x 28 x 1 | |
| Pad Image | 32 x 32 x 1 | |
| 64 3×3 Conv Filters | 32 x 32 x 64 | 64 * 3 * 3 * 1 = 576 |
| 64 3×3 Conv Filters | 32 x 32 x 64 | 64 * 3 * 3 * 64 = 36,864 |
| maxpool 2×2 | 16 x 16 x 64 | |
| 128 3×3 Conv Filters | 16 x 16 x 128 | 128 * 3 * 3 * 64 = 73,728 |
| 128 3×3 Conv Filters | 16 x 16 x 128 | 128 * 3 * 3 * 128 = 147,456 |
| maxpool 2×2 | 8 x 8 x 128 | |
| 256 3×3 Conv Filters | 8 x 8 x 256 | 256 * 3 * 3 * 128 = 294,912 |
| 256 3×3 Conv Filters | 8 x 8 x 256 | 256 * 3 * 3 * 256 = 589,824 |
| 256 3×3 Conv Filters | 8 x 8 x 256 | 256 * 3 * 3 * 256 = 589,824 |
| maxpool 2×2 | 4 x 4 x 256 | |
| 512 3×3 Conv Filters | 4 x 4 x 512 | 512 * 3 * 3 * 256 = 1,179,648 |
| 512 3×3 Conv Filters | 4 x 4 x 512 | 512 * 3 * 3 * 512 = 2,359,296 |
| 512 3×3 Conv Filters | 4 x 4 x 512 | 512 * 3 * 3 * 512 = 2,359,296 |
| maxpool 2×2 | 2 x 2 x 512 | |
| FC 4096 | 1 x 1 x 4096 | 2 * 2 * 512 * 4096 = 8,388,608 |
| FC 10 | 1 x 1 x 10 | 4096 * 10 = 40,960 |
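For comparison with the original network, summing this table gives about 7.6 million convolutional weights and about 8.4 million fully connected weights, roughly 16 million in total, less than an eighth of the original VGGNet’s parameter count:

```python
# Per-layer weight counts from the modified network's table.
conv_params = [576, 36864, 73728, 147456, 294912, 589824, 589824,
               1179648, 2359296, 2359296]
fc_params = [8388608, 40960]
print(sum(conv_params) + sum(fc_params))  # 16,060,992 weights in total
```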
In previous posts we’ve encountered fully connected layers, convolutional layers and max-pooling operations. The only portion of this network we haven’t seen before is the initial padding step. TensorFlow makes this easy to accomplish via tf.image.resize_image_with_crop_or_pad:
```python
input = tf.placeholder(tf.float32, shape=(None, 28, 28, 1))  # 28x28x1
padded_input = tf.image.resize_image_with_crop_or_pad(input, target_height=32, target_width=32)  # 32x32x1
```
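If you’d rather be explicit about where the zeros are added, the same centered 28x28 to 32x32 zero-padding can also be written with tf.pad. A minimal sketch of the equivalent operation:

```python
# Pad 2 rows/columns of zeros on each side of the height and width
# dimensions, leaving the batch and channel dimensions untouched.
padded_input = tf.pad(input, paddings=[[0, 0], [2, 2], [2, 2], [0, 0]])  # 32x32x1
```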
We’ll also make use of the tf.train.AdamOptimizer discussed in the previous post:
```python
learning_rate = 0.001
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
```
With these two changes, we can create our modified version of VGGNet, presented in full at the end of this post.
Running our network gives us the following output:
```
Cost: 3.19188 Accuracy: 10.9999999404 %
Cost: 0.140771 Accuracy: 94.9999988079 %
Cost: 0.120058 Accuracy: 95.9999978542 %
Cost: 0.128447 Accuracy: 97.000002861 %
Cost: 0.0849798 Accuracy: 95.9999978542 %
Cost: 0.0180758 Accuracy: 99.0000009537 %
Cost: 0.0622907 Accuracy: 99.0000009537 %
Cost: 0.147945 Accuracy: 95.9999978542 %
Cost: 0.0502743 Accuracy: 99.0000009537 %
Cost: 0.149534 Accuracy: 99.0000009537 %
Test Cost: 0.0713789960416
Test accuracy: 97.8600007892 %
```
Running this network gives us a test accuracy of ~97.9% compared to our previous best of 97.3%. This is an improvement, but we’re starting to see fairly marginal improvements. In fact, I wouldn’t necessarily be convinced that our VGG network truly outperforms our previous best without running each network multiple times and comparing the average accuracies achieved. There’s a very real possibility that our small improvement may have just been due to chance. We won’t run this comparison here, but it’s something to consider when you’re starting to see very marginal improvements in your own networks.
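If you do want to check whether a small gap like this is real, one lightweight approach is to train each network several times with different random seeds and compare the mean test accuracies. A sketch, assuming a hypothetical train_and_evaluate() helper (not defined in this post) that builds the graph, runs the training loop and returns the final test accuracy:

```python
import numpy as np

# train_and_evaluate is a hypothetical helper wrapping the graph
# construction and training loop shown below.
accuracies = [train_and_evaluate(seed=s) for s in range(5)]
print("mean:", np.mean(accuracies), "std:", np.std(accuracies))
```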
Next week we’ll look at saving and restoring our model, and we’ll examine some of the images on which our network makes mistakes in order to build a better intuition for what might be going on.
Complete Code
```python
import tensorflow as tf
import numpy as np
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets('MNIST_data', one_hot=True)
train_images = np.reshape(mnist.train.images, (-1, 28, 28, 1))
train_labels = mnist.train.labels
test_images = np.reshape(mnist.test.images, (-1, 28, 28, 1))
test_labels = mnist.test.labels

graph = tf.Graph()
with graph.as_default():
    input = tf.placeholder(tf.float32, shape=(None, 28, 28, 1))
    labels = tf.placeholder(tf.float32, shape=(None, 10))

    # Pad the 28x28 images to 32x32 so the dimensions halve cleanly
    # through four rounds of 2x2 max pooling.
    padded_input = tf.image.resize_image_with_crop_or_pad(input, target_height=32, target_width=32)

    # Block 1: two 64-filter 3x3 conv layers, then 2x2 max pooling.
    layer1_weights = tf.get_variable("layer1_weights", [3, 3, 1, 64], initializer=tf.contrib.layers.variance_scaling_initializer())
    layer1_bias = tf.Variable(tf.zeros([64]))
    layer1_conv = tf.nn.conv2d(padded_input, filter=layer1_weights, strides=[1, 1, 1, 1], padding='SAME')
    layer1_out = tf.nn.relu(layer1_conv + layer1_bias)

    layer2_weights = tf.get_variable("layer2_weights", [3, 3, 64, 64], initializer=tf.contrib.layers.variance_scaling_initializer())
    layer2_bias = tf.Variable(tf.zeros([64]))
    layer2_conv = tf.nn.conv2d(layer1_out, filter=layer2_weights, strides=[1, 1, 1, 1], padding='SAME')
    layer2_out = tf.nn.relu(layer2_conv + layer2_bias)

    pool1 = tf.nn.max_pool(layer2_out, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='VALID')

    # Block 2: two 128-filter conv layers, then max pooling.
    layer3_weights = tf.get_variable("layer3_weights", [3, 3, 64, 128], initializer=tf.contrib.layers.variance_scaling_initializer())
    layer3_bias = tf.Variable(tf.zeros([128]))
    layer3_conv = tf.nn.conv2d(pool1, filter=layer3_weights, strides=[1, 1, 1, 1], padding='SAME')
    layer3_out = tf.nn.relu(layer3_conv + layer3_bias)

    layer4_weights = tf.get_variable("layer4_weights", [3, 3, 128, 128], initializer=tf.contrib.layers.variance_scaling_initializer())
    layer4_bias = tf.Variable(tf.zeros([128]))
    layer4_conv = tf.nn.conv2d(layer3_out, filter=layer4_weights, strides=[1, 1, 1, 1], padding='SAME')
    layer4_out = tf.nn.relu(layer4_conv + layer4_bias)

    pool2 = tf.nn.max_pool(layer4_out, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='VALID')

    # Block 3: three 256-filter conv layers, then max pooling.
    layer5_weights = tf.get_variable("layer5_weights", [3, 3, 128, 256], initializer=tf.contrib.layers.variance_scaling_initializer())
    layer5_bias = tf.Variable(tf.zeros([256]))
    layer5_conv = tf.nn.conv2d(pool2, filter=layer5_weights, strides=[1, 1, 1, 1], padding='SAME')
    layer5_out = tf.nn.relu(layer5_conv + layer5_bias)

    layer6_weights = tf.get_variable("layer6_weights", [3, 3, 256, 256], initializer=tf.contrib.layers.variance_scaling_initializer())
    layer6_bias = tf.Variable(tf.zeros([256]))
    layer6_conv = tf.nn.conv2d(layer5_out, filter=layer6_weights, strides=[1, 1, 1, 1], padding='SAME')
    layer6_out = tf.nn.relu(layer6_conv + layer6_bias)

    layer7_weights = tf.get_variable("layer7_weights", [3, 3, 256, 256], initializer=tf.contrib.layers.variance_scaling_initializer())
    layer7_bias = tf.Variable(tf.zeros([256]))
    layer7_conv = tf.nn.conv2d(layer6_out, filter=layer7_weights, strides=[1, 1, 1, 1], padding='SAME')
    layer7_out = tf.nn.relu(layer7_conv + layer7_bias)

    pool3 = tf.nn.max_pool(layer7_out, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='VALID')

    # Block 4: three 512-filter conv layers, then max pooling.
    layer8_weights = tf.get_variable("layer8_weights", [3, 3, 256, 512], initializer=tf.contrib.layers.variance_scaling_initializer())
    layer8_bias = tf.Variable(tf.zeros([512]))
    layer8_conv = tf.nn.conv2d(pool3, filter=layer8_weights, strides=[1, 1, 1, 1], padding='SAME')
    layer8_out = tf.nn.relu(layer8_conv + layer8_bias)

    layer9_weights = tf.get_variable("layer9_weights", [3, 3, 512, 512], initializer=tf.contrib.layers.variance_scaling_initializer())
    layer9_bias = tf.Variable(tf.zeros([512]))
    layer9_conv = tf.nn.conv2d(layer8_out, filter=layer9_weights, strides=[1, 1, 1, 1], padding='SAME')
    layer9_out = tf.nn.relu(layer9_conv + layer9_bias)

    layer10_weights = tf.get_variable("layer10_weights", [3, 3, 512, 512], initializer=tf.contrib.layers.variance_scaling_initializer())
    layer10_bias = tf.Variable(tf.zeros([512]))
    layer10_conv = tf.nn.conv2d(layer9_out, filter=layer10_weights, strides=[1, 1, 1, 1], padding='SAME')
    layer10_out = tf.nn.relu(layer10_conv + layer10_bias)

    pool4 = tf.nn.max_pool(layer10_out, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='VALID')

    # Flatten the final pooling output so it can feed the fully connected layers.
    shape = pool4.shape.as_list()
    newShape = shape[1] * shape[2] * shape[3]
    reshaped_pool4 = tf.reshape(pool4, [-1, newShape])

    fc1_weights = tf.get_variable("layer11_weights", [newShape, 4096], initializer=tf.contrib.layers.variance_scaling_initializer())
    fc1_bias = tf.Variable(tf.zeros([4096]))
    fc1_out = tf.nn.relu(tf.matmul(reshaped_pool4, fc1_weights) + fc1_bias)

    fc2_weights = tf.get_variable("layer12_weights", [4096, 10], initializer=tf.contrib.layers.xavier_initializer())
    fc2_bias = tf.Variable(tf.zeros([10]))
    logits = tf.matmul(fc1_out, fc2_weights) + fc2_bias

    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))

    learning_rate = 0.001
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)

    # Add a few nodes to calculate accuracy and optionally retrieve predictions.
    predictions = tf.nn.softmax(logits)
    correct_prediction = tf.equal(tf.argmax(labels, 1), tf.argmax(predictions, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    num_steps = 1000
    batch_size = 100

    for step in range(num_steps):
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        batch_images = train_images[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        feed_dict = {input: batch_images, labels: batch_labels}
        _, c, acc = session.run([optimizer, cost, accuracy], feed_dict=feed_dict)
        if step % 100 == 0:
            print("Cost: ", c)
            print("Accuracy: ", acc * 100.0, "%")

    # Test: run only cost and accuracy here; running the optimizer on
    # these batches would train the network on the test set.
    num_test_batches = int(len(test_images) / 100)
    total_accuracy = 0
    total_cost = 0
    for step in range(num_test_batches):
        offset = step * batch_size
        batch_images = test_images[offset:(offset + batch_size)]
        batch_labels = test_labels[offset:(offset + batch_size)]
        feed_dict = {input: batch_images, labels: batch_labels}
        c, acc = session.run([cost, accuracy], feed_dict=feed_dict)
        total_cost = total_cost + c
        total_accuracy = total_accuracy + acc

    print("Test Cost: ", total_cost / num_test_batches)
    print("Test accuracy: ", total_accuracy * 100.0 / num_test_batches, "%")
```