Part of the series Learn TensorFlow Now
Now that we’ve got a handle on convolutions, max pooling and weight initialization, the obvious question is: What’s next? How should we set up our network to achieve maximum accuracy on image recognition tasks? For years this has been a focus of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) competitions. Since 2010, researchers have pitted various architectures against one another in an attempt to categorize millions of images into 1,000 categories. When tackling any image recognition task, it’s usually a good idea to pick one of the top-performing architectures instead of trying to craft your own from scratch.
VGGNet
VGGNet is a nice starting point, as it’s simply a deeper version of the network we’ve been building. Its debut in the 2014 ILSVRC competition was novel due to its exclusive use of 3x3 convolutional filters. Previous architectures had used a variety of filter sizes, including 11x11, 7x7 and 5x5. Each of these filter sizes was a hyperparameter that had to be tuned, so it was a relief to see high performance from a single, consistently small filter size.
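To see why small filters are appealing, note that two stacked 3x3 convolutions cover the same 5x5 receptive field as a single 5x5 convolution, while using fewer weights and adding an extra non-linearity between them. A quick back-of-the-envelope comparison, assuming the same number of input and output channels at each layer:

```python
# Weights (excluding biases) in a conv layer with k x k filters
# mapping `channels` input channels to `channels` output channels.
def conv_weights(k, channels):
    return channels * k * k * channels

channels = 64
one_5x5 = conv_weights(5, channels)      # 25 * 64 * 64 = 102,400 weights
two_3x3 = 2 * conv_weights(3, channels)  # 18 * 64 * 64 =  73,728 weights
print(one_5x5, two_3x3)  # same 5x5 receptive field, ~28% fewer weights
```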
As with our previous network, VGG operates by staggering max-pooling layers between groups of convolutional layers. Below is a table listing the 16 layers of VGG alongside the intermediate shape at each layer of the network and the number of trainable parameters (i.e. weights, excluding biases) at each layer.
Original VGGNet
| Layer | Intermediate Shape | Parameters |
| --- | --- | --- |
| Input | 224 x 224 x 3 | |
| 64 3×3 Conv Filters | 224 x 224 x 64 | 64 * 3 * 3 * 3 = 1,728 |
| 64 3×3 Conv Filters | 224 x 224 x 64 | 64 * 3 * 3 * 64 = 36,864 |
| maxpool 2×2 | 112 x 112 x 64 | |
| 128 3×3 Conv Filters | 112 x 112 x 128 | 128 * 3 * 3 * 64 = 73,728 |
| 128 3×3 Conv Filters | 112 x 112 x 128 | 128 * 3 * 3 * 128 = 147,456 |
| maxpool 2×2 | 56 x 56 x 128 | |
| 256 3×3 Conv Filters | 56 x 56 x 256 | 256 * 3 * 3 * 128 = 294,912 |
| 256 3×3 Conv Filters | 56 x 56 x 256 | 256 * 3 * 3 * 256 = 589,824 |
| 256 3×3 Conv Filters | 56 x 56 x 256 | 256 * 3 * 3 * 256 = 589,824 |
| maxpool 2×2 | 28 x 28 x 256 | |
| 512 3×3 Conv Filters | 28 x 28 x 512 | 512 * 3 * 3 * 256 = 1,179,648 |
| 512 3×3 Conv Filters | 28 x 28 x 512 | 512 * 3 * 3 * 512 = 2,359,296 |
| 512 3×3 Conv Filters | 28 x 28 x 512 | 512 * 3 * 3 * 512 = 2,359,296 |
| maxpool 2×2 | 14 x 14 x 512 | |
| 512 3×3 Conv Filters | 14 x 14 x 512 | 512 * 3 * 3 * 512 = 2,359,296 |
| 512 3×3 Conv Filters | 14 x 14 x 512 | 512 * 3 * 3 * 512 = 2,359,296 |
| 512 3×3 Conv Filters | 14 x 14 x 512 | 512 * 3 * 3 * 512 = 2,359,296 |
| maxpool 2×2 | 7 x 7 x 512 | |
| FC 4096 | 1 x 1 x 4096 | 7 * 7 * 512 * 4096 = 102,760,448 |
| FC 4096 | 1 x 1 x 4096 | 4096 * 4096 = 16,777,216 |
| FC 1000 | 1 x 1 x 1000 | 4096 * 1000 = 4,096,000 |
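Every parameter count in the table comes from the same formula: output channels * filter height * filter width * input channels. A quick sanity check against a few rows of the table:

```python
# Weights (excluding biases) in a convolutional layer.
def conv_params(out_channels, filter_size, in_channels):
    return out_channels * filter_size * filter_size * in_channels

print(conv_params(64, 3, 3))     # first layer:    1,728
print(conv_params(64, 3, 64))    # second layer:   36,864
print(conv_params(512, 3, 512))  # deepest layers: 2,359,296
```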
A few things to note about the VGG architecture:

- It was originally built for images of size 224x224x3 and 1,000 output classes.
- The number of parameters grows rapidly as we move through the network, and the fully connected layers account for the vast majority of the weights (summed in the breakdown after this list).
- There are so many trainable parameters that we can only reasonably run such a network on a computer with a GPU.
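Summing the table’s per-layer counts makes those last two points concrete: the thirteen convolutional layers contribute roughly 14.7 million weights, while the three fully connected layers contribute roughly 123.6 million, for a total of about 138 million trainable weights:

```python
# Per-layer weight counts taken directly from the table above.
conv_params = [1728, 36864, 73728, 147456, 294912, 589824, 589824,
               1179648, 2359296, 2359296, 2359296, 2359296, 2359296]
fc_params = [102760448, 16777216, 4096000]

print(sum(conv_params))                   # 14,710,464 convolutional weights
print(sum(fc_params))                     # 123,633,664 fully connected weights
print(sum(conv_params) + sum(fc_params))  # 138,344,128 weights in total
```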
There are a couple of modifications we’ll make to the VGG network in order to use it on our MNIST digits of shape 28x28x1. Notice that after each max_pooling layer we halve the width and height dimensions. Unfortunately, our images just aren’t big enough to go through so many max_pooling layers. For this reason, we’ll omit the final max_pooling layer and the final three 512 3x3 convolutional layers. We’ll also pad our 28x28 images to be of size 32x32 so the widths and heights divide by two cleanly at each pooling step.
Modified VGGNet
| Layer | Intermediate Shape | Parameters |
| --- | --- | --- |
| Input | 28 x 28 x 1 | |
| Pad Image | 32 x 32 x 1 | |
| 64 3×3 Conv Filters | 32 x 32 x 64 | 64 * 3 * 3 * 1 = 576 |
| 64 3×3 Conv Filters | 32 x 32 x 64 | 64 * 3 * 3 * 64 = 36,864 |
| maxpool 2×2 | 16 x 16 x 64 | |
| 128 3×3 Conv Filters | 16 x 16 x 128 | 128 * 3 * 3 * 64 = 73,728 |
| 128 3×3 Conv Filters | 16 x 16 x 128 | 128 * 3 * 3 * 128 = 147,456 |
| maxpool 2×2 | 8 x 8 x 128 | |
| 256 3×3 Conv Filters | 8 x 8 x 256 | 256 * 3 * 3 * 128 = 294,912 |
| 256 3×3 Conv Filters | 8 x 8 x 256 | 256 * 3 * 3 * 256 = 589,824 |
| 256 3×3 Conv Filters | 8 x 8 x 256 | 256 * 3 * 3 * 256 = 589,824 |
| maxpool 2×2 | 4 x 4 x 256 | |
| 512 3×3 Conv Filters | 4 x 4 x 512 | 512 * 3 * 3 * 256 = 1,179,648 |
| 512 3×3 Conv Filters | 4 x 4 x 512 | 512 * 3 * 3 * 512 = 2,359,296 |
| 512 3×3 Conv Filters | 4 x 4 x 512 | 512 * 3 * 3 * 512 = 2,359,296 |
| maxpool 2×2 | 2 x 2 x 512 | |
| FC 4096 | 1 x 1 x 4096 | 2 * 2 * 512 * 4096 = 8,388,608 |
| FC 10 | 1 x 1 x 10 | 4096 * 10 = 40,960 |
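For comparison with the original network, summing this table gives about 7.6 million convolutional weights and about 8.4 million fully connected weights, roughly 16 million in total, less than an eighth of the original VGGNet’s parameter count:

```python
# Per-layer weight counts from the modified network's table.
conv_params = [576, 36864, 73728, 147456, 294912, 589824, 589824,
               1179648, 2359296, 2359296]
fc_params = [8388608, 40960]
print(sum(conv_params) + sum(fc_params))  # 16,060,992 weights in total
```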
In previous posts we’ve encountered fully connected layers, convolutional layers and max-pooling operations. The only portion of this network we haven’t seen before is the initial padding step. TensorFlow makes this easy to accomplish via tf.image.resize_image_with_crop_or_pad:
```python
input = tf.placeholder(tf.float32, shape=(None, 28, 28, 1))  # 28x28x1
padded_input = tf.image.resize_image_with_crop_or_pad(input, target_height=32, target_width=32)  # 32x32x1
```
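If you’d rather be explicit about where the zeros are added, the same centered 28x28 to 32x32 zero-padding can also be written with tf.pad. A minimal sketch of the equivalent operation:

```python
# Pad 2 rows/columns of zeros on each side of the height and width
# dimensions, leaving the batch and channel dimensions untouched.
padded_input = tf.pad(input, paddings=[[0, 0], [2, 2], [2, 2], [0, 0]])  # 32x32x1
```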
We’ll also make use of the tf.train.AdamOptimizer discussed in the previous post:
```python
learning_rate = 0.001
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
```
With these two changes, we can create our modified version of VGGNet, presented in full at the end of this post.
Running our network gives us the following output:
```
Cost: 3.19188 Accuracy: 10.9999999404 %
Cost: 0.140771 Accuracy: 94.9999988079 %
Cost: 0.120058 Accuracy: 95.9999978542 %
Cost: 0.128447 Accuracy: 97.000002861 %
Cost: 0.0849798 Accuracy: 95.9999978542 %
Cost: 0.0180758 Accuracy: 99.0000009537 %
Cost: 0.0622907 Accuracy: 99.0000009537 %
Cost: 0.147945 Accuracy: 95.9999978542 %
Cost: 0.0502743 Accuracy: 99.0000009537 %
Cost: 0.149534 Accuracy: 99.0000009537 %
Test Cost: 0.0713789960416
Test accuracy: 97.8600007892 %
```
Running this network gives us a test accuracy of ~97.9% compared to our previous best of 97.3%. This is an improvement, but we’re starting to see fairly marginal improvements. In fact, I wouldn’t necessarily be convinced that our VGG network truly outperforms our previous best without running each network multiple times and comparing the average accuracies achieved. There’s a very real possibility that our small improvement may have just been due to chance. We won’t run this comparison here, but it’s something to consider when you’re starting to see very marginal improvements in your own networks.
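If you do want to check whether a small gap like this is real, one lightweight approach is to train each network several times with different random seeds and compare the mean test accuracies. A sketch, assuming a hypothetical train_and_evaluate() helper (not defined in this post) that builds the graph, runs the training loop and returns the final test accuracy:

```python
import numpy as np

# train_and_evaluate is a hypothetical helper wrapping the graph
# construction and training loop shown below.
accuracies = [train_and_evaluate(seed=s) for s in range(5)]
print("mean:", np.mean(accuracies), "std:", np.std(accuracies))
```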
Next week we’ll look at saving and restoring our model, and we’ll examine some of the images on which our network makes mistakes in order to build a better intuition for what might be going on.
Complete Code
```python
import tensorflow as tf
import numpy as np
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets('MNIST_data', one_hot=True)
train_images = np.reshape(mnist.train.images, (-1, 28, 28, 1))
train_labels = mnist.train.labels
test_images = np.reshape(mnist.test.images, (-1, 28, 28, 1))
test_labels = mnist.test.labels

graph = tf.Graph()
with graph.as_default():
    input = tf.placeholder(tf.float32, shape=(None, 28, 28, 1))
    labels = tf.placeholder(tf.float32, shape=(None, 10))

    # Pad the 28x28 images to 32x32 so the dimensions halve cleanly
    # through four rounds of 2x2 max pooling.
    padded_input = tf.image.resize_image_with_crop_or_pad(input, target_height=32, target_width=32)

    # Block 1: two 64-filter 3x3 conv layers, then 2x2 max pooling.
    layer1_weights = tf.get_variable("layer1_weights", [3, 3, 1, 64], initializer=tf.contrib.layers.variance_scaling_initializer())
    layer1_bias = tf.Variable(tf.zeros([64]))
    layer1_conv = tf.nn.conv2d(padded_input, filter=layer1_weights, strides=[1, 1, 1, 1], padding='SAME')
    layer1_out = tf.nn.relu(layer1_conv + layer1_bias)

    layer2_weights = tf.get_variable("layer2_weights", [3, 3, 64, 64], initializer=tf.contrib.layers.variance_scaling_initializer())
    layer2_bias = tf.Variable(tf.zeros([64]))
    layer2_conv = tf.nn.conv2d(layer1_out, filter=layer2_weights, strides=[1, 1, 1, 1], padding='SAME')
    layer2_out = tf.nn.relu(layer2_conv + layer2_bias)

    pool1 = tf.nn.max_pool(layer2_out, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='VALID')

    # Block 2: two 128-filter conv layers, then max pooling.
    layer3_weights = tf.get_variable("layer3_weights", [3, 3, 64, 128], initializer=tf.contrib.layers.variance_scaling_initializer())
    layer3_bias = tf.Variable(tf.zeros([128]))
    layer3_conv = tf.nn.conv2d(pool1, filter=layer3_weights, strides=[1, 1, 1, 1], padding='SAME')
    layer3_out = tf.nn.relu(layer3_conv + layer3_bias)

    layer4_weights = tf.get_variable("layer4_weights", [3, 3, 128, 128], initializer=tf.contrib.layers.variance_scaling_initializer())
    layer4_bias = tf.Variable(tf.zeros([128]))
    layer4_conv = tf.nn.conv2d(layer3_out, filter=layer4_weights, strides=[1, 1, 1, 1], padding='SAME')
    layer4_out = tf.nn.relu(layer4_conv + layer4_bias)

    pool2 = tf.nn.max_pool(layer4_out, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='VALID')

    # Block 3: three 256-filter conv layers, then max pooling.
    layer5_weights = tf.get_variable("layer5_weights", [3, 3, 128, 256], initializer=tf.contrib.layers.variance_scaling_initializer())
    layer5_bias = tf.Variable(tf.zeros([256]))
    layer5_conv = tf.nn.conv2d(pool2, filter=layer5_weights, strides=[1, 1, 1, 1], padding='SAME')
    layer5_out = tf.nn.relu(layer5_conv + layer5_bias)

    layer6_weights = tf.get_variable("layer6_weights", [3, 3, 256, 256], initializer=tf.contrib.layers.variance_scaling_initializer())
    layer6_bias = tf.Variable(tf.zeros([256]))
    layer6_conv = tf.nn.conv2d(layer5_out, filter=layer6_weights, strides=[1, 1, 1, 1], padding='SAME')
    layer6_out = tf.nn.relu(layer6_conv + layer6_bias)

    layer7_weights = tf.get_variable("layer7_weights", [3, 3, 256, 256], initializer=tf.contrib.layers.variance_scaling_initializer())
    layer7_bias = tf.Variable(tf.zeros([256]))
    layer7_conv = tf.nn.conv2d(layer6_out, filter=layer7_weights, strides=[1, 1, 1, 1], padding='SAME')
    layer7_out = tf.nn.relu(layer7_conv + layer7_bias)

    pool3 = tf.nn.max_pool(layer7_out, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='VALID')

    # Block 4: three 512-filter conv layers, then max pooling.
    layer8_weights = tf.get_variable("layer8_weights", [3, 3, 256, 512], initializer=tf.contrib.layers.variance_scaling_initializer())
    layer8_bias = tf.Variable(tf.zeros([512]))
    layer8_conv = tf.nn.conv2d(pool3, filter=layer8_weights, strides=[1, 1, 1, 1], padding='SAME')
    layer8_out = tf.nn.relu(layer8_conv + layer8_bias)

    layer9_weights = tf.get_variable("layer9_weights", [3, 3, 512, 512], initializer=tf.contrib.layers.variance_scaling_initializer())
    layer9_bias = tf.Variable(tf.zeros([512]))
    layer9_conv = tf.nn.conv2d(layer8_out, filter=layer9_weights, strides=[1, 1, 1, 1], padding='SAME')
    layer9_out = tf.nn.relu(layer9_conv + layer9_bias)

    layer10_weights = tf.get_variable("layer10_weights", [3, 3, 512, 512], initializer=tf.contrib.layers.variance_scaling_initializer())
    layer10_bias = tf.Variable(tf.zeros([512]))
    layer10_conv = tf.nn.conv2d(layer9_out, filter=layer10_weights, strides=[1, 1, 1, 1], padding='SAME')
    layer10_out = tf.nn.relu(layer10_conv + layer10_bias)

    pool4 = tf.nn.max_pool(layer10_out, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='VALID')

    # Flatten the final pooling output so it can feed the fully connected layers.
    shape = pool4.shape.as_list()
    newShape = shape[1] * shape[2] * shape[3]
    reshaped_pool4 = tf.reshape(pool4, [-1, newShape])

    fc1_weights = tf.get_variable("layer11_weights", [newShape, 4096], initializer=tf.contrib.layers.variance_scaling_initializer())
    fc1_bias = tf.Variable(tf.zeros([4096]))
    fc1_out = tf.nn.relu(tf.matmul(reshaped_pool4, fc1_weights) + fc1_bias)

    fc2_weights = tf.get_variable("layer12_weights", [4096, 10], initializer=tf.contrib.layers.xavier_initializer())
    fc2_bias = tf.Variable(tf.zeros([10]))
    logits = tf.matmul(fc1_out, fc2_weights) + fc2_bias

    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))

    learning_rate = 0.001
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)

    # Add a few nodes to calculate accuracy and optionally retrieve predictions.
    predictions = tf.nn.softmax(logits)
    correct_prediction = tf.equal(tf.argmax(labels, 1), tf.argmax(predictions, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    num_steps = 1000
    batch_size = 100

    for step in range(num_steps):
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        batch_images = train_images[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        feed_dict = {input: batch_images, labels: batch_labels}
        _, c, acc = session.run([optimizer, cost, accuracy], feed_dict=feed_dict)
        if step % 100 == 0:
            print("Cost: ", c)
            print("Accuracy: ", acc * 100.0, "%")

    # Test: run only cost and accuracy here; running the optimizer on
    # these batches would train the network on the test set.
    num_test_batches = int(len(test_images) / 100)
    total_accuracy = 0
    total_cost = 0
    for step in range(num_test_batches):
        offset = step * batch_size
        batch_images = test_images[offset:(offset + batch_size)]
        batch_labels = test_labels[offset:(offset + batch_size)]
        feed_dict = {input: batch_images, labels: batch_labels}
        c, acc = session.run([cost, accuracy], feed_dict=feed_dict)
        total_cost = total_cost + c
        total_accuracy = total_accuracy + acc

    print("Test Cost: ", total_cost / num_test_batches)
    print("Test accuracy: ", total_accuracy * 100.0 / num_test_batches, "%")
```