LTFN 5: Building a ConvNet

joshvarty TensorFlow February 12, 2018March 14, 2018 5 Minutes

Part of the series Learn TensorFlow Now

In the last post we looked at the building blocks of a convolutional neural net. The convolution operation works by sliding a filter along the input and taking the dot product at each location to generate an output volume.

The parameters we need to consider when building a convolutional layer are:

1. Padding – Should we pad the input with zeroes?
2. Stride – Should we move the filter more than one pixel at a time?
3. Input depth – Each convolutional filter must have a depth that matches the input depth.
4. Number of filters – We can stack multiple filters to increase the depth of the output.

With this knowledge we can construct our first convolutional neural network. We’ll start by creating a single convolutional layer that operates on a batch of input images of size 28x28x1.

	layer1_weights = tf.Variable(tf.random_normal([3, 3, 1, 64])) #3x3x1x64
	layer1_bias = tf.Variable(tf.zeros([64])) #64
	layer1_conv = tf.nn.conv2d(input, filter=layer1_weights, strides=[1,1,1,1], padding='SAME') #28x28x64
	layer1_out = tf.nn.relu(layer1_conv + layer1_bias) #28x28x64

view raw

ltfn_5_1.py

hosted with ❤ by GitHub

Visualization of layer1 with the corresponding dimensions marked.

We start by creating a 4-D Tensor for layer1_weights. This Tensor represents the weights of the various filters that will be used in our convolution and then trained via gradient descent. By default, TensorFlow uses the format [filter_height, filter_width, in_depth, out_depth] for convolutional filters. In this example, we’re defining 64 filters each of which has a height of 3, width of 3, and an input depth of 1.

Depth

It’s important to remember that in_depth must always match the depth of the input we’re convolving. If our images were RGB, we would have had to create filters with a depth of 3.

On the other hand, we can increase or decrease output depth simply by changing the value we specify for out_depth. This represents how many independent filters we’ll create and therefore the depth of the output. In our example, we’ve specified 64 filters and we can see layer1_conv has a corresponding depth of 64.

Stride

Stride represents how fast we move the filter along each dimension. By default, TensorFlow expects stride to be defined in terms of [batch_stride, height_stride, width_stride, depth_stride]. Typically, batch_stride and depth_stride are always 1 as we don’t want to skip over examples in a batch or entire slices of volume. In the above example, we’re using strides=[1,1,1,1] to specify that we’ll be moving the filters across the image one pixel at a time.

Padding

TensorFlow allows us to specify either SAME or VALID padding. VALID padding does not pad the image with zeroes. Specifying SAME pads the image with enough zeroes such that the output will have the same height and with dimensions as the input assuming we’re using a stride of 1. Most of the time we use SAME padding so as not to have the output shrink at each layer of our network. To dig into the specifics of how padding is calculated, see TensorFlow’s documentation on convolutions.

Bias

Finally, we have to remember to include a bias term for each filter. Since we’ve created 64 filters, we’ll have to create a bias term of size 64. We apply bias after performing the convolution operation, but before passing the result to our ReLU non-linearity.

Max Pooling

As the above shows, as the input flows through our network, intermediate representations (eg. layer1_out) keep the same width and height while increasing in depth. However, if we continue making deeper and deeper representations we’ll find that the number of operations we need to perform will explode. Each of the filters has to be dragged across as 28x28 input and take the dot-product. As our filters get deeper this results in larger and larger groups of multiplications and additions.

Periodically we would like to downsample and compress our intermediate representations to have smaller height and width dimensions. The most common way to do this is by using a max pooling operation.

Max pooling is relatively simple. We slide a window (also called a kernel) along the input and simply take the max value at each point. As with convolutions, we can control the size of the sliding window, the stride of the window and choose whether or not to pad the input with zeroes.

Below is a simple example demonstrating max pooling on an unpadded input of 4x4 with a kernel size of 2x2 and a stride of 2:

Max pooling is the most popular way to downsample, but it’s certainly not the only way. Alternatives include average-pooling, which takes the average value at each point or vanilla convolutions with stride of 2. For more on this approach see: The All Convolutional Net.

The most common form of max pooling uses a 2x2 kernel (ksize=[1,2,2,1]) and a stride of 2 in the width and height dimensions (stride=[1,2,2,1]).

Putting it all together

Finally we have all the pieces to build our first convolutional neural network. Below is a network with four convolutional layers and two max pooling layers (You can find the complete code at the end of this post).

	layer1_weights = tf.Variable(tf.random_normal([3, 3, 1, 64])) #3x3x1x64
	layer1_bias = tf.Variable(tf.zeros([64])) #64
	layer1_conv = tf.nn.conv2d(input, filter=layer1_weights, strides=[1,1,1,1], padding='SAME') #28x28x64
	layer1_out = tf.nn.relu(layer1_conv + layer1_bias) #28x28x64

	layer2_weights = tf.Variable(tf.random_normal([3, 3, 64, 64])) #3x3x64x64
	layer2_bias = tf.Variable(tf.zeros([64])) #64
	layer2_conv = tf.nn.conv2d(layer1_out, filter=layer2_weights, strides=[1,1,1,1], padding='SAME')#28x28x64
	layer2_out = tf.nn.relu(layer2_conv + layer2_bias) #28x28x64

	pool1 = tf.nn.max_pool(layer2_out, ksize=[1,2,2,1], strides=[1,2,2,1], padding='VALID') #14x14x64

	layer3_weights = tf.Variable(tf.random_normal([3, 3, 64, 128])) #3x3x64x128
	layer3_bias = tf.Variable(tf.zeros([128])) #128
	layer3_conv = tf.nn.conv2d(pool1, filter=layer3_weights, strides=[1,1,1,1], padding='SAME') #14x14x128
	layer3_out = tf.nn.relu(layer3_conv + layer3_bias) #14x14x128

	layer4_weights = tf.Variable(tf.random_normal([3, 3, 128, 128])) #3x3x128x128
	layer4_bias = tf.Variable(tf.zeros([128])) #128
	layer4_conv = tf.nn.conv2d(layer3_out, filter=layer4_weights, strides=[1,1,1,1], padding='SAME')#14x14x128
	layer4_out = tf.nn.relu(layer4_conv + layer4_bias) #14x14x128

	pool2 = tf.nn.max_pool(layer4_out, ksize=[1,2,2,1], strides=[1,2,2,1], padding='VALID') #7x7x128

	shape = pool2.shape.as_list()
	fc = shape[1] * shape[2] * shape[3] #7x7x256 = 6,272
	reshape = tf.reshape(pool2, [-1, fc])
	fully_connected_weights = tf.Variable(tf.random_normal([fc, 10])) #6,272×10
	fully_connected_bias = tf.Variable(tf.zeros([10])) #10
	logits = tf.matmul(reshape, fully_connected_weights) + fully_connected_bias #10

view raw

ltfn_5_2.py

hosted with ❤ by GitHub

Before diving into the code, let’s take a look at a visualization of our network from input through pool2 to get a sense of what’s going on:

Visualization of layers from input through pool2 (Click to enlarge).

There are a few things worth noticing here. First, notice that in_depth of each set of convolutional filters matches the depth of the previous layers. Also note that the depth of each intermediate layer is determined by the number of filters (out_depth) at each layer.

We should also notice that every pooling layer we’ve used is a 2x2 max pooling operation using a stride=[1,2,2,1]. Recall the default format for stride is [batch_stride, height_stride, width_stride, depth_stride]. This means that we slide through the height and width dimensions twice as fast as depth. This results in a shrinkage of height and width by a factor of 2. As data moves through our network, the representations become deeper with smaller width and height dimensions.

Finally, the last six lines are a little bit tricky. At the conclusion of our network we need to make predictions about which number we’re seeing. The way we do that is by adding a fully connected layer at the very end of our network. We reshape pool2 from a 7x7x128 3-D volume to a single vector with 6,272 values. Finally, we connect this vector to 10 output logits from which we can extract our predictions.

With everything in place, we can run our network and take a look at how well it performs:

Cost: 979579.0
Accuracy: 7.0000000298 %
Cost: 174063.0
Accuracy: 23.9999994636 %
Cost: 95255.1
Accuracy: 47.9999989271 %

...

Cost: 10001.9
Accuracy: 87.9999995232 %
Cost: 16117.2
Accuracy: 77.999997139 %
Test Cost: 15083.0833307
Test accuracy: 81.8799999356 %

Yikes. There are two things that jump out at me when I look at these numbers:

The cost seems very high despite achieving a reasonable result.
The test accuracy has decreased when compared to our fully-connected network which achieved an accuracy of ~89%

So are convolutional nets broken? Was all this effort for nothing? Not quite. Next time we’ll look at an underlying problem with how we’re choosing our initial random weight values and an improved strategy that should improve our results beyond that of our fully-connected network.

Complete Code

	import tensorflow as tf
	import numpy as np
	from tensorflow.examples.tutorials.mnist import input_data
	mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

	train_images = np.reshape(mnist.train.images, (-1, 28, 28, 1))
	train_labels = mnist.train.labels
	test_images = np.reshape(mnist.test.images, (-1, 28, 28, 1))
	test_labels = mnist.test.labels


	graph = tf.Graph()
	with graph.as_default():
	input = tf.placeholder(tf.float32, shape=(None, 28, 28, 1))
	labels = tf.placeholder(tf.float32, shape=(None, 10))

	layer1_weights = tf.Variable(tf.random_normal([3, 3, 1, 64]))
	layer1_bias = tf.Variable(tf.zeros([64]))
	layer1_conv = tf.nn.conv2d(input, filter=layer1_weights, strides=[1,1,1,1], padding='SAME')
	layer1_out = tf.nn.relu(layer1_conv + layer1_bias)

	layer2_weights = tf.Variable(tf.random_normal([3, 3, 64, 64]))
	layer2_bias = tf.Variable(tf.zeros([64]))
	layer2_conv = tf.nn.conv2d(layer1_out, filter=layer2_weights, strides=[1,1,1,1], padding='SAME')
	layer2_out = tf.nn.relu(layer2_conv + layer2_bias)

	pool1 = tf.nn.max_pool(layer2_out, ksize=[1,2,2,1], strides=[1,2,2,1], padding='VALID')

	layer3_weights = tf.Variable(tf.random_normal([3, 3, 64, 128]))
	layer3_bias = tf.Variable(tf.zeros([128]))
	layer3_conv = tf.nn.conv2d(pool1, filter=layer3_weights, strides=[1,1,1,1], padding='SAME')
	layer3_out = tf.nn.relu(layer3_conv + layer3_bias)

	layer4_weights = tf.Variable(tf.random_normal([3, 3, 128, 128]))
	layer4_bias = tf.Variable(tf.zeros([128]))
	layer4_conv = tf.nn.conv2d(layer3_out, filter=layer4_weights, strides=[1,1,1,1], padding='SAME')
	layer4_out = tf.nn.relu(layer4_conv + layer4_bias)

	pool2 = tf.nn.max_pool(layer4_out, ksize=[1,2,2,1], strides=[1,2,2,1], padding='VALID')

	shape = pool2.shape.as_list()
	fc = shape[1] * shape[2] * shape[3]
	reshape = tf.reshape(pool2, [-1, fc])
	fc_weights = tf.Variable(tf.random_normal([fc, 10]))
	fc_bias = tf.Variable(tf.zeros([10]))
	logits = tf.matmul(reshape, fc_weights) + fc_bias

	cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))

	learning_rate = 0.0000001
	optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

	#Add a few nodes to calculate accuracy and optionally retrieve predictions
	predictions = tf.nn.softmax(logits)
	correct_prediction = tf.equal(tf.argmax(labels, 1), tf.argmax(predictions, 1))
	accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

	with tf.Session(graph=graph) as session:
	tf.global_variables_initializer().run()

	num_steps = 5000
	batch_size = 100
	for step in range(num_steps):
	offset = (step * batch_size) % (train_labels.shape[0] – batch_size)
	batch_images = train_images[offset:(offset + batch_size), :]
	batch_labels = train_labels[offset:(offset + batch_size), :]
	feed_dict = {input: batch_images, labels: batch_labels}

	_, c, acc = session.run([optimizer, cost, accuracy], feed_dict=feed_dict)

	if step % 100 == 0:
	print("Cost: ", c)
	print("Accuracy: ", acc * 100.0, "%")


	#Test
	num_test_batches = int(len(test_images) / 100)
	total_accuracy = 0
	total_cost = 0
	for step in range(num_test_batches):
	offset = (step * batch_size) % (train_labels.shape[0] – batch_size)
	batch_images = test_images[offset:(offset + batch_size)]
	batch_labels = test_labels[offset:(offset + batch_size)]
	feed_dict = {input: batch_images, labels: batch_labels}

	c, acc = session.run([cost, accuracy], feed_dict=feed_dict)
	total_cost = total_cost + c
	total_accuracy = total_accuracy + acc

	print("Test Cost: ", total_cost / num_test_batches)
	print("Test accuracy: ", total_accuracy * 100.0 / num_test_batches, "%")

view raw

ltfn_5_full.py

hosted with ❤ by GitHub

Published by joshvarty

View all posts by joshvarty

Published February 12, 2018March 14, 2018

2 thoughts on “LTFN 5: Building a ConvNet”

Dinesh Kumar says:

February 15, 2018 at 9:47 am

Good simple to follow CNN for a beginner like me. Waiting for next post. Awesome work 🙂

Reply
1. joshvarty says:
  
  February 15, 2018 at 8:57 pm
  
  Thanks for the kind words! I’m working on the next post and hope to have it up in the next couple of days.
  
  Reply

Leave a comment Cancel reply