LTFN 12: Bias and Variance

Part of the series Learn TensorFlow Now

In the last few posts we noticed a strange phenomenon: our test accuracy was roughly 11 percentage points worse than what we were getting on our training set. Let’s review the results from our last network:

Cost: 131.964
Accuracy: 11.9999997318 %
...
Cost: 0.47334
Accuracy: 83.9999973774 %
Test Cost: 1.04789093912
Test accuracy: 72.5600001812 %

Our neural network is getting ~84% accuracy on the training set but only ~73% on the test set. What’s going on and how do we fix it?

Bias and Variance

Two primary sources of error in any machine learning algorithm come from either underfitting or overfitting your training data. Underfitting occurs when an algorithm is unable to model the underlying trend of the data. Overfitting occurs when the algorithm essentially memorizes the training set but is unable to generalize and performs poorly on the test set.

Bias is error introduced by underfitting a dataset. It is characterized by poor performance on both the training set and the test set.

Variance is error introduced by overfitting a dataset. It is characterized by good performance on the training set but poor performance on the test set.

We can look at bias and variance visually by comparing the performance of our network on the training set and test set. Recall our training accuracy of 84% and test accuracy of 73%:

Visualization of bias and variance from our previous network’s results

The above image roughly demonstrates which portions of our error can be attributed to bias and variance. This visualization assumes that we could theoretically achieve 100% accuracy. In practice this may not always be the case as other sources of error (eg. noise or mislabelled examples) may creep into our dataset. As an aside, the lowest theoretical error rate on a given problem is called the Bayes Error Rate.
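
The split pictured above can be reproduced with a few lines of arithmetic. This is only a rough sketch and not part of our network code: the function name split_error and its best_possible parameter are made up for illustration, and it assumes (as the visualization does) that 100% accuracy is achievable.

```python
def split_error(train_accuracy, test_accuracy, best_possible=100.0):
    """Roughly split our error (in percentage points) into bias and variance."""
    bias = best_possible - train_accuracy      # error we already make on the training set
    variance = train_accuracy - test_accuracy  # extra error that only shows up on the test set
    return bias, variance

# Accuracies taken from the output of our last network
bias, variance = split_error(train_accuracy=84.0, test_accuracy=72.56)
print("Bias: %.2f points, variance: %.2f points" % (bias, variance))
```

If the Bayes Error Rate for a problem is known to be above zero, lowering best_possible accordingly gives a fairer estimate of the bias.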

Reducing Error

Ideally we would have a high performance on both the test set and training set which would represent low bias and low variance. So what steps can we take to reduce each of these sources of error?

Reducing Bias

Create a larger neural network. Recall that high bias is a sign that our neural network is unable to properly capture the underlying trend in our dataset. In general the deeper a network, the more complex the functions it can represent.
Train it for a very long time. One sanity check for any neural network is to see whether or not it can memorize the dataset. A sufficiently deep neural network should be able to memorize your dataset given enough training time. Although this won’t fix any problems with variance it can be an assurance that your network isn’t completely broken in some way.
Use a different architecture. Sometimes your chosen architecture may simply be unable to perform well on a given task. It may be worth considering other architectures to see if they perform better. A good place to start with Image Recognition tasks is to try different architectures submitted to previous ImageNet competitions.

Reducing Variance

Get more data. One nice property of neural networks is that they typically generalize better and better as you feed them more data. If your model is having problems handling out-of-sample data one obvious solution is to feed it more data.
Augment your existing data. While “Get more data” is a simple solution, it’s often not easy in practice. It can take months to curate, clean and verify a large dataset. One workaround is to artificially generate “new” data by augmenting your existing data. For image recognition tasks this might include flipping or rotating existing images, tweaking color settings or taking random crops of images. This is a topic we’ll explore in greater depth in future posts.
Regularization. High variance with low bias suggests our network has memorized the training set. Regularization describes a class of modifications we can make to our neural network that either penalize memorization (e.g. L2 regularization) or promote redundant paths of learning in our network (e.g. dropout). We will dive deeper into various regularization approaches in future posts.
Use a different architecture. Like reducing bias, sometimes you get the most bang-for-your-buck when you switch architectures altogether. As the deep learning field grows, people are frequently discovering better architectures for certain tasks. Some recent papers have even suggested that the structure of a neural network is more important than any learned weights for that structure.
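
As a small taste of what augmentation might look like, here is a hedged NumPy sketch that randomly flips an image horizontally and then takes a random crop after zero-padding it. The function name random_flip_and_crop, the pad value of 4 and the stand-in image are all assumptions for illustration; this isn’t code from the series.

```python
import numpy as np

def random_flip_and_crop(image, rng, pad=4):
    """Return a randomly flipped, randomly cropped copy of a height x width x channels image."""
    height, width, _ = image.shape
    if rng.random() < 0.5:
        image = image[:, ::-1, :]  # mirror the image left-to-right
    # Zero-pad the borders so a crop of the original size can be shifted around
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    top = rng.integers(0, 2 * pad + 1)   # upper bound is exclusive
    left = rng.integers(0, 2 * pad + 1)
    return padded[top:top + height, left:left + width, :]

rng = np.random.default_rng(0)
image = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)  # stand-in for a real image
augmented = random_flip_and_crop(image, rng)
print(augmented.shape)  # (32, 32, 3)
```

Because the output has the same shape as the input, augmented images can be fed to the network without any other changes.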

There’s a lot to unpack here and we’ve glossed over many of the solutions to the problems of bias and variance. In the next few posts we’re going to revisit some of these ideas and explore different areas of the TensorFlow API that allow us to tackle these problems.

LTFN 11: Image Pre-processing

Part of the series Learn TensorFlow Now

In previous posts, we simply passed raw images to our neural network. Other forms of machine learning pre-process input in various ways, so it seems reasonable to look at these approaches and see if they would work when applied to a neural network for image recognition.

Zero Centered Mean

One characteristic we desire from any learning algorithm is for it to generalize across different input distributions. For example, let’s imagine we design an algorithm for predicting whether or not the price of a house is “High” or “Low”. As input it takes:

Number of Rooms
Price of House

Below is some made-up data for the city of Boston. I’ve marked “High” in red, “Low” in blue and a reasonable decision boundary that our algorithm might learn in black. Our decision boundary correctly classifies all examples of “High” and “Low”.

Classification of house prices in Boston

What happens when we take this model and apply it to houses in New York where houses are much more expensive? Below we can see that the model does not generalize and incorrectly classifies many “Low” house prices as “High”.

Classification of house prices in New York

In order to fix this, we want to take all of our data and zero-center it. To do this, we subtract the mean of each feature from each data-point. For our examples this would look something like:

Zero centering the mean for Boston housing data

Zero centering the mean for New York housing data

Notice that we zero-center the mean for both the “Price” feature as well as the “Number of Rooms” feature. In general we don’t know which features might cause problems and which ones will not, so it’s easier just to zero-center them all.

Now that our data has a zero-centered mean, we can see how it would be easier to draw a single decision boundary that would accurately classify points from both Boston and New York. Zero centering our mean is one technique for handling data that comes from different distributions.
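
To make the idea concrete, here is a minimal NumPy sketch of zero-centering each city’s data separately. The numbers are made up purely for illustration (rooms, price in thousands) and zero_center is a hypothetical helper name, not something from a library.

```python
import numpy as np

# Made-up rows of (number of rooms, price in thousands), purely for illustration
boston = np.array([[2.0, 150.0], [3.0, 220.0], [5.0, 400.0]])
new_york = np.array([[2.0, 650.0], [3.0, 900.0], [5.0, 1600.0]])

def zero_center(features):
    """Subtract each column's mean so that every feature has a mean of zero."""
    return features - features.mean(axis=0)

boston_centered = zero_center(boston)
new_york_centered = zero_center(new_york)

# Both cities now have a mean of (approximately) zero for every feature
print(boston_centered.mean(axis=0))
print(new_york_centered.mean(axis=0))
```

After centering, an expensive house in either city shows up as a positive price value and a cheap one as a negative value, which is what lets a single decision boundary serve both.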

Changing Distributions in Images

It’s easy to see how the distribution of housing prices changes in different cities, but what would changes in distribution look like when we’re talking about images? Let’s imagine that we’re building an image classifier to distinguish between pictures of cats and pictures of dogs. Below is some sample data:

Training Data

Training images for our Cat vs. Dog classifier

Test Data

In the above classification task our cat images are coming from different distributions in our training and test sets. Our training set seems to contain exclusively black cats while our test set has a mix of colors. We would expect our classifier to fail on this task unless we take some time to fix our distribution problems. One way to fix this problem would be to fix our training set and ensure it contains many different colors of cats. Another approach we might take would be to zero-center the images, as we did with our housing prices.

Zero Centering Images

Now that we understand zero-centered means, how can we use this to improve our neural network? Recall that each pixel in an image is a feature, analogous to “Price” or “Number of Rooms” in our housing example. Therefore, we have to calculate the mean value for each pixel across the entire dataset. This gives us a 32x32x3 “mean image” which we can then subtract from every image we pass to our neural network.
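
Here is a hedged NumPy sketch of that calculation using random stand-in data instead of real CIFAR-10 images. Reusing the training-set mean for the test images is a common convention and an assumption on my part; I haven’t shown how our data loader actually computes its mean image.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random stand-ins for real images: 100 "training" and 20 "test" images of shape 32x32x3
train_images = rng.uniform(0, 255, size=(100, 32, 32, 3)).astype(np.float32)
test_images = rng.uniform(0, 255, size=(20, 32, 32, 3)).astype(np.float32)

# Average over the dataset axis only, leaving one mean value per pixel and channel
mean_image = train_images.mean(axis=0)
print(mean_image.shape)  # (32, 32, 3)

# Broadcasting subtracts the same mean image from every image in the batch
train_centered = train_images - mean_image
test_centered = test_images - mean_image
```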

You may have noticed that the mean_image was automatically created for us when we called cifar_data_loader.load_data():

	(train_images, train_labels, test_images, test_labels, mean_image) = cifar_data_loader.load_data()
	print(mean_image.shape) #32x32x3

The mean image for the CIFAR-10 dataset looks something like:

Now we simply need to subtract the mean image from the input images in our neural network:

	input_minus_mean = input - mean_image #Subtract mean from input images

	layer1_weights = tf.get_variable("layer1_weights", [3, 3, 3, 64], initializer=tf.contrib.layers.variance_scaling_initializer())
	layer1_bias = tf.Variable(tf.zeros([64]))
	layer1_conv = tf.nn.conv2d(input_minus_mean, filter=layer1_weights, strides=[1,1,1,1], padding='SAME') #Use input_minus_mean now
	layer1_out = tf.nn.relu(layer1_conv + layer1_bias)

After running our network we’re greeted with the following output:

Cost: 131.964
Accuracy: 11.9999997318 %
Cost: 1.91737
Accuracy: 23.9999994636 %
Cost: 1.7101
Accuracy: 33.0000013113 %

...

Cost: 0.494887
Accuracy: 86.0000014305 %
Cost: 0.47334
Accuracy: 83.9999973774 %
Test Cost: 1.04789093912
Test accuracy: 72.5600001812 %

A test accuracy of 72.6% is a marginal increase over our previous result of 70.9%, and it’s possible that the improvement is entirely due to chance. So why doesn’t zero-centering the mean help much? Recall that zero-centering leads to the biggest improvements when our data comes from different distributions. In the case of CIFAR-10, we have little reason to suspect that portions of our images come from obviously different distributions.

Despite seeing only marginal improvements, we’ll continue to subtract the mean image from our input images. It imposes only a very small performance penalty and safeguards us against problems with distributions we might not anticipate in future datasets.

LTFN 9: Saving and Restoring

Part of the series Learn TensorFlow Now

In the last post we looked at a modified version of VGGNet that achieved ~97.9% accuracy recognizing handwritten digits. Now that we’re relatively satisfied with our network, we’d like to save a trained version of the network that we can restore and use to classify digits whenever we’d like. We’ll do so by saving all of the tf.Variable objects we’ve created to a checkpoint (.ckpt) file.

Saving a Checkpoint

When we save our computational graph, we serialize both the graph itself and the values of all of our parameters. When serializing nodes in our graph, TensorFlow keeps track of their names in order for us to interact with them later. Nodes that we don’t name will receive default names and be very hard to pick out. (While preparing this post I forgot to name input and labels which received the names Placeholder and Placeholder_1 instead). For this reason, we’ll take a minute to ensure that we give names to input, labels, cost, accuracy and predictions.

	input = tf.placeholder(tf.float32, shape=(None, 28, 28, 1), name="input")
	labels = tf.placeholder(tf.float32, shape=(None, 10), name="labels")

	…

	cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels), name="cost")

	…

	predictions = tf.nn.softmax(logits, name="predictions")
	correct_prediction = tf.equal(tf.argmax(labels, 1), tf.argmax(predictions, 1))
	accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32), name="accuracy")

Saving a single checkpoint is straightforward. If we just want to save the state of our network after training then we simply add the following lines to the end of our previous network:

	saver = tf.train.Saver() #Create a saver
	save_path = saver.save(session, "/tmp/vggnet/vgg_net.ckpt") #Specify where to save the model
	print("Saved model at: ", save_path) #Confirm the saved location

This snippet of code first creates a tf.train.Saver, an object that coordinates both saving and restoration of models. Next we call saver.save(), passing in the current session. As a refresher, this session contains information about both the structure of the computational graph and the exact values of all parameters. By default the saver saves all tf.Variable objects (weight/bias parameters) in our graph, but it can also be told to save only a subset of them.

After saving the checkpoint, the saver returns the save_path. Why return the save_path if we just provided it with a path? The saver also allows you to shard the saved checkpoint by device (eg. using multiple GPUs to train a model). In this situation, the returned save_path is appended with information on the number of shards created.

After running this code, we can navigate to the folder /tmp/vggnet/ and run ls -tralh to look at the contents:

-rw-rw-r--  1 jovarty jovarty 184M Mar 12 19:57 vgg_net.ckpt.data-00000-of-00001
-rw-rw-r--  1 jovarty jovarty 2.7K Mar 12 19:57 vgg_net.ckpt.index
-rw-rw-r--  1 jovarty jovarty  105 Mar 12 19:57 checkpoint
-rw-rw-r--  1 jovarty jovarty 188K Mar 12 19:57 vgg_net.ckpt.meta

The first file vgg_net.ckpt.data-00000-of-00001 is 184 MB in size and contains the values of all of our parameters. This is a reasonably large size and one of the reasons it’s nice to use networks with smaller numbers of parameters. This model is larger than most of the apps on my phone so it could be difficult to deploy to mobile devices.

The vgg_net.ckpt.meta file contains information on the structure of our computational graph and the names of all of our nodes. Later we’ll use this file to rebuild our computational graph from scratch.

Saving Multiple Checkpoints

Some neural networks are trained over the course of multiple weeks and we would like a way to periodically take checkpoints as our network learns. This allows us to go back in time and hand tune hyperparameters such as learning rate to try to squeeze the best performance out of our network. Fortunately, TensorFlow makes it easy to take checkpoints at any point during training. For example, we can modify our training loop to simply save a checkpoint whenever we print accuracy and cost.

	saver = tf.train.Saver() #Create saver

	num_steps = 1000
	batch_size = 100
	for step in range(num_steps):
	    offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
	    batch_images = train_images[offset:(offset + batch_size), :]
	    batch_labels = train_labels[offset:(offset + batch_size), :]
	    feed_dict = {input: batch_images, labels: batch_labels}

	    _, c, acc = session.run([optimizer, cost, accuracy], feed_dict=feed_dict)

	    if step % 100 == 0:
	        print("Cost: ", c)
	        print("Accuracy: ", acc * 100.0, "%")
	        saver.save(session, "/tmp/vggnet/vgg_net.ckpt", global_step=step) #Save session every 100 mini-batches

The only real modification we’ve made here is to pass in global_step=step to track when each checkpoint was created. Be aware that this can eat up disk space relatively quickly depending on the size of your model. Each of our VGG checkpoints requires 184 MB of space.
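
To get a feel for how much disk space this uses, here is a small back-of-the-envelope sketch. One detail worth knowing is that tf.train.Saver only keeps the most recent checkpoints (its max_to_keep argument defaults to 5), so older ones are deleted as new ones are written. The helper retained_checkpoint_mb is a hypothetical name of my own and simply does the arithmetic.

```python
def retained_checkpoint_mb(num_steps, save_every, checkpoint_mb, max_to_keep=5):
    """Estimate disk space (in MB) used by the checkpoints left on disk after training."""
    num_saves = len(range(0, num_steps, save_every))  # saves at steps 0, save_every, 2 * save_every, ...
    if max_to_keep:  # None or 0 means "never delete old checkpoints"
        num_saves = min(num_saves, max_to_keep)
    return num_saves * checkpoint_mb

# 1000 steps, saving every 100 steps, 184 MB per checkpoint
print(retained_checkpoint_mb(1000, 100, 184))                    # 920
print(retained_checkpoint_mb(1000, 100, 184, max_to_keep=None))  # 1840
```

With the default settings our training loop leaves five checkpoints behind; asking the saver to keep all ten would roughly double that.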

Restoring a Model

Now that we know how to save our model’s parameters, how do we restore them? One way is to declare the original computational graph in Python and then restore the values to all the tf.Variable objects (parameters) using tf.train.Saver.

For example, we could remove the training and testing code from our previous network and replace it with the following:

	with tf.Session(graph=graph) as session: #Use the graph we declared above
	    #Restore Model
	    saver = tf.train.Saver() #Create a saver (object to save/restore sessions)
	    saver.restore(session, "/tmp/vggnet/vgg_net.ckpt") #Restore the session from a previously saved checkpoint

	    #Now we test our restored model exactly as before
	    batch_size = 100
	    num_test_batches = int(len(test_images) / batch_size)
	    total_accuracy = 0
	    total_cost = 0
	    for step in range(num_test_batches):
	        offset = step * batch_size #Walk through the test set one batch at a time
	        batch_images = test_images[offset:(offset + batch_size)]
	        batch_labels = test_labels[offset:(offset + batch_size)]
	        feed_dict = {input: batch_images, labels: batch_labels}

	        c, acc = session.run([cost, accuracy], feed_dict=feed_dict)
	        total_cost = total_cost + c
	        total_accuracy = total_accuracy + acc

	    print("Test Cost: ", total_cost / num_test_batches)
	    print("Test accuracy: ", total_accuracy * 100.0 / num_test_batches, "%")

There are really only two additions to the code here:

Create the tf.train.Saver()
Restore the model to the current session. Note: This portion requires the graph to have been defined with identical names and parameters as when they were saved to a checkpoint.

Other than these changes, we test the network exactly as we would have before. If we wanted to test our network on new examples, we could load them into test_images and retrieve predictions from our graph instead of cost and accuracy.

This approach works well for networks we’ve built ourselves but it can be very cumbersome when we want to run networks designed by someone else. It can take hours to manually recreate each parameter and operation exactly as the original author had.

Restoring a Model from Scratch

One approach to using someone else’s neural network is to load up the computational graph defined in the .meta file before restoring the values to this graph from the .ckpt file. Below is a self-contained example of restoring a model from scratch:

	import tensorflow as tf
	import numpy as np
	from tensorflow.examples.tutorials.mnist import input_data
	mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

	test_images = np.reshape(mnist.test.images, (-1, 28, 28, 1))
	test_labels = mnist.test.labels

	graph = tf.Graph()
	with tf.Session(graph=graph) as session:
	    saver = tf.train.import_meta_graph('/tmp/vggnet/vgg_net.ckpt.meta') #Create a saver based on a saved graph
	    saver.restore(session, '/tmp/vggnet/vgg_net.ckpt') #Restore the values to this graph

	    input = graph.get_tensor_by_name("input:0") #Get access to the input node
	    labels = graph.get_tensor_by_name("labels:0") #Get access to the labels node

	    batch_size = 100
	    num_test_batches = int(len(test_images) / batch_size)
	    total_accuracy = 0
	    total_cost = 0
	    for step in range(num_test_batches):
	        offset = step * batch_size #Walk through the test set one batch at a time
	        batch_images = test_images[offset:(offset + batch_size)]
	        batch_labels = test_labels[offset:(offset + batch_size)]
	        feed_dict = {input: batch_images, labels: batch_labels}

	        c, acc = session.run(['cost:0', 'accuracy:0'], feed_dict=feed_dict) #Note: We pass in strings 'cost:0' and 'accuracy:0'
	        total_cost = total_cost + c
	        total_accuracy = total_accuracy + acc

	    print("Test Cost: ", total_cost / num_test_batches)
	    print("Test accuracy: ", total_accuracy * 100.0 / num_test_batches, "%")

There are a few subtle changes worth pointing out. First, we create our tf.train.Saver indirectly by importing the computational graph with tf.train.import_meta_graph(). Next, we restore the values to our computational graph with saver.restore() exactly as we had done previously.

	saver = tf.train.import_meta_graph('/tmp/vggnet/vgg_net.ckpt.meta')
	saver.restore(session, '/tmp/vggnet/vgg_net.ckpt')

Since we don’t have access to the input and labels nodes, we have to recover them from our graph with graph.get_tensor_by_name(). Notice that we are passing in the names that we had previously specified and appending :0 to these names. Some TensorFlow operations produce multiple outputs. When this happens, TensorFlow names them :0, :1 and so on until all the outputs have a unique name. All of the operations we’re using have only one output so we simply stick with :0.

	input = graph.get_tensor_by_name("input:0")
	labels = graph.get_tensor_by_name("labels:0")
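
The naming convention itself is easy to illustrate without TensorFlow. The helper below is a hypothetical sketch of my own (it is not how TensorFlow parses names internally); it just splits a tensor name into the operation name we chose and the index of the output we want.

```python
def split_tensor_name(tensor_name):
    """Split 'op_name:index' into the operation name and its output index."""
    op_name, _, index = tensor_name.rpartition(":")
    if not op_name or not index.isdigit():
        raise ValueError("Expected a name like 'input:0', got %r" % tensor_name)
    return op_name, int(index)

print(split_tensor_name("input:0"))          # ('input', 0)
print(split_tensor_name("Placeholder_1:0"))  # ('Placeholder_1', 0)
```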

Finally, the last change involves actually running the network. As in the previous step, we need to specify proper names for cost and accuracy because we don’t have direct access to the computational nodes. Fortunately, it’s simple to just pass in strings with the names 'cost:0' and 'accuracy:0' that specify which operations we want to run and return the values of. Alternatively, we could have recovered the nodes with graph.get_tensor_by_name() and passed them in directly.

	c, acc = session.run(['cost:0', 'accuracy:0'], feed_dict=feed_dict)

Also note that if we had named our optimizer, we could have passed it into session.run() and continued to train our network. We could have even created a checkpoint of our saved network at this point if we decided it had improved in some way.

There are a variety of ways to save and restore models and we’ve really only scratched the surface. Below are a few self-contained examples of the various approaches we’ve looked at:

LTFN 8: Deeper ConvNets

Part of the series Learn TensorFlow Now

Now that we’ve got a handle on convolutions, max pooling and weight initialization the obvious question is: What’s next? How should we set up our network to achieve the maximum accuracy on image recognition tasks? For years this has been a focus of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) competitions. Since 2010 researchers have battled various architectures against one another in an attempt to categorize millions of images into 1,000 categories. When tackling any image recognition task it’s usually a good idea to pick one of the top performing architectures instead of trying to craft your own from scratch.

VGGNet

VGGNet is a nice starting point as it’s simply a deeper version of the network we’ve been building. Its debut in the 2014 ILSVRC competition was novel due to its exclusive use of 3x3 convolutional filters. Previous architectures had attempted to use a variety of filter sizes including 11x11, 7x7 and 5x5. Each of these filter sizes was a hyper-parameter that had to be tuned so it was a relief to see high performance with both a consistent and small filter size.

As with our previous network, VGG operates by staggering max-pooling layers between groups of convolutional layers. Below is a table listing the 16 layers of VGG alongside the intermediate shapes at each layer of the network and the number of trainable parameters (ie. weights, excluding biases) in the network.

Original VGGNet

Layer Shape	Intermediate Shape	Parameters
Input	224 x 224 x 3
64 3×3 Conv Filters	224 x 224 x 64	64 * 3 * 3 * 3 = 1,728
64 3×3 Conv Filters	224 x 224 x 64	64 * 3 * 3 * 64 = 36,864
maxpool 2×2	112 x 112 x 64
128 3×3 Conv Filters	112 x 112 x 128	128 * 3 * 3 * 64 = 73,728
128 3×3 Conv Filters	112 x 112 x 128	128 * 3 * 3 * 128 = 147,456
maxpool 2×2	56 x 56 x 128
256 3×3 Conv Filters	56 x 56 x 256	256 * 3 * 3 * 128 = 294,912
256 3×3 Conv Filters	56 x 56 x 256	256 * 3 * 3 * 256 = 589,824
256 3×3 Conv Filters	56 x 56 x 256	256 * 3 * 3 * 256 = 589,824
maxpool 2×2	28 x 28 x 256
512 3×3 Conv Filters	28 x 28 x 512	512 * 3 * 3 * 256 = 1,179,648
512 3×3 Conv Filters	28 x 28 x 512	512 * 3 * 3 * 512 = 2,359,296
512 3×3 Conv Filters	28 x 28 x 512	512 * 3 * 3 * 512 = 2,359,296
maxpool 2×2	14 x 14 x 512
512 3×3 Conv Filters	14 x 14 x 512	512 * 3 * 3 * 512 = 2,359,296
512 3×3 Conv Filters	14 x 14 x 512	512 * 3 * 3 * 512 = 2,359,296
512 3×3 Conv Filters	14 x 14 x 512	512 * 3 * 3 * 512 = 2,359,296
maxpool 2×2	7 x 7 x 512
FC 4096	1 x 1 x 4096	7 * 7 * 512 * 4096 = 102,760,448
FC 4096	1 x 1 x 4096	4096 * 4096 = 16,777,216
FC 1000	1 x 1 x 1000	4096 * 1000 = 4,096,000

A few things to note about the VGG architecture:

It was originally built for images of size 224x224x3 and 1,000 output classes.
The number of parameters per layer grows rapidly as we move through the network. The first fully connected layer alone accounts for over 100 million of them.
There are so many trainable parameters that we can only reasonably run such a network on a computer with a GPU.
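
The parameter counts in the table are easy to check ourselves. The sketch below recomputes them in plain Python; conv_params and fc_params are hypothetical helper names, and like the table they count weights only and ignore biases.

```python
def conv_params(num_filters, in_channels, kernel=3):
    """Weights in a convolutional layer: one kernel x kernel x in_channels filter per output channel."""
    return num_filters * kernel * kernel * in_channels

def fc_params(num_inputs, num_outputs):
    """Weights in a fully connected layer."""
    return num_inputs * num_outputs

# (number of filters, input channels) for each of the 13 convolutional layers
conv_layers = [(64, 3), (64, 64),
               (128, 64), (128, 128),
               (256, 128), (256, 256), (256, 256),
               (512, 256), (512, 512), (512, 512),
               (512, 512), (512, 512), (512, 512)]
# (inputs, outputs) for each of the 3 fully connected layers
fc_layers = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]

conv_total = sum(conv_params(filters, channels) for filters, channels in conv_layers)
fc_total = sum(fc_params(inputs, outputs) for inputs, outputs in fc_layers)

print("Conv weights: {:,}".format(conv_total))             # Conv weights: 14,710,464
print("FC weights: {:,}".format(fc_total))                 # FC weights: 123,633,664
print("Total weights: {:,}".format(conv_total + fc_total)) # Total weights: 138,344,128
```

Most of the weights live in the fully connected layers, even though almost all of the layers are convolutional.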

There are a couple of modifications we’ll make to the VGG network in order to use it on our MNIST digits of shape 28x28x1. Notice that after each max_pooling layer we halve the width and height dimensions. Unfortunately, our images just aren’t big enough to go through so many max_pooling layers. For this reason, we’ll omit the final max_pooling layer and the final three 512 3x3 convolutional layers. We’ll also pad our 28x28 images to be of size 32x32 so the widths and heights divide by two cleanly.
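
A quick sketch shows why padding to 32x32 helps. It assumes 2x2 max pooling with a stride of 2 and 'VALID' padding, under which an odd width or height simply loses its last row and column. The function name pooled_sizes is made up for this example.

```python
def pooled_sizes(size, num_pools):
    """Track the width (or height) of an image through successive 2x2, stride-2 max pools."""
    sizes = [size]
    for _ in range(num_pools):
        size = size // 2  # with 'VALID' padding any leftover row/column is dropped
        sizes.append(size)
    return sizes

print(pooled_sizes(28, 4))  # [28, 14, 7, 3, 1]
print(pooled_sizes(32, 4))  # [32, 16, 8, 4, 2]
```

Starting from 28 we hit an odd size after two pooling layers, while starting from 32 every pooling layer halves the image cleanly.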

Modified VGGNet

Layer Shape	Intermediate Shape	Parameters
Input	28 x 28 x 1
Pad Image	32 x 32 x 1
64 3×3 Conv Filters	32 x 32 x 64	64 * 3 * 3 * 1 = 576
64 3×3 Conv Filters	32 x 32 x 64	64 * 3 * 3 * 64 = 36,864
maxpool 2×2	16 x 16 x 64
128 3×3 Conv Filters	16 x 16 x 128	128 * 3 * 3 * 64 = 73,728
128 3×3 Conv Filters	16 x 16 x 128	128 * 3 * 3 * 128 = 147,456
maxpool 2×2	8 x 8 x 128
256 3×3 Conv Filters	8 x 8 x 256	256 * 3 * 3 * 128 = 294,912
256 3×3 Conv Filters	8 x 8 x 256	256 * 3 * 3 * 256 = 589,824
256 3×3 Conv Filters	8 x 8 x 256	256 * 3 * 3 * 256 = 589,824
maxpool 2×2	4 x 4 x 256
512 3×3 Conv Filters	4 x 4 x 512	512 * 3 * 3 * 256 = 1,179,648
512 3×3 Conv Filters	4 x 4 x 512	512 * 3 * 3 * 512 = 2,359,296
512 3×3 Conv Filters	4 x 4 x 512	512 * 3 * 3 * 512 = 2,359,296
maxpool 2×2	2 x 2 x 512
FC 4096	1 x 1 x 4096	2 * 2 * 512 * 4096 = 8,388,608
FC 10	1 x 1 x 10	4096 * 10 = 40,960

In previous posts we’ve encountered fully connected layers, convolutional layers and max pooling operations. The only portion of this network we’ve not seen before is the initial padding step. TensorFlow makes this easy to accomplish via tf.image.resize_image_with_crop_or_pad.

	input = tf.placeholder(tf.float32, shape=(None, 28, 28, 1)) #28x28x1
	padded_input = tf.image.resize_image_with_crop_or_pad(input, target_height=32, target_width=32) #32x32x1

We’ll also make use of the tf.train.AdamOptimizer discussed in the previous post:

	learning_rate = 0.001
	optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)

With these two changes, we can create our modified version of VGGNet, presented in full at the end of this post.

Running our network gives us the following output:

Cost: 3.19188
Accuracy: 10.9999999404 %
Cost: 0.140771
Accuracy: 94.9999988079 %
Cost: 0.120058
Accuracy: 95.9999978542 %
Cost: 0.128447
Accuracy: 97.000002861 %
Cost: 0.0849798
Accuracy: 95.9999978542 %
Cost: 0.0180758
Accuracy: 99.0000009537 %
Cost: 0.0622907
Accuracy: 99.0000009537 %
Cost: 0.147945
Accuracy: 95.9999978542 %
Cost: 0.0502743
Accuracy: 99.0000009537 %
Cost: 0.149534
Accuracy: 99.0000009537 %
Test Cost: 0.0713789960416
Test accuracy: 97.8600007892 %

Running this network gives us a test accuracy of ~97.9% compared to our previous best of 97.3%. This is an improvement, but we’re starting to see fairly marginal improvements. In fact, I wouldn’t necessarily be convinced that our VGG network truly outperforms our previous best without running each network multiple times and comparing the average accuracies achieved. There’s a very real possibility that our small improvement may have just been due to chance. We won’t run this comparison here, but it’s something to consider when you’re starting to see very marginal improvements in your own networks.
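
If we did want to run that comparison, a minimal sketch might look like the following. The accuracies below are placeholders, NOT real results from either network; the idea is to substitute the test accuracies from several of your own runs. summarize_runs is a hypothetical helper built on Python’s statistics module.

```python
from statistics import mean, stdev

def summarize_runs(accuracies):
    """Return the mean and sample standard deviation of several test accuracies."""
    return mean(accuracies), stdev(accuracies)

# Placeholder numbers, NOT real results: replace with accuracies from your own runs
previous_runs = [97.1, 97.4, 97.3]
vgg_runs = [97.6, 97.9, 97.5]

for name, runs in [("Previous network", previous_runs), ("VGG network", vgg_runs)]:
    average, spread = summarize_runs(runs)
    print("%s: %.2f%% +/- %.2f over %d runs" % (name, average, spread, len(runs)))
```

If the gap between the two averages is no bigger than the run-to-run spread, we have little evidence that one network is really better than the other.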

Next week we’ll take a look at saving and restoring our model and we’ll take a look at some of the images on which our network is making mistakes in order to build a better intuition for what might be going on.

Complete Code

	import tensorflow as tf
	import numpy as np
	from tensorflow.examples.tutorials.mnist import input_data
	mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

	train_images = np.reshape(mnist.train.images, (-1, 28, 28, 1))
	train_labels = mnist.train.labels
	test_images = np.reshape(mnist.test.images, (-1, 28, 28, 1))
	test_labels = mnist.test.labels

	graph = tf.Graph()
	with graph.as_default():
	    input = tf.placeholder(tf.float32, shape=(None, 28, 28, 1))
	    labels = tf.placeholder(tf.float32, shape=(None, 10))

	    padded_input = tf.image.resize_image_with_crop_or_pad(input, target_height=32, target_width=32)

	    layer1_weights = tf.get_variable("layer1_weights", [3, 3, 1, 64], initializer=tf.contrib.layers.variance_scaling_initializer())
	    layer1_bias = tf.Variable(tf.zeros([64]))
	    layer1_conv = tf.nn.conv2d(padded_input, filter=layer1_weights, strides=[1,1,1,1], padding='SAME')
	    layer1_out = tf.nn.relu(layer1_conv + layer1_bias)

	    layer2_weights = tf.get_variable("layer2_weights", [3, 3, 64, 64], initializer=tf.contrib.layers.variance_scaling_initializer())
	    layer2_bias = tf.Variable(tf.zeros([64]))
	    layer2_conv = tf.nn.conv2d(layer1_out, filter=layer2_weights, strides=[1,1,1,1], padding='SAME')
	    layer2_out = tf.nn.relu(layer2_conv + layer2_bias)

	    pool1 = tf.nn.max_pool(layer2_out, ksize=[1,2,2,1], strides=[1,2,2,1], padding='VALID')

	    layer3_weights = tf.get_variable("layer3_weights", [3, 3, 64, 128], initializer=tf.contrib.layers.variance_scaling_initializer())
	    layer3_bias = tf.Variable(tf.zeros([128]))
	    layer3_conv = tf.nn.conv2d(pool1, filter=layer3_weights, strides=[1,1,1,1], padding='SAME')
	    layer3_out = tf.nn.relu(layer3_conv + layer3_bias)

	    layer4_weights = tf.get_variable("layer4_weights", [3, 3, 128, 128], initializer=tf.contrib.layers.variance_scaling_initializer())
	    layer4_bias = tf.Variable(tf.zeros([128]))
	    layer4_conv = tf.nn.conv2d(layer3_out, filter=layer4_weights, strides=[1,1,1,1], padding='SAME')
	    layer4_out = tf.nn.relu(layer4_conv + layer4_bias)

	    pool2 = tf.nn.max_pool(layer4_out, ksize=[1,2,2,1], strides=[1,2,2,1], padding='VALID')

	    layer5_weights = tf.get_variable("layer5_weights", [3, 3, 128, 256], initializer=tf.contrib.layers.variance_scaling_initializer())
	    layer5_bias = tf.Variable(tf.zeros([256]))
	    layer5_conv = tf.nn.conv2d(pool2, filter=layer5_weights, strides=[1,1,1,1], padding='SAME')
	    layer5_out = tf.nn.relu(layer5_conv + layer5_bias)

	    layer6_weights = tf.get_variable("layer6_weights", [3, 3, 256, 256], initializer=tf.contrib.layers.variance_scaling_initializer())
	    layer6_bias = tf.Variable(tf.zeros([256]))
	    layer6_conv = tf.nn.conv2d(layer5_out, filter=layer6_weights, strides=[1,1,1,1], padding='SAME')
	    layer6_out = tf.nn.relu(layer6_conv + layer6_bias)

	    layer7_weights = tf.get_variable("layer7_weights", [3, 3, 256, 256], initializer=tf.contrib.layers.variance_scaling_initializer())
	    layer7_bias = tf.Variable(tf.zeros([256]))
	    layer7_conv = tf.nn.conv2d(layer6_out, filter=layer7_weights, strides=[1,1,1,1], padding='SAME')
	    layer7_out = tf.nn.relu(layer7_conv + layer7_bias)

	    pool3 = tf.nn.max_pool(layer7_out, ksize=[1,2,2,1], strides=[1,2,2,1], padding='VALID')

	    layer8_weights = tf.get_variable("layer8_weights", [3, 3, 256, 512], initializer=tf.contrib.layers.variance_scaling_initializer())
	    layer8_bias = tf.Variable(tf.zeros([512]))
	    layer8_conv = tf.nn.conv2d(pool3, filter=layer8_weights, strides=[1,1,1,1], padding='SAME')
	    layer8_out = tf.nn.relu(layer8_conv + layer8_bias)

	    layer9_weights = tf.get_variable("layer9_weights", [3, 3, 512, 512], initializer=tf.contrib.layers.variance_scaling_initializer())
	    layer9_bias = tf.Variable(tf.zeros([512]))
	    layer9_conv = tf.nn.conv2d(layer8_out, filter=layer9_weights, strides=[1,1,1,1], padding='SAME')
	    layer9_out = tf.nn.relu(layer9_conv + layer9_bias)

	    layer10_weights = tf.get_variable("layer10_weights", [3, 3, 512, 512], initializer=tf.contrib.layers.variance_scaling_initializer())
	    layer10_bias = tf.Variable(tf.zeros([512]))
	    layer10_conv = tf.nn.conv2d(layer9_out, filter=layer10_weights, strides=[1,1,1,1], padding='SAME')
	    layer10_out = tf.nn.relu(layer10_conv + layer10_bias)

	    pool4 = tf.nn.max_pool(layer10_out, ksize=[1,2,2,1], strides=[1,2,2,1], padding='VALID')

	    shape = pool4.shape.as_list()
	    newShape = shape[1] * shape[2] * shape[3]
	    reshaped_pool4 = tf.reshape(pool4, [-1, newShape])

	    fc1_weights = tf.get_variable("layer11_weights", [newShape, 4096], initializer=tf.contrib.layers.variance_scaling_initializer())
	    fc1_bias = tf.Variable(tf.zeros([4096]))
	    fc1_out = tf.nn.relu(tf.matmul(reshaped_pool4, fc1_weights) + fc1_bias)

	    fc2_weights = tf.get_variable("layer12_weights", [4096, 10], initializer=tf.contrib.layers.xavier_initializer())
	    fc2_bias = tf.Variable(tf.zeros([10]))
	    logits = tf.matmul(fc1_out, fc2_weights) + fc2_bias

	    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))

	    learning_rate = 0.001
	    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)

	    #Add a few nodes to calculate accuracy and optionally retrieve predictions
	    predictions = tf.nn.softmax(logits)
	    correct_prediction = tf.equal(tf.argmax(labels, 1), tf.argmax(predictions, 1))
	    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

	with tf.Session(graph=graph) as session:
		tf.global_variables_initializer().run()

		num_steps = 1000
		batch_size = 100
		for step in range(num_steps):
			offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
			batch_images = train_images[offset:(offset + batch_size), :]
			batch_labels = train_labels[offset:(offset + batch_size), :]
			feed_dict = {input: batch_images, labels: batch_labels}

			_, c, acc = session.run([optimizer, cost, accuracy], feed_dict=feed_dict)

			if step % 100 == 0:
				print("Cost: ", c)
				print("Accuracy: ", acc * 100.0, "%")

		#Test
		num_test_batches = len(test_images) // batch_size
		total_accuracy = 0
		total_cost = 0
		for step in range(num_test_batches):
			offset = step * batch_size
			batch_images = test_images[offset:(offset + batch_size)]
			batch_labels = test_labels[offset:(offset + batch_size)]
			feed_dict = {input: batch_images, labels: batch_labels}

			#Only evaluate cost and accuracy here; running the optimizer would train on the test set
			c, acc = session.run([cost, accuracy], feed_dict=feed_dict)
			total_cost = total_cost + c
			total_accuracy = total_accuracy + acc

		print("Test Cost: ", total_cost / num_test_batches)
		print("Test accuracy: ", total_accuracy * 100.0 / num_test_batches, "%")

LTFN 7: A Quick Look at TensorFlow Optimizers

Part of the series Learn TensorFlow Now

So far we’ve managed to avoid the mathematics of optimization and treated our optimizer as a “black box” that does its best to find good weights for our network. In our last post we saw that it doesn’t always succeed: We had three networks with identical structures but different initial weights and our optimizer failed to find good weights for two of them (when the initial weights were too large in magnitude and when they were too small in magnitude).

I’ve avoided the mathematics primarily because I believe one can become a machine learning practitioner (but probably not researcher) without a deep understanding of the mathematics underlying deep learning. We’ll continue that tradition and avoid the bulk of the mathematics behind the optimization algorithms. That said, I’ll provide links to resources where you can dive into these topics if you’re interested.

There are three optimization algorithms you should be aware of:

Stochastic Gradient Descent – The default optimizer we’ve been using so far
Momentum Update – An improved version of stochastic gradient descent
Adam Optimizer – Typically the best performing optimizer

Stochastic Gradient Descent

To keep things simple (and allow us to visualize what’s going on) let’s think about a network with just one weight. After we run our network on a batch of inputs we are given a cost. Our goal is to adjust the weight so as to minimize that cost. For example, the function could look something like the following (with our weight/cost highlighted):

A (completely made up) cost function, with a (completely made up) weight and corresponding `cost`

We can obviously look at this function and be confident that we want to increase weight_1. Ideally we’d just increase weight_1 to give us the cost at the bottom of the curve and be done after one step.

In reality, neither we nor the network have any idea of what the underlying function really looks like. We know three things:

The value of weight_1
The cost associated with our (one-weight) network
A rough estimate of how much we should increase or decrease weight_1 to get a smaller cost

(That third piece of information is where I’ve hidden most of the math and complexities of neural networks away. It’s the gradient of the network and it is computed for all weights of the network via back-propagation)

With these three things in mind, a better visualization might be:

It’s a lot harder to tell how far we should increase `weight_1` now, isn’t it?

So now we still know that we want to increase weight_1, but how much should we increase it? This is partially decided by learning_rate. Increasing learning_rate means that we adjust our weights by larger amounts.

The update step of stochastic gradient descent consists of:

Find out which direction we should adjust the weights
Adjust the weights by multiplying learning_rate by the gradient

	learning_rate = 0.01 #Some human-chosen learning rate
	gradient_for_weight_1 = … #Compute gradient
	weight_1 = weight_1 + (-gradient_for_weight_1 * learning_rate) #Technically, the gradient tells us how to INCREASE cost, so we go the opposite direction by negating it

We have been using this approach whenever we have been using tf.train.GradientDescentOptimizer.
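
To make the update step concrete, here is a minimal sketch that runs it on a made-up one-weight cost function. The function (weight - 3.0) ** 2, its hand-written gradient and the helper names are all assumptions chosen for illustration, and since there are no batches involved this is strictly plain gradient descent rather than the stochastic variant.

```python
def cost(weight):
    # Made-up cost function with its minimum at weight = 3.0
    return (weight - 3.0) ** 2

def gradient(weight):
    # Derivative of the made-up cost with respect to the weight
    return 2.0 * (weight - 3.0)

def gradient_descent(weight, learning_rate, num_steps):
    for _ in range(num_steps):
        # Same update as above: step in the opposite direction of the gradient
        weight = weight + (-gradient(weight) * learning_rate)
    return weight

small_lr = gradient_descent(weight=0.0, learning_rate=0.01, num_steps=100)
large_lr = gradient_descent(weight=0.0, learning_rate=0.1, num_steps=100)
print("learning_rate=0.01:", small_lr, "cost:", cost(small_lr))
print("learning_rate=0.1: ", large_lr, "cost:", cost(large_lr))
```

With the smaller learning rate the weight is still some distance from 3.0 after 100 steps, while the larger one gets there comfortably. On a real network a learning rate that is too large can just as easily overshoot, which is why it remains a human-chosen value.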

Momentum Update

One problem with stochastic gradient descent is that it’s slow and can take a long time for the optimizer to converge on a good set of weights. One solution to this problem is to use momentum. Momentum simply means: “If we’ve been moving in the same direction for a long time, we should probably move faster and faster in that direction”.

We can accomplish this by adding a momentum factor (typically ~0.9) to our previous one-weight example:

	velocity = 0 #No initial velocity. (Defined outside of optimization loop)

	…

	momentum = 0.9
	learning_rate = 0.01 #Some human-chosen learning rate
	gradient_for_weight_1 = … #Compute gradient
	velocity = (momentum * velocity) - (gradient_for_weight_1 * learning_rate) #Maintain a velocity that keeps increasing if we don't change direction
	weight_1 = weight_1 + velocity

We use velocity to keep track of the speed and direction in which weight_1 is increasing or decreasing. In general, momentum update works much better than stochastic gradient descent. For a math-focused look at why, see: Why Momentum Works.

The TensorFlow momentum update optimizer is available at tf.train.MomentumOptimizer.
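
As a rough illustration of the difference, the sketch below runs both update rules on the same made-up one-weight cost, (weight - 3.0) ** 2. The cost function, the helper names and the choice of 50 steps are assumptions for the example, not something taken from TensorFlow.

```python
def gradient(weight):
    # Gradient of the made-up cost (weight - 3.0) ** 2
    return 2.0 * (weight - 3.0)

def run_sgd(weight, learning_rate, num_steps):
    for _ in range(num_steps):
        weight = weight - gradient(weight) * learning_rate
    return weight

def run_momentum(weight, learning_rate, momentum, num_steps):
    velocity = 0.0  # No initial velocity
    for _ in range(num_steps):
        # Velocity builds up while the gradient keeps pointing the same way
        velocity = (momentum * velocity) - (gradient(weight) * learning_rate)
        weight = weight + velocity
    return weight

sgd_weight = run_sgd(0.0, learning_rate=0.01, num_steps=50)
momentum_weight = run_momentum(0.0, learning_rate=0.01, momentum=0.9, num_steps=50)
print("SGD:     ", sgd_weight)
print("Momentum:", momentum_weight)
```

Starting from the same weight and using the same learning_rate, the momentum version ends up noticeably closer to the minimum at 3.0 after the same number of steps.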

Adam Optimizer

The Adam Optimizer is my personal favorite optimizer simply because it seems to work the best. It combines the approaches of multiple optimizers we haven’t looked at so we’ll leave out the math and instead show a comparison of Adam, Momentum and SGD below:

Instead of using just one weight, this example uses two weights: x and y. Cost is represented on the z axis with blue colors representing smaller values and the star representing the global minimum.

Things to note:

SGD is very slow. It doesn’t make it to the minimum in the 120 training steps
Momentum sometimes overshoots its target
Adam seems to offer a somewhat reasonable balance between the two

The Adam Optimizer is available at tf.train.AdamOptimizer.
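
For the curious, here is a minimal one-weight sketch of the Adam update as it is usually written (following Kingma and Ba's paper). It is only meant to show the moving parts: the made-up cost function, the variable names and the hyperparameter values are assumptions, and tf.train.AdamOptimizer may arrange the arithmetic differently internally.

```python
import math

def gradient(weight):
    # Gradient of the made-up cost (weight - 3.0) ** 2
    return 2.0 * (weight - 3.0)

def run_adam(weight, learning_rate=0.1, beta1=0.9, beta2=0.999, epsilon=1e-8, num_steps=300):
    m = 0.0  # Running average of gradients (behaves like momentum)
    v = 0.0  # Running average of squared gradients (scales the step size)
    for step in range(1, num_steps + 1):
        g = gradient(weight)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        # Correct the bias caused by starting both averages at zero
        m_hat = m / (1.0 - beta1 ** step)
        v_hat = v / (1.0 - beta2 ** step)
        weight = weight - learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return weight

adam_weight = run_adam(0.0)
print("Adam:", adam_weight)
```

The first average plays the role of the velocity from the momentum update, while the second one shrinks the step for weights whose gradients have been consistently large.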

LTFN 6: Weight Initialization

Part of the series Learn TensorFlow Now

At the conclusion of the previous post, we realized that our first convolutional net wasn’t performing very well. It had a comparatively high cost (something we hadn’t seen before) and was performing slightly worse than a fully-connected network with the same number of layers.

Test results from 4-layer fully connected network:

Test Cost:  107.98408660641905
Test accuracy:  85.74999994039536 %

Test results from 4-layer Conv Net:

Test Cost: 15083.0833307
Test accuracy: 81.8799999356 %

As a refresher, here’s a visualization of the 4-layer ConvNet we built in the last post:

Visualization of layers from `input` through `pool2` (Click to enlarge).

So how do we figure out what’s broken?

When writing any typical program we might fire up a debugger or even just use something like printf() to figure out what’s going on. Unfortunately neural networks make this very difficult for us. We can’t really step through thousands of multiplication, addition and ReLU operations and expect to glean much insight. One common debugging technique is to visualize all of the intermediate outputs and try to see if there are any obvious problems.

Let’s take a look at a histogram of the outputs of each layer before they’re passed through the ReLU non-linearity. (Remember, the ReLU operation simply chops off all negative values).

If you look closely at the above plots you’ll notice that the variance increases substantially at each layer (TensorBoard doesn’t let me adjust the scales of each plot so it’s not immediately obvious). The majority of outputs at layer1_conv are within the range [-1,1], but by the time we get to layer4_conv the outputs vary between [-20,000, 20,000]. If we continue adding layers to our network this trend will continue and eventually our network will run into problems with overflow. In general we’d prefer our intermediate outputs to remain within some fixed range.

How does this relate to our high cost? Let’s take a look at the values of our logits and predictions. Recall that these values are calculated via:

	#We flatten the last layer (pool2) and multiply it by a set of weights to produce 10 logits
	shape = pool2.shape.as_list()
	fc = shape[1] * shape[2] * shape[3] #7x7x128 = 6,272
	reshape = tf.reshape(pool2, [-1, fc])
	fc_weights = tf.Variable(tf.random_normal([fc, 10])) #6,272×10
	fc_bias = tf.Variable(tf.zeros([10])) #10

	#Logits are ten numbers
	logits = tf.matmul(reshape, fc_weights) + fc_bias #10
	#Predictions are ten numbers that are scaled to add to 1.00
	predictions = tf.nn.softmax(logits) #10

The first thing to notice is that, like the previous layers, the values of logits have a large variance, with some values in the hundreds of thousands. The second thing to notice is that once we take the softmax of logits to create predictions, all of our values are reduced to either 1 or 0. Recall that tf.nn.softmax takes logits and ensures that the ten values add up to 1, with each value representing the probability that the image shows the corresponding digit. When some of our logits are tens of thousands of times bigger than the others, these values end up dominating the probabilities.

The visualization of predictions tells us that our network is super confident about the predictions it’s making. Essentially our network is claiming that it is 99% sure of its predictions. Whenever our network makes a mistake it is making a huge mistake and receives a large cost penalty for it.
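
A small sketch can show why large logits are so costly. The ten logit values below are made up, and the helper functions are plain-Python stand-ins written for this example rather than TensorFlow calls. We compute the softmax and the cross-entropy cost for an image whose correct class is not the one with the largest logit, first with modest logits and then with the same logits scaled up by 10,000.

```python
import math

def softmax(logits):
    # Subtracting the largest logit avoids overflow and doesn't change the result
    largest = max(logits)
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

def cross_entropy(logits, correct_index):
    # -log(softmax(logits)[correct_index]), computed directly from the logits
    largest = max(logits)
    log_sum_exp = largest + math.log(sum(math.exp(value - largest) for value in logits))
    return log_sum_exp - logits[correct_index]

small_logits = [2.0, 1.0, 0.5, 0.0, -0.5, -1.0, 0.3, -0.2, 0.1, -1.5]
large_logits = [value * 10000 for value in small_logits]
correct_index = 1  # The network's top guess (index 0) is wrong

small_predictions = softmax(small_logits)
large_predictions = softmax(large_logits)
small_cost = cross_entropy(small_logits, correct_index)
large_cost = cross_entropy(large_logits, correct_index)
print("Small logits: top prediction", max(small_predictions), "cost", small_cost)
print("Large logits: top prediction", max(large_predictions), "cost", large_cost)
```

With the large logits the top prediction saturates at 1 and the cost for this single wrong guess is in the thousands, whereas the same mistake with modest logits costs less than 3.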

The problem of intermediate outputs growing in magnitude translates directly into an increased cost. So how do we fix this? We want to restrict the magnitude of the intermediate outputs of our network so they don’t increase so drastically at each layer.

Smaller Initial Weights

Recall that each convolution operation takes the dot product of our weights with a portion of the input. Basically, we’re multiplying and adding up a bunch of numbers similar to the following:

w₀*i₀ + w₁*i₁ + w₂*i₂ + … + w_n*i_n = output

Where:

w_x – Represents a single weight
i_x – Represents a single input (eg. pixel)
n – The number of weights

One way to reduce the magnitude of this expression is to reduce the magnitude of all of our weights by some factor:

0.01*w₀*i₀ + 0.01*w₁*i₁ + 0.01*w₂*i₂ + … + 0.01*w_n*i_n = 0.01*output

Let’s try it and see if it works! We’ll modify the creation of our weights by multiplying them all by 0.01. Therefore layer1_weights would now be defined as:

layer1_weights = tf.Variable(tf.random_normal([3, 3, 1, 64]) * 0.01)

After changing all five sets of weights (don’t forget about the fully-connected layer at the end), we can run our network and see the following test cost and accuracies:

Test Cost:  2.3025865221
Test accuracy:  5.01999998465 %

Yikes! The cost has decreased quite a bit, but that accuracy is abysmal… What’s going on this time? Let’s take a look at the intermediate outputs of the network:

If you look closely at the scales, you’ll see that this time the intermediate outputs are decreasing! The first layer’s outputs lie largely within the interval [-0.02, 0.02] while the fourth layer generates outputs that lie within [-0.0002, 0.0002]. This is essentially the opposite of the problem we saw before.

Let’s also examine the logits and predictions as we did before:

This time the logits vary over a very small interval [-0.003, 0.003] and predictions are completely uniform. The predictions appear to be centered around 0.10 which seems to indicate that our network is simply predicting each of the ten digits with 10% probability. In other words, our network is learning nothing at all and we’re in an even worse state than before!
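
The test cost of roughly 2.3026 reported above is itself a clue. As a quick check (a plain-Python sketch with made-up logit values in the interval we observed, not TensorFlow code), here is what softmax does to tiny logits and the cost we get when every digit is predicted with 10% probability:

```python
import math

def softmax(logits):
    largest = max(logits)
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

# Made-up logits inside the interval [-0.003, 0.003]
tiny_logits = [0.003, -0.002, 0.001, 0.0, -0.003, 0.002, -0.001, 0.0015, -0.0005, 0.0]
predictions = softmax(tiny_logits)
print(predictions)

# Cross-entropy cost when the correct digit is given exactly 10% probability
uniform_cost = -math.log(1.0 / 10)
print(uniform_cost)
```

Every prediction lands within a hair of 0.10, and the cost of a perfectly uniform guess is ln(10), which is about 2.3026 and matches the test cost we saw.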

Choosing the Perfect Initial Weights

What we’ve learned so far:

Large initial weights lead to very large output in intermediate layers and an over-confident network.
Small initial weights lead to very small output in intermediate layers and a network that doesn’t learn anything.

So how do we choose initial weights that are not too small and not too large? In 2010, Xavier Glorot and Yoshua Bengio published Understanding the difficulty of training deep feedforward neural networks, in which they proposed initializing a set of weights based on how many input and output neurons are present for a given weight. This initialization scheme is called Xavier Initialization. For more on it, see An Explanation of Xavier Initialization.

It turns out that Xavier Initialization does not work well for layers using the asymmetric ReLU activation function. So while we can use it on our fully connected layer, we shouldn’t use it for our intermediate layers. However, in 2015 Microsoft Research (Kaiming He et al.) published Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In this paper they introduced a modified version of Xavier Initialization called Variance Scaling Initialization.

The math behind these initialization schemes is out of scope for this post, but TensorFlow makes them easy to use. I recommend simply remembering:

Use Xavier Initialization in the fully-connected layers of your network. (Or layers that use softmax/tanh activation functions)
Use Variance Scaling Initialization in the intermediate layers of your network that use ReLU activation functions.
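
To get a feel for why the scale matters, here is a toy simulation. It is a sketch under several assumptions: it uses four fully-connected ReLU layers of width 128 rather than our convolutional layers, random inputs in place of images, and the rule from He et al. of drawing weights with a standard deviation of sqrt(2 / fan_in). It is not what TensorFlow runs internally, but it shows the same three behaviours we saw in the histograms.

```python
import math
import random

def relu(values):
    return [max(0.0, value) for value in values]

def last_layer_magnitude(weight_std, num_layers=4, width=128, seed=0):
    # Returns the average magnitude of the outputs of the last layer (before ReLU)
    rng = random.Random(seed)
    activations = [rng.random() for _ in range(width)]  # Stand-in for pixel values
    pre_activations = []
    for _ in range(num_layers):
        # Each output is the dot product of the inputs with freshly drawn random weights
        pre_activations = [
            sum(rng.gauss(0.0, weight_std) * value for value in activations)
            for _ in range(width)
        ]
        activations = relu(pre_activations)
    return sum(abs(value) for value in pre_activations) / width

fan_in = 128
large = last_layer_magnitude(weight_std=1.0)
small = last_layer_magnitude(weight_std=0.01)
scaled = last_layer_magnitude(weight_std=math.sqrt(2.0 / fan_in))
print("std = 1.0:            ", large)
print("std = 0.01:           ", small)
print("std = sqrt(2/fan_in): ", scaled)
```

Unit-variance weights blow the outputs up into the thousands, weights scaled by 0.01 shrink them towards zero, and the scaled version keeps them at roughly the same magnitude as the input.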

We can modify the initialization of layer1_weights from tf.random_normal to use tf.contrib.layers.variance_scaling_initializer() as follows:

layer1_weights = tf.get_variable("layer1_weights", [3, 3, 1, 64], initializer=tf.contrib.layers.variance_scaling_initializer())

We can also modify the fully connected layer’s weights to use tf.contrib.layers.xavier_initializer as follows:

fully_connected_weights = tf.get_variable("fully_connected_weights", [fc, 10], initializer=tf.contrib.layers.xavier_initializer())

There are a few small changes to note here. First, we use tf.get_variable instead of calling tf.Variable directly. This allows us to pass in a custom initializer for our weights. Second, we have to provide a unique name for our variable. Typically I just use the same name as my variable name.

If we continue changing all the weights in our network and run it, we can see the following output:

Cost: 2.49579
Accuracy: 9.00000035763 %
Cost: 1.05762
Accuracy: 77.999997139 %
...
Cost: 0.110656
Accuracy: 94.9999988079 %
Test Cost: 0.0945288215741
Test accuracy: 97.2900004387 %

Much better! This is a big improvement over our previous results and we can see that both cost and accuracy have improved substantially. For the sake of curiosity, let’s look at the intermediate outputs of our network:

This looks much better. The variance of the intermediate values appears to increase only slightly as we move through the layers and all values are within about an order of magnitude of one another. While we can’t make any claims about the intermediate outputs being “perfect” or even “good”, we can at least rest assured that there are no glaringly obvious problems with them. (Sidenote: This seems to be a common theme in deep learning: We usually can’t prove we’ve done things correctly, we can only look for signs that we’ve done them incorrectly).

Thoughts on Weights

Hopefully I’ve managed to convince you of the importance of choosing good initial weights for a neural network. Fortunately when it comes to image recognition, there are well-known initialization schemes that pretty much solve this problem for us.

The problems with weight initialization should highlight the fragility of deep neural networks. After all, we would hope that even if we choose poor initial weights, after enough time our gradient descent optimizer would manage to correct them and settle on good values for our weights. Unfortunately that doesn’t seem to be the case, and our optimizer instead settles into a relatively poor local minimum.

Complete Code

	import tensorflow as tf
	import numpy as np
	import shutil
	from tensorflow.examples.tutorials.mnist import input_data
	mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

	train_images = np.reshape(mnist.train.images, (-1, 28, 28, 1))
	train_labels = mnist.train.labels
	test_images = np.reshape(mnist.test.images, (-1, 28, 28, 1))
	test_labels = mnist.test.labels

	graph = tf.Graph()
	with graph.as_default():
		input = tf.placeholder(tf.float32, shape=(None, 28, 28, 1))
		labels = tf.placeholder(tf.float32, shape=(None, 10))

		layer1_weights = tf.get_variable("layer1_weights", [3, 3, 1, 64], initializer=tf.contrib.layers.variance_scaling_initializer())
		layer1_bias = tf.Variable(tf.zeros([64]))
		layer1_conv = tf.nn.conv2d(input, filter=layer1_weights, strides=[1,1,1,1], padding='SAME')
		layer1_out = tf.nn.relu(layer1_conv + layer1_bias)

		layer2_weights = tf.get_variable("layer2_weights", [3, 3, 64, 64], initializer=tf.contrib.layers.variance_scaling_initializer())
		layer2_bias = tf.Variable(tf.zeros([64]))
		layer2_conv = tf.nn.conv2d(layer1_out, filter=layer2_weights, strides=[1,1,1,1], padding='SAME')
		layer2_out = tf.nn.relu(layer2_conv + layer2_bias)

		pool1 = tf.nn.max_pool(layer2_out, ksize=[1,2,2,1], strides=[1,2,2,1], padding='VALID')

		layer3_weights = tf.get_variable("layer3_weights", [3, 3, 64, 128], initializer=tf.contrib.layers.variance_scaling_initializer())
		layer3_bias = tf.Variable(tf.zeros([128]))
		layer3_conv = tf.nn.conv2d(pool1, filter=layer3_weights, strides=[1,1,1,1], padding='SAME')
		layer3_out = tf.nn.relu(layer3_conv + layer3_bias)

		layer4_weights = tf.get_variable("layer4_weights", [3, 3, 128, 128], initializer=tf.contrib.layers.variance_scaling_initializer())
		layer4_bias = tf.Variable(tf.zeros([128]))
		layer4_conv = tf.nn.conv2d(layer3_out, filter=layer4_weights, strides=[1,1,1,1], padding='SAME')
		layer4_out = tf.nn.relu(layer4_conv + layer4_bias)

		pool2 = tf.nn.max_pool(layer4_out, ksize=[1,2,2,1], strides=[1,2,2,1], padding='VALID')

		shape = pool2.shape.as_list()
		fc = shape[1] * shape[2] * shape[3]
		reshape = tf.reshape(pool2, [-1, fc])
		fully_connected_weights = tf.get_variable("fully_connected_weights", [fc, 10], initializer=tf.contrib.layers.xavier_initializer())
		fully_connected_bias = tf.Variable(tf.zeros([10]))
		logits = tf.matmul(reshape, fully_connected_weights) + fully_connected_bias

		cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))

		learning_rate = 0.001
		optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

		#Add a few nodes to calculate accuracy and optionally retrieve predictions
		predictions = tf.nn.softmax(logits)
		correct_prediction = tf.equal(tf.argmax(labels, 1), tf.argmax(predictions, 1))
		accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

		with tf.Session(graph=graph) as session:
			tf.global_variables_initializer().run()

			num_steps = 5000
			batch_size = 100
			for step in range(num_steps):
				offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
				batch_images = train_images[offset:(offset + batch_size), :]
				batch_labels = train_labels[offset:(offset + batch_size), :]
				feed_dict = {input: batch_images, labels: batch_labels}

				_, c, acc = session.run([optimizer, cost, accuracy], feed_dict=feed_dict)

				if step % 100 == 0:
					print("Cost: ", c)
					print("Accuracy: ", acc * 100.0, "%")

			#Test
			num_test_batches = len(test_images) // batch_size
			total_accuracy = 0
			total_cost = 0
			for step in range(num_test_batches):
				offset = step * batch_size
				batch_images = test_images[offset:(offset + batch_size)]
				batch_labels = test_labels[offset:(offset + batch_size)]
				feed_dict = {input: batch_images, labels: batch_labels}

				c, acc = session.run([cost, accuracy], feed_dict=feed_dict)
				total_cost = total_cost + c
				total_accuracy = total_accuracy + acc

			print("Test Cost: ", total_cost / num_test_batches)
			print("Test accuracy: ", total_accuracy * 100.0 / num_test_batches, "%")

LTFN 5: Building a ConvNet

Part of the series Learn TensorFlow Now

In the last post we looked at the building blocks of a convolutional neural net. The convolution operation works by sliding a filter along the input and taking the dot product at each location to generate an output volume.
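
As a refresher, the sliding dot product can be sketched in a few lines of plain Python. This is a simplified illustration with made-up values: a single-channel input, a single 2x2 filter, no padding and no bias. Like tf.nn.conv2d, it does not flip the filter before multiplying.

```python
def convolve2d_valid(image, kernel, stride=1):
    kernel_height, kernel_width = len(kernel), len(kernel[0])
    out_height = (len(image) - kernel_height) // stride + 1
    out_width = (len(image[0]) - kernel_width) // stride + 1
    output = []
    for row in range(out_height):
        output_row = []
        for col in range(out_width):
            top, left = row * stride, col * stride
            # Dot product of the filter with the patch of input underneath it
            total = 0.0
            for i in range(kernel_height):
                for j in range(kernel_width):
                    total += image[top + i][left + j] * kernel[i][j]
            output_row.append(total)
        output.append(output_row)
    return output

image = [
    [1, 2, 0, 1],
    [3, 1, 2, 0],
    [0, 2, 1, 3],
    [1, 0, 2, 2],
]
kernel = [
    [1, 0],
    [0, 1],
]
result = convolve2d_valid(image, kernel)
print(result)  # [[2.0, 4.0, 0.0], [5.0, 2.0, 5.0], [0.0, 4.0, 3.0]]
```

Because there is no padding, the 4x4 input shrinks to a 3x3 output, which is one of the reasons padding shows up in the list of parameters below.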

The parameters we need to consider when building a convolutional layer are:

1. Padding – Should we pad the input with zeroes?
2. Stride – Should we move the filter more than one pixel at a time?
3. Input depth – Each convolutional filter must have a depth that matches the input depth.
4. Number of filters – We can stack multiple filters to increase the depth of the output.

With this knowledge we can construct our first convolutional neural network. We’ll start by creating a single convolutional layer that operates on a batch of input images of size 28x28x1.

	layer1_weights = tf.Variable(tf.random_normal([3, 3, 1, 64])) #3x3x1x64
	layer1_bias = tf.Variable(tf.zeros([64])) #64
	layer1_conv = tf.nn.conv2d(input, filter=layer1_weights, strides=[1,1,1,1], padding='SAME') #28x28x64
	layer1_out = tf.nn.relu(layer1_conv + layer1_bias) #28x28x64

Visualization of `layer1` with the corresponding dimensions marked.

We start by creating a 4-D Tensor for layer1_weights. This Tensor represents the weights of the various filters that will be used in our convolution and then trained via gradient descent. By default, TensorFlow uses the format [filter_height, filter_width, in_depth, out_depth] for convolutional filters. In this example, we’re defining 64 filters each of which has a height of 3, width of 3, and an input depth of 1.

Depth

It’s important to remember that in_depth must always match the depth of the input we’re convolving. If our images were RGB, we would have had to create filters with a depth of 3.

On the other hand, we can increase or decrease output depth simply by changing the value we specify for out_depth. This represents how many independent filters we’ll create and therefore the depth of the output. In our example, we’ve specified 64 filters and we can see layer1_conv has a corresponding depth of 64.

Stride

Stride represents how fast we move the filter along each dimension. By default, TensorFlow expects stride to be defined in terms of [batch_stride, height_stride, width_stride, depth_stride]. In practice, batch_stride and depth_stride are almost always 1 as we don’t want to skip over examples in a batch or entire slices of volume. In the above example, we’re using strides=[1,1,1,1] to specify that we’ll be moving the filters across the image one pixel at a time.

Padding

TensorFlow allows us to specify either SAME or VALID padding. VALID padding does not pad the image with zeroes. Specifying SAME pads the image with enough zeroes such that the output will have the same height and width dimensions as the input assuming we’re using a stride of 1. Most of the time we use SAME padding so as not to have the output shrink at each layer of our network. To dig into the specifics of how padding is calculated, see TensorFlow’s documentation on convolutions.
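
The effect of padding and stride on the output size can be sketched with a small helper. The function name is made up for this example, and the two formulas are the ones given in TensorFlow’s documentation for SAME and VALID padding, applied to one spatial dimension at a time.

```python
import math

def output_size(input_size, kernel_size, stride, padding):
    if padding == 'SAME':
        # Enough zeroes are added that only the stride shrinks the output
        return math.ceil(input_size / stride)
    if padding == 'VALID':
        # No zeroes are added, so the filter must fit entirely inside the input
        return math.ceil((input_size - kernel_size + 1) / stride)
    raise ValueError("padding must be 'SAME' or 'VALID'")

print(output_size(28, 3, 1, 'SAME'))   # 28
print(output_size(28, 3, 1, 'VALID'))  # 26
print(output_size(28, 3, 2, 'SAME'))   # 14
```

With a 3x3 filter and a stride of 1, SAME keeps our 28x28 input at 28x28, while VALID would shave it down to 26x26.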

Bias

Finally, we have to remember to include a bias term for each filter. Since we’ve created 64 filters, we’ll have to create a bias term of size 64. We apply bias after performing the convolution operation, but before passing the result to our ReLU non-linearity.
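
Putting the filter dimensions and the bias together, we can count how many trainable values a convolutional layer has. The helper below is a hypothetical one written for this post, not a TensorFlow function.

```python
def conv_layer_parameters(filter_height, filter_width, in_depth, out_depth):
    # One weight per position in each filter, plus one bias per filter
    weights = filter_height * filter_width * in_depth * out_depth
    biases = out_depth
    return weights + biases

layer1_parameters = conv_layer_parameters(3, 3, 1, 64)
print(layer1_parameters)  # 3*3*1*64 weights + 64 biases = 640
```

Note that the count does not depend on the height or width of the input, only on the filter size and the input and output depths.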

Max Pooling

As the above shows, intermediate representations (eg. layer1_out) keep the same width and height as the input flows through our network, while increasing in depth. However, if we continue making deeper and deeper representations we’ll find that the number of operations we need to perform will explode. Each of the filters has to be dragged across a 28x28 input, taking the dot product at each location. As our filters get deeper this results in larger and larger groups of multiplications and additions.
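
To put rough numbers on this, we can count the multiply-add operations a convolutional layer performs. This is a back-of-the-envelope sketch using a hypothetical helper; the 64-to-64 and 64-to-128 depths are example values, and the count ignores the bias additions and the ReLU.

```python
def conv_multiply_adds(out_height, out_width, filter_height, filter_width, in_depth, out_depth):
    # One multiply-add per filter weight, at every output position, for every filter
    return out_height * out_width * filter_height * filter_width * in_depth * out_depth

layer1 = conv_multiply_adds(28, 28, 3, 3, 1, 64)
deeper_layer = conv_multiply_adds(28, 28, 3, 3, 64, 64)
after_downsampling = conv_multiply_adds(14, 14, 3, 3, 64, 128)
print(layer1)              # 451,584
print(deeper_layer)        # 28,901,376
print(after_downsampling)  # 14,450,688
```

Going from an input depth of 1 to an input depth of 64 multiplies the work by 64, while halving the height and width cuts it by a factor of 4, even if we double the number of filters at the same time.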

Periodically we would like to downsample and compress our intermediate representations to have smaller height and width dimensions. The most common way to do this is by using a max pooling operation.

Max pooling is relatively simple. We slide a window (also called a kernel) along the input and simply take the max value at each point. As with convolutions, we can control the size of the sliding window, the stride of the window and choose whether or not to pad the input with zeroes.

Below is a simple example demonstrating max pooling on an unpadded input of 4x4 with a kernel size of 2x2 and a stride of 2:
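
The same operation can be sketched in plain Python. The 4x4 input values here are made up for illustration, and the helper is a stand-in for what tf.nn.max_pool does on a single unpadded channel.

```python
def max_pool(image, kernel_size=2, stride=2):
    out_height = (len(image) - kernel_size) // stride + 1
    out_width = (len(image[0]) - kernel_size) // stride + 1
    output = []
    for row in range(out_height):
        output_row = []
        for col in range(out_width):
            top, left = row * stride, col * stride
            # Collect the values under the window and keep only the largest
            window = [
                image[top + i][left + j]
                for i in range(kernel_size)
                for j in range(kernel_size)
            ]
            output_row.append(max(window))
        output.append(output_row)
    return output

image = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 8, 3, 4],
]
pooled = max_pool(image)
print(pooled)  # [[6, 4], [8, 9]]
```

Each 2x2 block of the input is reduced to its largest value, so the 4x4 input becomes a 2x2 output.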

Max pooling is the most popular way to downsample, but it’s certainly not the only way. Alternatives include average pooling, which takes the average value of each window, and plain convolutions with a stride of 2. For more on the latter approach see: The All Convolutional Net.

The most common form of max pooling uses a 2x2 kernel (ksize=[1,2,2,1]) and a stride of 2 in the width and height dimensions (strides=[1,2,2,1]).

Putting it all together

Finally we have all the pieces to build our first convolutional neural network. Below is a network with four convolutional layers and two max pooling layers (You can find the complete code at the end of this post).

	layer1_weights = tf.Variable(tf.random_normal([3, 3, 1, 64])) #3x3x1x64
	layer1_bias = tf.Variable(tf.zeros([64])) #64
	layer1_conv = tf.nn.conv2d(input, filter=layer1_weights, strides=[1,1,1,1], padding='SAME') #28x28x64
	layer1_out = tf.nn.relu(layer1_conv + layer1_bias) #28x28x64

	layer2_weights = tf.Variable(tf.random_normal([3, 3, 64, 64])) #3x3x64x64
	layer2_bias = tf.Variable(tf.zeros([64])) #64
	layer2_conv = tf.nn.conv2d(layer1_out, filter=layer2_weights, strides=[1,1,1,1], padding='SAME')#28x28x64
	layer2_out = tf.nn.relu(layer2_conv + layer2_bias) #28x28x64

	pool1 = tf.nn.max_pool(layer2_out, ksize=[1,2,2,1], strides=[1,2,2,1], padding='VALID') #14x14x64

	layer3_weights = tf.Variable(tf.random_normal([3, 3, 64, 128])) #3x3x64x128
	layer3_bias = tf.Variable(tf.zeros([128])) #128
	layer3_conv = tf.nn.conv2d(pool1, filter=layer3_weights, strides=[1,1,1,1], padding='SAME') #14x14x128
	layer3_out = tf.nn.relu(layer3_conv + layer3_bias) #14x14x128

	layer4_weights = tf.Variable(tf.random_normal([3, 3, 128, 128])) #3x3x128x128
	layer4_bias = tf.Variable(tf.zeros([128])) #128
	layer4_conv = tf.nn.conv2d(layer3_out, filter=layer4_weights, strides=[1,1,1,1], padding='SAME')#14x14x128
	layer4_out = tf.nn.relu(layer4_conv + layer4_bias) #14x14x128

	pool2 = tf.nn.max_pool(layer4_out, ksize=[1,2,2,1], strides=[1,2,2,1], padding='VALID') #7x7x128

	shape = pool2.shape.as_list()
	fc = shape[1] * shape[2] * shape[3] #7x7x128 = 6,272
	reshape = tf.reshape(pool2, [-1, fc])
	fully_connected_weights = tf.Variable(tf.random_normal([fc, 10])) #6,272×10
	fully_connected_bias = tf.Variable(tf.zeros([10])) #10
	logits = tf.matmul(reshape, fully_connected_weights) + fully_connected_bias #10

Before diving into the code, let’s take a look at a visualization of our network from input through pool2 to get a sense of what’s going on:

There are a few things worth noticing here. First, notice that in_depth of each set of convolutional filters matches the depth of the previous layers. Also note that the depth of each intermediate layer is determined by the number of filters (out_depth) at each layer.

We should also notice that every pooling layer we’ve used is a 2x2 max pooling operation using strides=[1,2,2,1]. Recall the default format for strides is [batch_stride, height_stride, width_stride, depth_stride]. This means that we move the window two steps at a time along the height and width dimensions, but only one step at a time along the batch and depth dimensions. This results in a shrinkage of height and width by a factor of 2. As data moves through our network, the representations become deeper with smaller width and height dimensions.

Finally, the last six lines are a little bit tricky. At the conclusion of our network we need to make predictions about which number we’re seeing. The way we do that is by adding a fully connected layer at the very end of our network. We reshape pool2 from a 7x7x128 3-D volume to a single vector with 6,272 values. Finally, we connect this vector to 10 output logits from which we can extract our predictions.
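
To double check the 6,272 figure, we can trace the shape of a single image through the network by hand. The sketch below is plain Python bookkeeping rather than TensorFlow code; the layer list simply mirrors the network above.

```python
# (name, kind, number of filters) for each layer from input through pool2
layers = [
    ("layer1_conv", "conv", 64),
    ("layer2_conv", "conv", 64),
    ("pool1", "pool", None),
    ("layer3_conv", "conv", 128),
    ("layer4_conv", "conv", 128),
    ("pool2", "pool", None),
]

height, width, depth = 28, 28, 1
for name, kind, out_depth in layers:
    if kind == "conv":
        # SAME padding with a stride of 1 keeps height and width, depth becomes out_depth
        depth = out_depth
    else:
        # 2x2 max pooling with a stride of 2 halves height and width, depth is unchanged
        height, width = height // 2, width // 2
    print(name, (height, width, depth))

flattened = height * width * depth
print("Flattened size:", flattened)  # 7 * 7 * 128 = 6272
```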

With everything in place, we can run our network and take a look at how well it performs:

Cost: 979579.0
Accuracy: 7.0000000298 %
Cost: 174063.0
Accuracy: 23.9999994636 %
Cost: 95255.1
Accuracy: 47.9999989271 %

...

Cost: 10001.9
Accuracy: 87.9999995232 %
Cost: 16117.2
Accuracy: 77.999997139 %
Test Cost: 15083.0833307
Test accuracy: 81.8799999356 %

Yikes. There are two things that jump out at me when I look at these numbers:

The cost seems very high despite achieving a reasonable result.
The test accuracy has decreased when compared to our fully-connected network which achieved an accuracy of ~89%

So are convolutional nets broken? Was all this effort for nothing? Not quite. Next time we’ll look at an underlying problem with how we’re choosing our initial random weight values and a better strategy that should improve our results beyond that of our fully-connected network.

Complete Code

	import tensorflow as tf
	import numpy as np
	from tensorflow.examples.tutorials.mnist import input_data
	mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

	train_images = np.reshape(mnist.train.images, (-1, 28, 28, 1))
	train_labels = mnist.train.labels
	test_images = np.reshape(mnist.test.images, (-1, 28, 28, 1))
	test_labels = mnist.test.labels


	graph = tf.Graph()
	with graph.as_default():
	    input = tf.placeholder(tf.float32, shape=(None, 28, 28, 1))
	    labels = tf.placeholder(tf.float32, shape=(None, 10))

	    #Two 3x3 convolutional layers with 64 filters each
	    layer1_weights = tf.Variable(tf.random_normal([3, 3, 1, 64]))
	    layer1_bias = tf.Variable(tf.zeros([64]))
	    layer1_conv = tf.nn.conv2d(input, filter=layer1_weights, strides=[1,1,1,1], padding='SAME')
	    layer1_out = tf.nn.relu(layer1_conv + layer1_bias)

	    layer2_weights = tf.Variable(tf.random_normal([3, 3, 64, 64]))
	    layer2_bias = tf.Variable(tf.zeros([64]))
	    layer2_conv = tf.nn.conv2d(layer1_out, filter=layer2_weights, strides=[1,1,1,1], padding='SAME')
	    layer2_out = tf.nn.relu(layer2_conv + layer2_bias)

	    pool1 = tf.nn.max_pool(layer2_out, ksize=[1,2,2,1], strides=[1,2,2,1], padding='VALID')

	    #Two 3x3 convolutional layers with 128 filters each
	    layer3_weights = tf.Variable(tf.random_normal([3, 3, 64, 128]))
	    layer3_bias = tf.Variable(tf.zeros([128]))
	    layer3_conv = tf.nn.conv2d(pool1, filter=layer3_weights, strides=[1,1,1,1], padding='SAME')
	    layer3_out = tf.nn.relu(layer3_conv + layer3_bias)

	    layer4_weights = tf.Variable(tf.random_normal([3, 3, 128, 128]))
	    layer4_bias = tf.Variable(tf.zeros([128]))
	    layer4_conv = tf.nn.conv2d(layer3_out, filter=layer4_weights, strides=[1,1,1,1], padding='SAME')
	    layer4_out = tf.nn.relu(layer4_conv + layer4_bias)

	    pool2 = tf.nn.max_pool(layer4_out, ksize=[1,2,2,1], strides=[1,2,2,1], padding='VALID')

	    #Flatten pool2 and connect it to 10 output logits
	    shape = pool2.shape.as_list()
	    fc = shape[1] * shape[2] * shape[3]
	    reshape = tf.reshape(pool2, [-1, fc])
	    fc_weights = tf.Variable(tf.random_normal([fc, 10]))
	    fc_bias = tf.Variable(tf.zeros([10]))
	    logits = tf.matmul(reshape, fc_weights) + fc_bias

	    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))

	    learning_rate = 0.0000001
	    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

	    #Add a few nodes to calculate accuracy and optionally retrieve predictions
	    predictions = tf.nn.softmax(logits)
	    correct_prediction = tf.equal(tf.argmax(labels, 1), tf.argmax(predictions, 1))
	    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

	    with tf.Session(graph=graph) as session:
	        tf.global_variables_initializer().run()

	        num_steps = 5000
	        batch_size = 100
	        for step in range(num_steps):
	            #Wrap around the training set once we reach the end
	            offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
	            batch_images = train_images[offset:(offset + batch_size), :]
	            batch_labels = train_labels[offset:(offset + batch_size), :]
	            feed_dict = {input: batch_images, labels: batch_labels}

	            _, c, acc = session.run([optimizer, cost, accuracy], feed_dict=feed_dict)

	            if step % 100 == 0:
	                print("Cost: ", c)
	                print("Accuracy: ", acc * 100.0, "%")


	        #Test
	        num_test_batches = int(len(test_images) / batch_size)
	        total_accuracy = 0
	        total_cost = 0
	        for step in range(num_test_batches):
	            #Each test batch is visited exactly once, so no wrap-around is needed
	            offset = step * batch_size
	            batch_images = test_images[offset:(offset + batch_size)]
	            batch_labels = test_labels[offset:(offset + batch_size)]
	            feed_dict = {input: batch_images, labels: batch_labels}

	            c, acc = session.run([cost, accuracy], feed_dict=feed_dict)
	            total_cost = total_cost + c
	            total_accuracy = total_accuracy + acc

	        print("Test Cost: ", total_cost / num_test_batches)
	        print("Test accuracy: ", total_accuracy * 100.0 / num_test_batches, "%")

LTFN 4: Intro to Convolutional Neural Networks

Part of the series Learn TensorFlow Now

The neural networks we’ve built so far have had a relatively simple structure. The input to each layer is fully connected to the output of the previous layer. For this reason, these layers are commonly called fully connected layers.

Two **fully connected** layers in a neural network.

This has been mathematically convenient because we’ve been able to represent each layer’s output as a matrix multiplication of the previous layer’s output (a vector) with the current layer’s weights.

However, as we build more complex networks for image recognition, there are certain properties we want that are difficult to get from fully connected layers. Some of these properties include:

Translational Invariance – A fancy phrase for “A network trained to recognize cats should recognize cats equally well if they’re in the top left of the picture or the bottom right of the picture”. If we move the cat around the image, we should still expect to recognize it.
Translational invariance suggests we should recognize objects regardless of where they’re located in the image.
Local Connectivity – This means that we should take advantage of features within a certain area of the image. Remember that in previous posts we treated the input as a single row of pixels. This meant that local features (e.g. edges, curves, loops) were very hard for our networks to identify and pick out. Ideally our network should try to identify patterns that occur within local regions of the image and use these patterns to influence its predictions.

Today we’re going to look at one of the most successful classes of neural networks: Convolutional Neural Networks. Convolutional Neural Networks have been shown to give us both translational invariance and local connectivity.

The building block of a convolutional neural network is a convolutional filter. It is a square (typically 3x3) set of weights. The convolutional filter looks at pieces of the input of the same shape. As it does, it takes the dot product of the weights with the input and saves the result in the output. The convolutional filter is dragged along the entire input until the entire input has been covered. Below is a simple example with a (random) 5x5 input and a (random) 3x3 filter.
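To make the sliding dot product concrete, here is a minimal NumPy sketch. The helper name conv2d_valid and the random integer values are my own choices for illustration; TensorFlow’s tf.nn.conv2d does the same job far more efficiently.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and take a dot product at each position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for row in range(out_h):
        for col in range(out_w):
            patch = image[row:row + kh, col:col + kw]  # piece of the input under the filter
            output[row, col] = np.sum(patch * kernel)  # dot product of weights and input
    return output

rng = np.random.RandomState(0)
image = rng.randint(0, 10, size=(5, 5)).astype(float)   # a (random) 5x5 input
kernel = rng.randint(-1, 2, size=(3, 3)).astype(float)  # a (random) 3x3 filter

result = conv2d_valid(image, kernel)
print(result.shape)  # (3, 3)
```

A 3x3 filter fits into a 5x5 input at three positions in each direction, which is why the output is 3x3.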

So why is this useful? Consider the following examples with a vertical line in the input and a 3×3 filter with weights chosen specifically to detect vertical edges.

Vertical edge detection from light-to-dark.

Vertical edge detection from dark-to-light.

We can see that with hand-picked weights, we’re able to generate patterns in the output. In this example, light-to-dark transitions produce large positive values while dark-to-light transitions produce large negative values. Where there is no change at all, the filter will simply produce zeroes.
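Here is a small NumPy sketch of the same idea. The filter weights and pixel values below are assumptions chosen for illustration (positive weights on the left, negative on the right) and are not necessarily the exact numbers from the figures above.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Sliding dot product with no padding and stride 1."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for row in range(out_h):
        for col in range(out_w):
            output[row, col] = np.sum(image[row:row + kh, col:col + kw] * kernel)
    return output

# Hypothetical hand-picked vertical edge filter.
vertical_edge = np.array([[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]], dtype=float)

# Bright pixels are 10, dark pixels are 0.
light_to_dark = np.tile([10, 10, 10, 0, 0], (5, 1)).astype(float)
dark_to_light = np.tile([0, 0, 10, 10, 10], (5, 1)).astype(float)
no_edge = np.full((5, 5), 10.0)

positive = conv2d_valid(light_to_dark, vertical_edge)  # every row is 0, 30, 30
negative = conv2d_valid(dark_to_light, vertical_edge)  # every row is -30, -30, 0
zeroes = conv2d_valid(no_edge, vertical_edge)          # all zeroes

print(positive)
print(negative)
print(zeroes)
```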

While we’ve chosen the above filter’s weights manually, it turns out that training our network via gradient descent ends up selecting very good weights for these filters. As we add more convolutional layers to our network they begin to be able to recognize more abstract concepts such as faces, whiskers, wheels etc.

Padding

You may have noticed that the output above has a smaller width and height than the original input. If we pass this output to another convolutional layer it will continue to shrink. Without dealing with this shrinkage, we’ll find that this puts an upper bound on how many convolutional layers we can have in our network.

SAME Padding

The most common way to deal with this shrinkage is to pad the entire image with enough zeroes such that the output shape will have the same width and height as the input. This is called SAME padding and allows us to continue passing the output to more and more convolutional layers without worrying about shrinking width and height dimensions. Below we take our first example (5×5 input) and pad it with zeroes to make sure the output is still 5×5.

A `5x5` input padded with zeroes to generate a `5x5` output.
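The effect of zero padding can be sketched with np.pad. The helper conv2d_valid and the input values are assumptions for illustration; in TensorFlow we would simply pass padding='SAME'.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Sliding dot product with no padding and stride 1."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for row in range(out_h):
        for col in range(out_w):
            output[row, col] = np.sum(image[row:row + kh, col:col + kw] * kernel)
    return output

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3))

# For a 3x3 filter with stride 1, one ring of zeroes around the input is enough.
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)

unpadded_output = conv2d_valid(image, kernel)
padded_output = conv2d_valid(padded, kernel)

print(padded.shape)           # (7, 7)
print(unpadded_output.shape)  # (3, 3)
print(padded_output.shape)    # (5, 5)
```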

VALID Padding

VALID padding does not pad the input with anything. It probably would have made more sense to call it NO padding or NONE padding.

Stride

So far we’ve been moving the convolutional filter across the input one pixel at a time. In other words, we’ve been using stride=1. Stride refers to the number of pixels we move the filter in the width and height dimensions every time we compute a dot product. The most common stride value is stride=1, but certain algorithms require larger stride values. Below is an example using stride=2.

Notice that larger stride values result in larger decreases in output height and width. Occasionally this is desirable near the start of a network when working with larger images. Smaller input width and height can make the calculations more manageable in deeper layers of the network.
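The relationship between input size, filter size, stride and output size can be written down directly. The two helper functions below are my own sketch of the usual formulas for VALID and SAME padding; the 5-pixel input is just an example.

```python
import math

def valid_output_size(input_size, filter_size, stride):
    """Output width (or height) when no padding is added."""
    return (input_size - filter_size) // stride + 1

def same_output_size(input_size, stride):
    """Output width (or height) when the input is zero-padded (SAME)."""
    return math.ceil(input_size / stride)

print(valid_output_size(5, 3, stride=1))  # 3
print(valid_output_size(5, 3, stride=2))  # 2
print(same_output_size(5, stride=1))      # 5
print(same_output_size(5, stride=2))      # 3
```

With stride=1 and SAME padding the output keeps the input’s width and height, while stride=2 roughly halves them.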

Input Depth

In our previous examples we’ve been working with inputs that have variable height and width dimensions, but no depth dimension. However, some images (e.g. RGB) have depth, and we need some way to account for it. The key is to extend our filter’s depth dimension to match the depth dimension of the input.

Unfortunately, I lack the animation skills to properly show an animated example of this, but the following image may help:

Convolution over an input with a depth of 2 using a single filter with a depth of 2.

Above we have an input of size 5x5x2 and a single filter of size 3x3x2. The filter is dragged across the input and once again the dot product is taken at each point. The difference here is that 18 values are added up at each point (9 from each depth layer of the input). The result is an output with a depth of 1.
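In lieu of an animation, here is a NumPy sketch of the same computation. The random values are placeholders; the shapes are the ones described above.

```python
import numpy as np

rng = np.random.RandomState(0)
image = rng.normal(size=(5, 5, 2))   # height x width x depth
kernel = rng.normal(size=(3, 3, 2))  # the filter's depth matches the input's depth

output = np.zeros((3, 3))
for row in range(3):
    for col in range(3):
        patch = image[row:row + 3, col:col + 3, :]  # a 3x3x2 piece of the input
        output[row, col] = np.sum(patch * kernel)   # 18 products added together

print(kernel.size)   # 18
print(output.shape)  # (3, 3)
```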

Output Depth

We can also control the output depth by stacking up multiple convolutional filters. Each filter acts independently of the others while computing its results, and then all of the results are stacked together to create the output. This means we can control output depth simply by adding or removing convolutional filters.

Two convolutional filters result in an output depth of two.

It’s very important to note that there are two distinct convolutional filters above. The weights of each convolutional filter are distinct from the weights of the other convolutional filter. Each of these two filters has a shape of 3x3x2. If we wanted to get a deeper output, we could continue stacking more of these 3x3x2 filters on top of one another.

Imagine for a moment that we stacked four convolutional filters on top of one another, each with a set of weights trained to recognize different patterns. One might recognize horizontal edges, one might recognize vertical edges, one might recognize diagonal edges from top-left to bottom-right and one might recognize diagonal edges from bottom-left to top-right. Each of these filters would produce one depth layer of the output with values where their respective edges were detected. Later layers of our network would be able to act on this information and build up even more complex representations of the input.
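Stacking filters can be sketched in NumPy as follows. The helper name conv2d_multi and the random weights are assumptions for illustration. The filters are stored with shape (height, width, input depth, output depth), which is the layout tf.nn.conv2d expects for its weights.

```python
import numpy as np

def conv2d_multi(image, filters):
    """image: (H, W, in_depth); filters: (fh, fw, in_depth, out_depth). No padding, stride 1."""
    fh, fw, in_depth, out_depth = filters.shape
    out_h = image.shape[0] - fh + 1
    out_w = image.shape[1] - fw + 1
    output = np.zeros((out_h, out_w, out_depth))
    for f in range(out_depth):  # each filter works independently of the others
        for row in range(out_h):
            for col in range(out_w):
                patch = image[row:row + fh, col:col + fw, :]
                output[row, col, f] = np.sum(patch * filters[:, :, :, f])
    return output

rng = np.random.RandomState(0)
image = rng.normal(size=(5, 5, 2))
filters = rng.normal(size=(3, 3, 2, 4))  # four distinct 3x3x2 filters

output = conv2d_multi(image, filters)
print(output.shape)  # (3, 3, 4)
```

Each of the four filters contributes exactly one depth layer of the output.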

Next up

There is a lot to process in this post. We’ve seen a brand new building block for our neural networks called the convolutional filter and a myriad of ways to customize it. In the next post we’ll implement our first convolutional neural network in TensorFlow and try to better understand practical ways to use this building block to build a better digit recognizer.

LTFN 3: Deeper Networks

Part of the series Learn TensorFlow Now

In the last post, we saw our network achieve about 60% accuracy. One common way to improve a neural network’s performance is to make it deeper. Before we start adding layers to our network, it’s worth taking a moment to explore one of the key advantages of deep neural networks.

Historically, a lot of effort was invested in crafting hand-engineered features that could be fed to shallow networks (or other learning algorithms). In image detection we might modify the input to highlight horizontal or vertical edges. In voice recognition we might filter out noise or various frequencies not typically found in human speech. Unfortunately, hand-engineering features often required years of expertise and lots of time.

Below is a network created with TensorFlow Playground that demonstrates this point. By feeding modified versions of the input to a shallow network, we are able to train it to recognize a non-linear spiral pattern.

A shallow network requires various modifications to the input features to classify the “Swiss Roll” problem.

A shallow network is capable of learning complex patterns only when fed modified versions of the input. A key idea behind deep learning is to do away with hand-engineered features whenever possible. Instead, by making the network deeper, we can convince the network to learn the features it really needs to solve the problem. In image recognition, the first few layers of the network learn to recognize simple features (eg. edge detection), while deeper layers respond to more complex features (eg. human faces). Below, we’ve made the network deeper and removed all dependencies on additional features.

A deep network is capable of classifying the points in a “Swiss Roll” using only the original input.

Making our network deeper

Let’s try making our network deeper by adding two more layers. We’ll replace layer1_weights and layer1_bias with the following:

	layer1_weights = tf.Variable(tf.random_normal([784, 500]))
	layer1_bias = tf.Variable(tf.zeros([500]))
	layer1_output = tf.nn.relu(tf.matmul(input, layer1_weights) + layer1_bias)

	layer2_weights = tf.Variable(tf.random_normal([500, 500]))
	layer2_bias = tf.Variable(tf.zeros([500]))
	layer2_output = tf.nn.relu(tf.matmul(layer1_output, layer2_weights) + layer2_bias)

	layer3_weights = tf.Variable(tf.random_normal([500, 10]))
	layer3_bias = tf.Variable(tf.zeros([10]))
	logits = tf.matmul(layer2_output, layer3_weights) + layer3_bias

Note: When discussing the network’s shapes, I ignore the batch dimension. For example, where a shape is [None, 784] I will refer to it as a vector with 784 elements. I find it helps to imagine a batch size of 1 to avoid having to think about more complex shapes.

The first thing to notice is the change in shape. layer1 now accepts an input of 784 values and produces an intermediate vector layer1_output with 500 elements. We then take these 500 values through layer2 which also produces an intermediate vector layer2_output with 500 elements. Finally, we take these 500 values through layer3 and produce our logit vector with 10 elements.

Why did I choose 500 elements? No reason, it was just an arbitrary value that seemed to work. If you’re following along at home, you could try adding more layers or making them wider (ie. use a size larger than 500).

ReLU

Another important change is the addition of tf.nn.relu() in layer1 and layer2. Note that it is applied to the result of the matrix multiplication of the previous layer’s output with the current layer’s weights.

So what is a ReLU? ReLU stands for “Rectified Linear Unit” and is an activation function. An activation function is applied to the output of each layer of a neural network. It turns out that if we don’t include activation functions, it can be mathematically shown (by people much smarter than me) that our three layer network is equivalent to a single layer network. This is obviously a BadThing™ as it means we lose all the advantages of building a deep neural network.

I’m (very obviously) glossing over the details here, so if you’re new to neural networks and want to learn more see: Why do you need non-linear activation functions?
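For a quick sanity check of that claim, here is a tiny NumPy sketch with two layers instead of three. The weights and input are made-up numbers chosen so the arithmetic is easy to follow by hand.

```python
import numpy as np

x = np.array([[1.0, -1.0]])
w1 = np.array([[1.0, 2.0],
               [3.0, 4.0]])
b1 = np.array([0.5, 0.5])
w2 = np.array([[1.0],
               [1.0]])
b2 = np.array([1.0])

# Two layers with no activation function in between.
two_layers = (x @ w1 + b1) @ w2 + b2

# A single layer with combined weights and bias computes exactly the same thing.
w_combined = w1 @ w2
b_combined = b1 @ w2 + b2
one_layer = x @ w_combined + b_combined

# With a ReLU between the layers, the single-layer shortcut no longer matches.
with_relu = np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

print(two_layers)  # [[-2.]]
print(one_layer)   # [[-2.]]
print(with_relu)   # [[1.]]
```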

Other historical activation functions include sigmoid and tanh. These days, ReLU is almost always the right choice of activation function and we’ll be using it exclusively for our networks.

Graphs for ReLU, sigmoid and tanh functions
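For reference, all three functions are one-liners in NumPy. This is only a sketch of the math; in our networks we use TensorFlow’s built-in tf.nn.relu.

```python
import numpy as np

def relu(x):
    """Pass positive values through unchanged and clamp negative values to zero."""
    return np.maximum(x, 0.0)

def sigmoid(x):
    """Squash values into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Squash values into the range (-1, 1)."""
    return np.tanh(x)

x = np.array([-2.0, 0.0, 2.0])
print(relu(x))
print(sigmoid(x))
print(tanh(x))
```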

Learning Rate

Finally, one other small change needs to be made: The learning rate needs to be changed from 0.01 to 0.0001. Learning rate is one of the most important, but most finicky hyperparameters to choose when training your network. Too small and the network takes a very long time to train, too large and your network doesn’t converge. In later posts we’ll look at methods that can help with this, but for now I’ve just used the ol’ fashioned “Guess and Check” method until I found a learning rate that worked well.
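To get a feel for why the learning rate matters, here is a toy sketch in plain Python. It runs gradient descent on f(w) = w², which is an assumption chosen for simplicity and has nothing to do with our MNIST network, so the specific learning rates below don’t carry over.

```python
def minimize(learning_rate, steps=50, start=10.0):
    """Run gradient descent on f(w) = w**2, whose gradient is 2*w. The minimum is at w = 0."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * 2 * w
    return w

too_small = minimize(0.0001)  # barely moves away from the starting value
reasonable = minimize(0.1)    # ends up very close to 0
too_large = minimize(1.1)     # every step overshoots by more than the last

print(too_small)
print(reasonable)
print(too_large)
```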

Alchemy of Hyperparameters

We’ve started to see a few hyperparameters that we must choose when building a neural network:

Number of layers
Width of layers
Learning rate

It’s an uncomfortable reality that we have no good way to choose values for these hyperparameters. What’s worse is that we typically can’t explain why a certain hyperparameter value works well and others do not. The only reassurance I can offer is:

Other people think this is a problem
As you build more networks, you’ll develop a rough intuition for choosing hyperparameter values

Putting it all together

Now that we’ve chosen a learning rate and created more intermediate layers, let’s put it all together and see how our network performs.

	import tensorflow as tf
	from tensorflow.examples.tutorials.mnist import input_data
	mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

	train_images = mnist.train.images
	train_labels = mnist.train.labels
	test_images = mnist.test.images
	test_labels = mnist.test.labels

	graph = tf.Graph()
	with graph.as_default():
	    input = tf.placeholder(tf.float32, shape=(None, 784))
	    labels = tf.placeholder(tf.float32, shape=(None, 10))

	    #Add our three layers
	    layer1_weights = tf.Variable(tf.random_normal([784, 500]))
	    layer1_bias = tf.Variable(tf.zeros([500]))
	    layer1_output = tf.nn.relu(tf.matmul(input, layer1_weights) + layer1_bias)

	    layer2_weights = tf.Variable(tf.random_normal([500, 500]))
	    layer2_bias = tf.Variable(tf.zeros([500]))
	    layer2_output = tf.nn.relu(tf.matmul(layer1_output, layer2_weights) + layer2_bias)

	    layer3_weights = tf.Variable(tf.random_normal([500, 10]))
	    layer3_bias = tf.Variable(tf.zeros([10]))
	    logits = tf.matmul(layer2_output, layer3_weights) + layer3_bias

	    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))

	    #Use a smaller learning rate
	    learning_rate = 0.0001
	    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

	    predictions = tf.nn.softmax(logits)
	    correct_prediction = tf.equal(tf.argmax(labels, 1), tf.argmax(predictions, 1))
	    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

	    with tf.Session(graph=graph) as session:
	        tf.global_variables_initializer().run()

	        num_steps = 5000
	        batch_size = 100
	        for step in range(num_steps):
	            #Wrap around the training set once we reach the end
	            offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
	            batch_images = train_images[offset:(offset + batch_size), :]
	            batch_labels = train_labels[offset:(offset + batch_size), :]
	            feed_dict = {input: batch_images, labels: batch_labels}

	            _, c, acc = session.run([optimizer, cost, accuracy], feed_dict=feed_dict)

	            if step % 100 == 0:
	                print("Cost: ", c)
	                print("Accuracy: ", acc * 100.0, "%")

	        #Test
	        num_test_batches = int(len(test_images) / batch_size)
	        total_accuracy = 0
	        total_cost = 0
	        for step in range(num_test_batches):
	            #Each test batch is visited exactly once, so no wrap-around is needed
	            offset = step * batch_size
	            batch_images = test_images[offset:(offset + batch_size), :]
	            batch_labels = test_labels[offset:(offset + batch_size), :]
	            feed_dict = {input: batch_images, labels: batch_labels}

	            #Do not run optimizer here: we must not train on the test set
	            c, acc = session.run([cost, accuracy], feed_dict=feed_dict)
	            total_cost = total_cost + c
	            total_accuracy = total_accuracy + acc

	        print("Test Cost: ", total_cost / num_test_batches)
	        print("Test accuracy: ", total_accuracy * 100.0 / num_test_batches, "%")

After running this code you should see output similar to:

Cost:  4596.864
Accuracy:  7.999999821186066 %
Cost:  882.4881
Accuracy:  30.000001192092896 %
Cost:  609.4177
Accuracy:  51.99999809265137 %
Cost:  494.5303
Accuracy:  56.00000023841858 %

...

Cost:  57.793114
Accuracy:  89.99999761581421 %
Cost:  148.92995
Accuracy:  81.00000023841858 %
Cost:  67.42319
Accuracy:  89.99999761581421 %
Test Cost:  107.98408660641905
Test accuracy:  85.74999994039536 %

Our network has improved from 60% accuracy to 85% accuracy. This is great progress; clearly things are moving in the right direction! Next week we’ll look at a more complicated neural network structure called a “Convolutional Neural Network” which is one of the basic building blocks of today’s top image classifiers.

For the sake of completeness, I’ve included a TensorBoard visualization of the network we’ve created below:

Visualization of our three-layer network with `layer1` expanded. Notice the addition of `layer1_output` following the addition with `layer1_bias`. This represents the ReLU activation function.

LTFN 2: Graphs and Shapes

Part of the series Learn TensorFlow Now

TensorFlow Graphs

Before we improve our network, we have to take a moment to chat about TensorFlow graphs. As we saw in the previous post, we follow two steps when using TensorFlow:

Create a computational graph
Run data through the graph using tf.Session.run()

Let’s take a look at what’s actually happening when we call tf.Session.run(). Consider our graph and session code from last time:

	o, c = session.run([optimizer, cost], feed_dict=feed_dict)

When we pass optimizer and cost to session.run(), TensorFlow looks at the dependencies for these two nodes. For example, we can see above that optimizer depends on:

cost
layer1_weights
layer1_bias
input

We can also see that cost depends on:

logits
labels

When we wish to evaluate optimizer and cost, TensorFlow first runs all the operations defined by the previous nodes, then calculates the required results and returns them. Since every node ends up being a dependency of optimizer and cost, this means that every operation in our TensorFlow graph is executed with every call to session.run().

But what if we don’t want to run every operation? If we want to pass test data to our network, we don’t want to run the operations defined by optimizer. (After all, we don’t want to train our network on our test set!) Instead, we’d just want to extract predictions from logits. In that case, we could instead run our network as follows:

	batch_images = test_images[offset:(offset + batch_size), :] # Note: test images
	feed_dict = {input: batch_images} # Note: No labels
	l = session.run([logits], feed_dict=feed_dict) # Only asking for logits

This would execute only the subset of nodes required to compute the values of logits, highlighted below:

Our computational graph with only dependencies of `logits` highlighted in orange.

Note: As labels is not one of the dependencies of logits we don’t need to provide it.

Understanding the dependencies of the computational graphs we create is important. We should always try to be aware of exactly what operations will be running when we call session.run() to avoid accidentally running the wrong operations.
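One way to build intuition for this is to model the graph as a plain dictionary and walk its dependencies. The sketch below is a toy model written in ordinary Python, not how TensorFlow is implemented internally; the dependency map is a simplified, hand-written version of our single-layer network.

```python
# Hypothetical, simplified dependency map: each node lists the nodes it depends on.
dependencies = {
    "optimizer": ["cost", "layer1_weights", "layer1_bias", "input"],
    "cost": ["logits", "labels"],
    "logits": ["input", "layer1_weights", "layer1_bias"],
    "input": [],
    "labels": [],
    "layer1_weights": [],
    "layer1_bias": [],
}

def required_nodes(targets):
    """Return every node that has to be evaluated in order to compute `targets`."""
    needed = set()
    stack = list(targets)
    while stack:
        node = stack.pop()
        if node not in needed:
            needed.add(node)
            stack.extend(dependencies[node])
    return needed

print(sorted(required_nodes(["logits"])))
# ['input', 'layer1_bias', 'layer1_weights', 'logits']
print("labels" in required_nodes(["logits"]))             # False
print("labels" in required_nodes(["optimizer", "cost"]))  # True
```

Asking only for logits never touches labels, cost or optimizer, which is why we can leave labels out of the feed_dict in that case.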

Shapes

Another important topic to understand is how TensorFlow shapes work. In our previous post all our shapes were completely defined. Consider the following tf.placeholder definitions for input and labels:

	input = tf.placeholder(tf.float32, shape=(100, 784))
	labels = tf.placeholder(tf.float32, shape=(100, 10))

We have defined these tensors to have a 2-D shape of precisely (100, 784) and (100, 10). This restricts us to a computational graph that always expects 100 images at a time. What if we have a training set that isn’t divisible by 100? What if we want to test on single images?

The answer is to use dynamic shapes. In places where we’re not sure what shape we would like to support, we just substitute in None. For example, if we want to allow variable batch sizes, we simply write:

	input = tf.placeholder(tf.float32, shape=(None, 784))
	labels = tf.placeholder(tf.float32, shape=(None, 10))

Now we can pass in batch sizes of 1, 10, 283 or any other size we’d like. From this point on, we’ll be defining all of our tf.placeholder nodes in this fashion.

Accuracy

One important question remains: “How well is our network doing?” In the previous post, we saw cost decreasing, but we had no concrete metric against which we could compare our network. We’ll keep things simple and use accuracy as our metric. We just want to measure the fraction of correct predictions:

	predictions = tf.nn.softmax(logits)
	correct_prediction = tf.equal(tf.argmax(labels, 1), tf.argmax(predictions, 1))
	accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

In the first line, we convert logits to a set of predictions using tf.nn.softmax. Remember that our labels are 1-hot encoded, meaning each one contains 10 numbers, one of which is 1. logits is the same shape, but the values in logits can be almost anything. (eg. values in logits could be -4, 234, 0.5 and so on). We want our predictions to have a few qualities that logits does not possess:

The sum of the values in predictions for a given image should be 1
No values in predictions should be greater than 1
No values in predictions should be negative
The highest value in predictions will be our prediction for a given image. (We can use argmax to find this)

Applying tf.nn.softmax() to logits gives us these desired properties. For more details on softmax, watch this video by Andrew Ng.

The second line takes the argmax of our predictions and of our labels. Then tf.equal creates a vector that contains True where the values match and False where they don’t.

Finally, we use tf.reduce_mean to calculate the average number of times we get the prediction correct for this batch. We store this result in accuracy.
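The same three steps can be mirrored in NumPy to see what they do to actual numbers. The logits and labels below are made up (two images, three classes instead of ten), and the softmax helper is my own sketch rather than TensorFlow’s implementation.

```python
import numpy as np

def softmax(logits):
    """Turn each row of logits into non-negative values that sum to 1."""
    shifted = logits - np.max(logits, axis=1, keepdims=True)  # avoids overflow in exp
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=1, keepdims=True)

logits = np.array([[-4.0, 234.0, 0.5],
                   [2.0, 1.0, 0.1]])
labels = np.array([[0.0, 1.0, 0.0],   # first image is class 1
                   [0.0, 0.0, 1.0]])  # second image is class 2

predictions = softmax(logits)
correct_prediction = np.argmax(labels, axis=1) == np.argmax(predictions, axis=1)
accuracy = np.mean(correct_prediction.astype(np.float32))

print(predictions.sum(axis=1))  # each row sums to 1
print(correct_prediction)       # first prediction is right, second is wrong
print(accuracy)                 # 0.5
```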

Putting it all together

Now that we better understand TensorFlow graphs and shapes, and have a metric with which to judge our algorithm, let’s put it all together to evaluate our performance on the test set after training has finished.

Note that almost all of the new code relates to running the test set.

	import tensorflow as tf
	from tensorflow.examples.tutorials.mnist import input_data
	mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

	train_images = mnist.train.images
	train_labels = mnist.train.labels
	test_images = mnist.test.images
	test_labels = mnist.test.labels

	graph = tf.Graph()
	with graph.as_default():
	    input = tf.placeholder(tf.float32, shape=(None, 784))
	    labels = tf.placeholder(tf.float32, shape=(None, 10))

	    layer1_weights = tf.Variable(tf.random_normal([784, 10]))
	    layer1_bias = tf.Variable(tf.zeros([10]))

	    logits = tf.matmul(input, layer1_weights) + layer1_bias
	    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))

	    learning_rate = 0.01
	    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

	    #Add a few nodes to calculate accuracy and optionally retrieve predictions
	    predictions = tf.nn.softmax(logits)
	    correct_prediction = tf.equal(tf.argmax(labels, 1), tf.argmax(predictions, 1))
	    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

	    with tf.Session(graph=graph) as session:
	        tf.global_variables_initializer().run()

	        num_steps = 2000
	        batch_size = 100
	        for step in range(num_steps):
	            #Wrap around the training set once we reach the end
	            offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
	            batch_images = train_images[offset:(offset + batch_size), :]
	            batch_labels = train_labels[offset:(offset + batch_size), :]
	            feed_dict = {input: batch_images, labels: batch_labels}

	            _, c, acc = session.run([optimizer, cost, accuracy], feed_dict=feed_dict)

	            if step % 100 == 0:
	                print("Cost: ", c)
	                print("Accuracy: ", acc * 100.0, "%")

	        #Test
	        num_test_batches = int(len(test_images) / batch_size)
	        total_accuracy = 0
	        total_cost = 0
	        for step in range(num_test_batches):
	            #Each test batch is visited exactly once, so no wrap-around is needed
	            offset = step * batch_size
	            batch_images = test_images[offset:(offset + batch_size), :]
	            batch_labels = test_labels[offset:(offset + batch_size), :]
	            feed_dict = {input: batch_images, labels: batch_labels}

	            #Note that we do not pass in optimizer here.
	            c, acc = session.run([cost, accuracy], feed_dict=feed_dict)
	            total_cost = total_cost + c
	            total_accuracy = total_accuracy + acc

	        print("Test Cost: ", total_cost / num_test_batches)
	        print("Test accuracy: ", total_accuracy * 100.0 / num_test_batches, "%")

One question you might ask is: Why not just predict all the test images at once, in one big batch of 10,000? The problem is that when we train larger networks on our GPU, we won’t be able to fit all 10,000 images and the required operations in our GPU’s memory. Instead we have to process the test set in batches similar to how we train the network.
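The batching pattern itself is independent of TensorFlow. Below is a plain Python sketch in which evaluate_in_batches and the evaluate_batch callback are hypothetical names of my own; in our code the callback’s role is played by session.run([cost, accuracy], ...). Unlike the listing above, this version weights each batch by its size, so a smaller final batch is handled correctly.

```python
def evaluate_in_batches(images, labels, batch_size, evaluate_batch):
    """Average a per-batch metric over a whole dataset, one batch at a time.

    `evaluate_batch` is a hypothetical callback returning the mean metric for one batch.
    """
    total = 0.0
    count = 0
    for offset in range(0, len(images), batch_size):
        batch_images = images[offset:offset + batch_size]
        batch_labels = labels[offset:offset + batch_size]
        # Weight by batch size so a smaller final batch doesn't skew the average.
        total += evaluate_batch(batch_images, batch_labels) * len(batch_images)
        count += len(batch_images)
    return total / count

# Stand-in "model" that predicts label 0 for every example.
def fake_accuracy(batch_images, batch_labels):
    return sum(1 for label in batch_labels if label == 0) / len(batch_labels)

images = list(range(250))
labels = [i % 2 for i in range(250)]  # half of the labels are 0

result = evaluate_in_batches(images, labels, 100, fake_accuracy)
print(result)  # 0.5
```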

Finally, let’s run it and look at the output. When I run it on my local machine I receive the following:

Cost:  20.207457
Accuracy:  7.999999821186066 %
Cost:  10.040323
Accuracy:  14.000000059604645 %
Cost:  8.528659
Accuracy:  14.000000059604645 %
Cost:  6.8867884
Accuracy:  23.999999463558197 %
Cost:  7.1556334
Accuracy:  21.99999988079071 %
Cost:  6.312024
Accuracy:  28.00000011920929 %
Cost:  4.679361
Accuracy:  34.00000035762787 %
Cost:  5.220028
Accuracy:  34.00000035762787 %
Cost:  5.167577
Accuracy:  23.999999463558197 %
Cost:  3.5488296
Accuracy:  40.99999964237213 %
Cost:  3.2974648
Accuracy:  43.00000071525574 %
Cost:  3.532155
Accuracy:  46.99999988079071 %
Cost:  2.9645846
Accuracy:  56.00000023841858 %
Cost:  3.0816755
Accuracy:  46.99999988079071 %
Cost:  3.0201495
Accuracy:  50.999999046325684 %
Cost:  2.7738256
Accuracy:  60.00000238418579 %
Cost:  2.4169116
Accuracy:  55.000001192092896 %
Cost:  1.944017
Accuracy:  60.00000238418579 %
Cost:  3.5998762
Accuracy:  50.0 %
Cost:  2.8526196
Accuracy:  55.000001192092896 %
Test Cost:  2.392377197146416
Test accuracy:  59.48999986052513 %

So we’re getting a test accuracy of ~60%. This is better than chance, but it’s not as good as we’d like it to be. In the next post, we’ll look at different ways of improving the network.