LTFN 3: Deeper Networks

Part of the series Learn TensorFlow Now

In the last post, we saw our network achieve about 60% accuracy. One common way to improve a neural network’s performance is to make it deeper. Before we start adding layers to our network, it’s worth taking a moment to explore one of the key advantages of deep neural networks.

Historically, a lot of effort was invested in crafting hand-engineered features that could be fed to shallow networks (or other learning algorithms). In image detection we might modify the input to highlight horizontal or vertical edges. In voice recognition we might filter out noise or various frequencies not typically found in human speech. Unfortunately, hand-engineering features often required years of expertise and lots of time.

Below is a network created with TensorFlow Playground that demonstrates this point. By feeding modified versions of the input to a shallow network, we are able to train it to recognize a non-linear spiral pattern.

A shallow network requires various modifications to the input features to classify the “Swiss Roll” problem.
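To give a feel for what "modified versions of the input" means, here is a minimal sketch of the kind of hand-engineered features TensorFlow Playground lets you switch on: squares, a product, and sines of the two raw coordinates. The function name `engineered_features` and the sample point are made up for illustration, and the exact set of features used in the figure above is not stated, so treat this as an assumption rather than a reproduction.

```python
import math

def engineered_features(x1, x2):
    """Return the raw inputs plus hand-crafted transformations of them."""
    return [
        x1, x2,                      # raw coordinates
        x1 * x1, x2 * x2,            # squared terms
        x1 * x2,                     # interaction term
        math.sin(x1), math.sin(x2),  # periodic terms
    ]

# A hypothetical 2-D point; a shallow network would be fed all seven values
# instead of just the two raw coordinates.
point = (0.5, -2.0)
features = engineered_features(*point)
print(len(features))  # 7
```

Every one of these transformations had to be chosen by a person, which is exactly the effort a deeper network is meant to spare us.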

A shallow network is capable of learning complex patterns only when fed modified versions of the input. A key idea behind deep learning is to do away with hand-engineered features whenever possible. Instead, by making the network deeper, we let the network learn the features it needs to solve the problem on its own. In image recognition, the first few layers of the network learn to recognize simple features (e.g. edges), while deeper layers respond to more complex features (e.g. human faces). Below, we’ve made the network deeper and removed all dependencies on additional features.

A deep network is capable of classifying the points in a “Swiss Roll” using only the original input.

Making our network deeper

Let’s try making our network deeper by adding two more layers. We’ll replace layer1_weights and layer1_bias with the following:

	layer1_weights = tf.Variable(tf.random_normal([784, 500]))
	layer1_bias = tf.Variable(tf.zeros([500]))
	layer1_output = tf.nn.relu(tf.matmul(input, layer1_weights) + layer1_bias)

	layer2_weights = tf.Variable(tf.random_normal([500, 500]))
	layer2_bias = tf.Variable(tf.zeros([500]))
	layer2_output = tf.nn.relu(tf.matmul(layer1_output, layer2_weights) + layer2_bias)

	layer3_weights = tf.Variable(tf.random_normal([500, 10]))
	layer3_bias = tf.Variable(tf.zeros([10]))
	logits = tf.matmul(layer2_output, layer3_weights) + layer3_bias


Note: When discussing the network’s shapes, I ignore the batch dimension. For example, where a shape is [None, 784] I will refer to it as a vector with 784 elements. I find it helps to imagine a batch size of 1 to avoid having to think about more complex shapes.

The first thing to notice is the change in shape. layer1 now accepts an input of 784 values and produces an intermediate vector layer1_output with 500 elements. We then take these 500 values through layer2 which also produces an intermediate vector layer2_output with 500 elements. Finally, we take these 500 values through layer3 and produce our logit vector with 10 elements.
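If you want to convince yourself of these shapes without running TensorFlow, the sketch below pushes a single fake "image" (batch size of 1) through three layers in plain Python. The helpers `random_matrix` and `dense` are illustrative names of my own, not part of TensorFlow, and the random weights only mimic what tf.random_normal would produce.

```python
import random

def random_matrix(rows, cols):
    """A rows x cols matrix of standard-normal values."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def dense(vector, weights, bias, relu=True):
    """Compute vector @ weights + bias, optionally followed by ReLU."""
    outputs = []
    for j in range(len(bias)):
        total = bias[j] + sum(vector[i] * weights[i][j] for i in range(len(vector)))
        outputs.append(max(0.0, total) if relu else total)
    return outputs

sizes = [784, 500, 500, 10]
image = [random.random() for _ in range(sizes[0])]  # stand-in for one MNIST image

activations = image
shapes = [len(activations)]
for layer, (n_in, n_out) in enumerate(zip(sizes[:-1], sizes[1:]), start=1):
    is_last = layer == len(sizes) - 1  # the last layer produces raw logits
    weights = random_matrix(n_in, n_out)
    bias = [0.0] * n_out
    activations = dense(activations, weights, bias, relu=not is_last)
    shapes.append(len(activations))

print(shapes)  # [784, 500, 500, 10]
```

Each weight matrix has as many rows as the vector coming in and as many columns as the vector going out, which is why the shapes in the code above are [784, 500], [500, 500] and [500, 10].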

Why did I choose 500 elements? No reason, it was just an arbitrary value that seemed to work. If you’re following along at home, you could try adding more layers or making them wider (i.e. use a size larger than 500).

ReLU

Another important change is the addition of tf.nn.relu() in layer1 and layer2. Note that it is applied to the result of multiplying the previous layer’s output by the current layer’s weights and adding the bias.

So what is a ReLU? ReLU stands for “Rectified Linear Unit” and is an activation function: it returns its input when the input is positive and 0 otherwise. An activation function is applied to the output of a layer before that output is passed on to the next layer. It turns out that if we don’t include activation functions, it can be mathematically shown (by people much smarter than me) that our three-layer network is equivalent to a single-layer network, because a chain of matrix multiplications and additions is itself just one matrix multiplication and one addition. This is obviously a BadThing™ as it means we lose all the advantages of building a deep neural network.

I’m (very obviously) glossing over the details here, so if you’re new to neural networks and want to learn more see: Why do you need non-linear activation functions?
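For the skeptical, here is a small numerical check of the claim that layers without activation functions collapse into one. It is a sketch in plain Python with tiny, made-up layer sizes (4, 3, 3, 2) rather than the 784/500/500/10 used in this post; the helper names are my own.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add_bias(rows, bias):
    """Add a bias vector to every row of a matrix."""
    return [[value + b for value, b in zip(row, bias)] for row in rows]

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def rand_vector(size):
    return [random.gauss(0.0, 1.0) for _ in range(size)]

random.seed(0)
x = rand_matrix(1, 4)  # one input with 4 features
w1, b1 = rand_matrix(4, 3), rand_vector(3)
w2, b2 = rand_matrix(3, 3), rand_vector(3)
w3, b3 = rand_matrix(3, 2), rand_vector(2)

# Three layers with no activation function in between.
h1 = add_bias(matmul(x, w1), b1)
h2 = add_bias(matmul(h1, w2), b2)
deep = add_bias(matmul(h2, w3), b3)

# One layer whose weights and bias are built from the three above:
#   W = W1 W2 W3   and   b = (b1 W2 + b2) W3 + b3
w = matmul(matmul(w1, w2), w3)
b = add_bias(matmul(add_bias(matmul([b1], w2), b2), w3), b3)[0]
shallow = add_bias(matmul(x, w), b)

max_diff = max(abs(d - s) for d, s in zip(deep[0], shallow[0]))
print(max_diff < 1e-9)  # True: the two networks compute the same function
```

Putting a ReLU between the layers breaks this algebra, because max(0, x) cannot be folded into a matrix multiplication, and that is what lets depth add expressive power.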

Other historical activation functions include sigmoid and tanh. These days, ReLU is usually a good default choice of activation function and we’ll be using it exclusively for our networks.

Graphs for ReLU, sigmoid and tanh functions
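The three functions in the graphs are short enough to write out by hand. This is a plain-Python sketch of their standard definitions, applied to a single number at a time; TensorFlow's own versions (such as tf.nn.relu) apply the same idea element-wise to whole tensors.

```python
import math

def relu(x):
    """Rectified Linear Unit: pass positive values through, clamp the rest to 0."""
    return max(0.0, x)

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Squash any real number into the range (-1, 1)."""
    return math.tanh(x)

for x in (-2.0, 0.0, 2.0):
    print(x, relu(x), round(sigmoid(x), 3), round(tanh(x), 3))
```

Notice that sigmoid and tanh flatten out for large positive or negative inputs, while ReLU keeps growing for positive inputs.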

Learning Rate

Finally, one other small change needs to be made: the learning rate needs to be changed from 0.01 to 0.0001. The learning rate is one of the most important, but also one of the most finicky, hyperparameters to choose when training your network. If it is too small, the network takes a very long time to train; if it is too large, the network doesn’t converge. In later posts we’ll look at methods that can help with this, but for now I’ve just used the ol’ fashioned “Guess and Check” method until I found a learning rate that worked well.
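The "too small" and "too large" failure modes are easy to see on a toy problem. The sketch below runs plain gradient descent on f(w) = w², whose minimum is at w = 0. Both the function and the three learning rates are assumptions picked to make the effect visible; they say nothing about which rate suits the MNIST network in this post.

```python
def gradient_descent(learning_rate, start=10.0, num_steps=50):
    """Minimize f(w) = w**2 and return the final value of w."""
    w = start
    for _ in range(num_steps):
        gradient = 2.0 * w  # derivative of w**2
        w = w - learning_rate * gradient
    return w

for lr in (0.0001, 0.1, 1.1):
    print(lr, abs(gradient_descent(lr)))
```

With a rate of 0.0001, w has barely moved from 10 after 50 steps. With 0.1 it ends up very close to 0. With 1.1 every step overshoots the minimum by more than it corrects, so w grows instead of shrinking.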

Alchemy of Hyperparameters

We’ve started to see a few hyperparameters that we must choose when building a neural network:

Number of layers
Width of layers
Learning rate

It’s an uncomfortable reality that we have no good way to choose values for these hyperparameters. What’s worse is that we typically can’t explain why a certain hyperparameter value works well and others do not. The only reassurance I can offer is:

Other people think this is a problem
As you build more networks, you’ll develop a rough intuition for choosing hyperparameter values
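One way to make “Guess and Check” a little more systematic is to write down the candidate values and enumerate every combination. The sketch below only builds the list of configurations; training a network for each one is left out. Apart from the values used in this post (3 layers, width 500, learning rates 0.01 and 0.0001), the options are hypothetical.

```python
import itertools

# Candidate values for the three hyperparameters discussed above.
num_layers_options = [2, 3, 4]
layer_width_options = [250, 500, 1000]
learning_rate_options = [0.01, 0.001, 0.0001]

candidates = [
    {"num_layers": n, "layer_width": w, "learning_rate": lr}
    for n, w, lr in itertools.product(
        num_layers_options, layer_width_options, learning_rate_options
    )
]

print(len(candidates))  # 27
```

Even three options per hyperparameter already gives 27 networks to train, which hints at why choosing hyperparameters gets expensive quickly.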

Putting it all together

Now that we’ve chosen a learning rate and created more intermediate layers, let’s put it all together and see how our network performs.

	import tensorflow as tf
	from tensorflow.examples.tutorials.mnist import input_data
	mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

	train_images = mnist.train.images
	train_labels = mnist.train.labels
	test_images = mnist.test.images
	test_labels = mnist.test.labels

	graph = tf.Graph()
	with graph.as_default():
	    input = tf.placeholder(tf.float32, shape=(None, 784))
	    labels = tf.placeholder(tf.float32, shape=(None, 10))

	    # Add our three layers
	    layer1_weights = tf.Variable(tf.random_normal([784, 500]))
	    layer1_bias = tf.Variable(tf.zeros([500]))
	    layer1_output = tf.nn.relu(tf.matmul(input, layer1_weights) + layer1_bias)

	    layer2_weights = tf.Variable(tf.random_normal([500, 500]))
	    layer2_bias = tf.Variable(tf.zeros([500]))
	    layer2_output = tf.nn.relu(tf.matmul(layer1_output, layer2_weights) + layer2_bias)

	    # The last layer has no ReLU: it produces the raw logits
	    layer3_weights = tf.Variable(tf.random_normal([500, 10]))
	    layer3_bias = tf.Variable(tf.zeros([10]))
	    logits = tf.matmul(layer2_output, layer3_weights) + layer3_bias

	    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))

	    # Use a smaller learning rate
	    learning_rate = 0.0001
	    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

	    predictions = tf.nn.softmax(logits)
	    correct_prediction = tf.equal(tf.argmax(labels, 1), tf.argmax(predictions, 1))
	    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

	with tf.Session(graph=graph) as session:
	    tf.global_variables_initializer().run()

	    # Train
	    num_steps = 5000
	    batch_size = 100
	    for step in range(num_steps):
	        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
	        batch_images = train_images[offset:(offset + batch_size), :]
	        batch_labels = train_labels[offset:(offset + batch_size), :]
	        feed_dict = {input: batch_images, labels: batch_labels}

	        _, c, acc = session.run([optimizer, cost, accuracy], feed_dict=feed_dict)

	        if step % 100 == 0:
	            print("Cost: ", c)
	            print("Accuracy: ", acc * 100.0, "%")

	    # Test: evaluate only, without running the optimizer,
	    # so the test data never updates the weights
	    num_test_batches = len(test_images) // batch_size
	    total_accuracy = 0
	    total_cost = 0
	    for step in range(num_test_batches):
	        offset = step * batch_size
	        batch_images = test_images[offset:(offset + batch_size), :]
	        batch_labels = test_labels[offset:(offset + batch_size), :]
	        feed_dict = {input: batch_images, labels: batch_labels}

	        c, acc = session.run([cost, accuracy], feed_dict=feed_dict)
	        total_cost = total_cost + c
	        total_accuracy = total_accuracy + acc

	    print("Test Cost: ", total_cost / num_test_batches)
	    print("Test accuracy: ", total_accuracy * 100.0 / num_test_batches, "%")


After running this code you should see output similar to:

Cost:  4596.864
Accuracy:  7.999999821186066 %
Cost:  882.4881
Accuracy:  30.000001192092896 %
Cost:  609.4177
Accuracy:  51.99999809265137 %
Cost:  494.5303
Accuracy:  56.00000023841858 %

...

Cost:  57.793114
Accuracy:  89.99999761581421 %
Cost:  148.92995
Accuracy:  81.00000023841858 %
Cost:  67.42319
Accuracy:  89.99999761581421 %
Test Cost:  107.98408660641905
Test accuracy:  85.74999994039536 %

Our network has improved from about 60% accuracy to about 85% accuracy. This is great progress; clearly things are moving in the right direction! Next week we’ll look at a more complicated neural network structure called a “Convolutional Neural Network”, which is one of the basic building blocks of today’s top image classifiers.

For the sake of completeness, I’ve included a TensorBoard visualization of the network we’ve created below:


Published by joshvarty
