Part of the series Learn TensorFlow Now
Over the last nine posts, we built a reasonably effective digit classifier. Now we’re ready to enter the big leagues and try out our VGGNet on a more challenging image recognition task. CIFAR-10 (Canadian Institute For Advanced Research) is a collection of 60,000 cropped images of planes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks.
- 50,000 images in the training set
- 10,000 images in the test set
- Size: 32×32 (1024 pixels)
- 3 Channels (RGB)
- 10 output classes

CIFAR-10 is a natural next step due to its similarities to the MNIST dataset. For starters, we have the same number of training images, testing images and output classes. CIFAR-10’s images are of size 32x32, which is convenient as we were padding MNIST’s images to achieve the same size. These similarities make it easy to use our previous VGGNet architecture to classify these images.
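As a quick reminder of what that padding looks like, here is a minimal NumPy sketch. The array mnist_batch is a hypothetical stand-in for real MNIST digits, and the earlier posts may have padded the images differently; the point is only that 28x28 plus two pixels on each side gives the same 32x32 height and width as CIFAR-10.

```python
import numpy as np

# Hypothetical stand-in for a batch of MNIST digits: 5 grayscale 28x28 images.
mnist_batch = np.ones((5, 28, 28), dtype=np.float32)

# Pad 2 pixels of zeros on each side of the height and width axes;
# the batch axis is left untouched.
padded = np.pad(mnist_batch, ((0, 0), (2, 2), (2, 2)), mode="constant")

# Add a trailing channel axis so the layout is (batch, height, width, channels).
padded = padded[..., np.newaxis]
print(padded.shape)  # (5, 32, 32, 1)

# A CIFAR-10 batch has the same height and width but 3 channels instead of 1.
cifar_batch = np.zeros((5, 32, 32, 3), dtype=np.float32)
print(cifar_batch.shape[1:3] == padded.shape[1:3])  # True
```

The only shape difference left is the final axis, which is the channel depth discussed next.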
Despite the similarities, there are some differences that make CIFAR-10 a more challenging image recognition problem. For starters, our images are RGB and therefore have 3 channels. Detecting lines might not be so easy when they can be drawn in any color. Another challenge is that our images are now 2-D depictions of 3-D objects. In the above image, the center two images represent the “truck” class, but are shown at different angles. This means our network has to learn enough about “trucks” to recognize them at angles it has never seen before.
Loading CIFAR-10
The CIFAR-10 dataset is hosted at: https://www.cs.toronto.edu/~kriz/cifar.html
In order to make it easier to work with, I’ve prepared a small script that downloads, shuffles and caches the dataset locally. You can find it on GitHub here.
After saving this file locally, we can use it to prepare our datasets:
import tensorflow as tf
import numpy as np
import cifar_data_loader

(train_images, train_labels, test_images, test_labels, mean_image) = cifar_data_loader.load_data()
print(train_images.shape)
print(train_labels.shape)
print(test_images.shape)
print(test_labels.shape)
print(mean_image.shape)
Running this locally produces the following output:
Attempting to download: https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
0%....5%....10%....15%....20%....25%....30%....35%....40%....45%....50%....55%....60%....65%....70%....75%....80%....85%....90%....95%....100%
Download Complete!
(50000, 32, 32, 3)
(50000,)
(10000, 32, 32, 3)
(10000,)
(32, 32, 3)
The above output shows that we’ve downloaded the dataset and created a training set of size 50,000 and a test set of size 10,000. Note: Unlike MNIST, these labels are not 1-hot encoded (otherwise they’d be of size 50,000x10 and 10,000x10 respectively). We have to account for this difference in shape when we build VGGNet for this dataset.
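To make the difference in shape concrete, the following NumPy sketch converts a few integer labels to their 1-hot form and back. The values in int_labels are made up for illustration; they only mimic the form of train_labels.

```python
import numpy as np

num_classes = 10

# Hypothetical integer labels, in the same form as train_labels.
int_labels = np.array([3, 0, 9, 3])

# Row i of the identity matrix is the 1-hot vector for class i,
# so indexing with the labels builds one row per example.
one_hot_labels = np.eye(num_classes, dtype=np.float32)[int_labels]

print(int_labels.shape)      # (4,)
print(one_hot_labels.shape)  # (4, 10)

# Taking the argmax of each row recovers the integer labels.
recovered = np.argmax(one_hot_labels, axis=1)
print(np.array_equal(recovered, int_labels))  # True
```

Both forms carry the same information; the integer form is just more compact, which is why the rest of the network only needs small adjustments.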
Let’s start by adjusting input and labels to fit the CIFAR-10 dataset:
input = tf.placeholder(tf.float32, shape=(None, 32, 32, 3))  # Input is of size 32x32x3 (RGB images)
labels = tf.placeholder(tf.int32, shape=(None,), name="labels")  # Labels are single integers (tf.int32)
Next we have to adjust the first layer of our network. Recall from the post on convolutions that each convolutional filter must match the depth of the layer against which it is convolved. Previously we had defined our convolutional filter to be of shape [3, 3, 1, 64]. That is, 64 3x3 convolutional filters, each with a depth of 1, matching the depth of our grayscale input image. Now that we’re using RGB images, we must define it to be of shape [3, 3, 3, 64]:
layer1_weights = tf.get_variable("layer1_weights", [3, 3, 3, 64], initializer=tf.contrib.layers.variance_scaling_initializer())
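To see why the filter depth has to match, here is a deliberately naive NumPy sketch of a stride-1, zero-padded convolution over a single image. The function conv2d_same is a hypothetical helper written for this illustration, not TensorFlow's implementation, and like tf.nn.conv2d it slides the filter without flipping it. Each output value sums over the 3x3 window and over every input channel, so a depth-1 filter has nothing to multiply against the second and third channels of an RGB image.

```python
import numpy as np

def conv2d_same(image, filters):
    """Naive stride-1 convolution with zero padding that keeps height and width.

    image:   (height, width, in_channels)
    filters: (k, k, in_channels, out_channels), with k odd
    """
    height, width, in_channels = image.shape
    k, _, filter_depth, out_channels = filters.shape
    if filter_depth != in_channels:
        raise ValueError("filter depth %d does not match input depth %d"
                         % (filter_depth, in_channels))
    pad = k // 2
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    output = np.zeros((height, width, out_channels), dtype=image.dtype)
    for row in range(height):
        for col in range(width):
            patch = padded[row:row + k, col:col + k, :]  # (k, k, in_channels)
            # Sum over the k x k window and over every input channel,
            # producing one value per output filter.
            output[row, col, :] = np.tensordot(patch, filters, axes=3)
    return output

rng = np.random.RandomState(0)
rgb_image = rng.rand(32, 32, 3)         # stand-in for one CIFAR-10 image
rgb_filters = rng.rand(3, 3, 3, 64)     # same shape as layer1_weights
gray_filters = rng.rand(3, 3, 1, 64)    # the old MNIST-style shape

print(conv2d_same(rgb_image, rgb_filters).shape)  # (32, 32, 64)

try:
    conv2d_same(rgb_image, gray_filters)
except ValueError as error:
    print(error)
```

The output depth is set by the number of filters (64), not by the input depth, which is why only the first layer's weights need to change.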
Another change we must make is the calculation of cost. Previously we were using tf.nn.softmax_cross_entropy_with_logits(), which is suitable only when our labels are 1-hot encoded. When we represent the labels as single integers, we can instead use tf.nn.sparse_softmax_cross_entropy_with_logits(). It is otherwise identical to our original softmax cross entropy function.
cost = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=labels))
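The claim that the two functions agree can be checked with a small NumPy sketch. The helpers dense_cross_entropy and sparse_cross_entropy below are illustrative re-implementations of the underlying math, not the TensorFlow ops themselves, and the logits are made-up values.

```python
import numpy as np

def log_softmax(logits):
    # Subtract the row-wise max first for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def dense_cross_entropy(logits, one_hot_labels):
    # The 1-hot vector zeroes out every term except the true class.
    return -(one_hot_labels * log_softmax(logits)).sum(axis=1)

def sparse_cross_entropy(logits, int_labels):
    # Pick the log-probability of the true class directly by index.
    rows = np.arange(logits.shape[0])
    return -log_softmax(logits)[rows, int_labels]

logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])
int_labels = np.array([0, 1])
one_hot_labels = np.eye(3)[int_labels]

dense = dense_cross_entropy(logits, one_hot_labels)
sparse = sparse_cross_entropy(logits, int_labels)
print(np.allclose(dense, sparse))  # True
```

The sparse form simply skips building the 1-hot matrix and indexes the true class directly.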
Finally, we must also modify our calculation of correct_prediction (used to calculate accuracy) to account for the change in label shape. We no longer have to take the tf.argmax of our labels because they’re already represented as a single number:
correct_prediction = tf.equal(labels, tf.argmax(predictions, 1, output_type=tf.int32))
Note: We have to specify output_type=tf.int32 because tf.argmax() returns tf.int64 by default.
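The same accuracy calculation can be sketched in NumPy with a handful of made-up predictions. NumPy is more forgiving than TensorFlow about comparing integers of different widths, so the explicit cast below is only there to mirror the output_type argument.

```python
import numpy as np

# Hypothetical softmax outputs for 4 examples and 3 classes.
predictions = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.8, 0.1],
                        [0.3, 0.3, 0.4],
                        [0.6, 0.3, 0.1]])
labels = np.array([0, 1, 1, 0], dtype=np.int32)

# The predicted class is the index of the largest probability in each row.
predicted_classes = np.argmax(predictions, axis=1).astype(np.int32)

# Compare against the integer labels directly; no argmax of the labels is needed.
correct_prediction = predicted_classes == labels
accuracy = correct_prediction.astype(np.float32).mean()
print(accuracy)  # 0.75
```

Three of the four predictions match their labels, so the accuracy is 75%.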
With that, we’ve got everything we need to test our VGGNet on CIFAR-10. The complete code is presented at the end of this post.
After running our network for 10,000 steps, we’re greeted with the following output:
Cost: 470.996
Accuracy: 9.00000035763 %
Cost: 2.00049
Accuracy: 25.0 %
...
Cost: 0.553867
Accuracy: 82.9999983311 %
Cost: 0.393799
Accuracy: 87.0000004768 %
Test Cost: 0.895597087741
Test accuracy: 70.9400003552 %
Our final test accuracy is approximately 71%, which isn’t great. On one hand this is disappointing, as it means our VGGNet architecture (or the way we’re training it) doesn’t generalize very well. On the other hand, CIFAR-10 gives us an opportunity to try out new neural network components and architectures. In the next few posts we’ll explore some of these approaches to build a neural network that can handle the more complex CIFAR-10 dataset.
If you look carefully at the previous results you may have noticed something interesting. For the first time, our test accuracy (71%) is much lower than our training accuracy (~82-87%). This is a problem we’ll discuss in future posts on bias and variance in deep learning.
Complete Code
import tensorflow as tf
import numpy as np
import cifar_data_loader

(train_images, train_labels, test_images, test_labels, mean_image) = cifar_data_loader.load_data()
print(train_images.shape)
print(train_labels.shape)
print(test_images.shape)
print(test_labels.shape)
print(mean_image.shape)

graph = tf.Graph()
with graph.as_default():
    # Inputs are 32x32 RGB images; labels are single integers (not 1-hot)
    input = tf.placeholder(tf.float32, shape=(None, 32, 32, 3))
    labels = tf.placeholder(tf.int32, shape=(None,), name="labels")

    # First filter depth is 3 to match the RGB input
    layer1_weights = tf.get_variable("layer1_weights", [3, 3, 3, 64], initializer=tf.contrib.layers.variance_scaling_initializer())
    layer1_bias = tf.Variable(tf.zeros([64]))
    layer1_conv = tf.nn.conv2d(input, filter=layer1_weights, strides=[1,1,1,1], padding='SAME')
    layer1_out = tf.nn.relu(layer1_conv + layer1_bias)

    layer2_weights = tf.get_variable("layer2_weights", [3, 3, 64, 64], initializer=tf.contrib.layers.variance_scaling_initializer())
    layer2_bias = tf.Variable(tf.zeros([64]))
    layer2_conv = tf.nn.conv2d(layer1_out, filter=layer2_weights, strides=[1,1,1,1], padding='SAME')
    layer2_out = tf.nn.relu(layer2_conv + layer2_bias)

    pool1 = tf.nn.max_pool(layer2_out, ksize=[1,2,2,1], strides=[1,2,2,1], padding='VALID')

    layer3_weights = tf.get_variable("layer3_weights", [3, 3, 64, 128], initializer=tf.contrib.layers.variance_scaling_initializer())
    layer3_bias = tf.Variable(tf.zeros([128]))
    layer3_conv = tf.nn.conv2d(pool1, filter=layer3_weights, strides=[1,1,1,1], padding='SAME')
    layer3_out = tf.nn.relu(layer3_conv + layer3_bias)

    layer4_weights = tf.get_variable("layer4_weights", [3, 3, 128, 128], initializer=tf.contrib.layers.variance_scaling_initializer())
    layer4_bias = tf.Variable(tf.zeros([128]))
    layer4_conv = tf.nn.conv2d(layer3_out, filter=layer4_weights, strides=[1,1,1,1], padding='SAME')
    layer4_out = tf.nn.relu(layer4_conv + layer4_bias)

    pool2 = tf.nn.max_pool(layer4_out, ksize=[1,2,2,1], strides=[1,2,2,1], padding='VALID')

    layer5_weights = tf.get_variable("layer5_weights", [3, 3, 128, 256], initializer=tf.contrib.layers.variance_scaling_initializer())
    layer5_bias = tf.Variable(tf.zeros([256]))
    layer5_conv = tf.nn.conv2d(pool2, filter=layer5_weights, strides=[1,1,1,1], padding='SAME')
    layer5_out = tf.nn.relu(layer5_conv + layer5_bias)

    layer6_weights = tf.get_variable("layer6_weights", [3, 3, 256, 256], initializer=tf.contrib.layers.variance_scaling_initializer())
    layer6_bias = tf.Variable(tf.zeros([256]))
    layer6_conv = tf.nn.conv2d(layer5_out, filter=layer6_weights, strides=[1,1,1,1], padding='SAME')
    layer6_out = tf.nn.relu(layer6_conv + layer6_bias)

    layer7_weights = tf.get_variable("layer7_weights", [3, 3, 256, 256], initializer=tf.contrib.layers.variance_scaling_initializer())
    layer7_bias = tf.Variable(tf.zeros([256]))
    layer7_conv = tf.nn.conv2d(layer6_out, filter=layer7_weights, strides=[1,1,1,1], padding='SAME')
    layer7_out = tf.nn.relu(layer7_conv + layer7_bias)

    pool3 = tf.nn.max_pool(layer7_out, ksize=[1,2,2,1], strides=[1,2,2,1], padding='VALID')

    layer8_weights = tf.get_variable("layer8_weights", [3, 3, 256, 512], initializer=tf.contrib.layers.variance_scaling_initializer())
    layer8_bias = tf.Variable(tf.zeros([512]))
    layer8_conv = tf.nn.conv2d(pool3, filter=layer8_weights, strides=[1,1,1,1], padding='SAME')
    layer8_out = tf.nn.relu(layer8_conv + layer8_bias)

    layer9_weights = tf.get_variable("layer9_weights", [3, 3, 512, 512], initializer=tf.contrib.layers.variance_scaling_initializer())
    layer9_bias = tf.Variable(tf.zeros([512]))
    layer9_conv = tf.nn.conv2d(layer8_out, filter=layer9_weights, strides=[1,1,1,1], padding='SAME')
    layer9_out = tf.nn.relu(layer9_conv + layer9_bias)

    layer10_weights = tf.get_variable("layer10_weights", [3, 3, 512, 512], initializer=tf.contrib.layers.variance_scaling_initializer())
    layer10_bias = tf.Variable(tf.zeros([512]))
    layer10_conv = tf.nn.conv2d(layer9_out, filter=layer10_weights, strides=[1,1,1,1], padding='SAME')
    layer10_out = tf.nn.relu(layer10_conv + layer10_bias)

    pool4 = tf.nn.max_pool(layer10_out, ksize=[1,2,2,1], strides=[1,2,2,1], padding='VALID')

    # Flatten the final pooling layer for the fully connected layers
    shape = pool4.shape.as_list()
    newShape = shape[1] * shape[2] * shape[3]
    reshaped_pool4 = tf.reshape(pool4, [-1, newShape])

    fc1_weights = tf.get_variable("layer11_weights", [newShape, 4096], initializer=tf.contrib.layers.variance_scaling_initializer())
    fc1_bias = tf.Variable(tf.zeros([4096]))
    fc1_out = tf.nn.relu(tf.matmul(reshaped_pool4, fc1_weights) + fc1_bias)

    fc2_weights = tf.get_variable("layer12_weights", [4096, 10], initializer=tf.contrib.layers.xavier_initializer())
    fc2_bias = tf.Variable(tf.zeros([10]))
    logits = tf.matmul(fc1_out, fc2_weights) + fc2_bias

    # Sparse variant because labels are integers rather than 1-hot vectors
    cost = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=labels))

    learning_rate = 0.001
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)

    # Add a few nodes to calculate accuracy and optionally retrieve predictions
    predictions = tf.nn.softmax(logits)
    correct_prediction = tf.equal(labels, tf.argmax(predictions, 1, output_type=tf.int32))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()

    num_steps = 10000
    batch_size = 100
    for step in range(num_steps):
        # Wrap around the training set once we run past its end
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        batch_images = train_images[offset:(offset + batch_size)]
        batch_labels = train_labels[offset:(offset + batch_size)]
        feed_dict = {input: batch_images, labels: batch_labels}
        _, c, acc = session.run([optimizer, cost, accuracy], feed_dict=feed_dict)
        if step % 100 == 0:
            print("Cost: ", c)
            print("Accuracy: ", acc * 100.0, "%")

    # Test
    num_test_batches = len(test_images) // batch_size
    total_accuracy = 0
    total_cost = 0
    for step in range(num_test_batches):
        # Each test batch is visited exactly once, so no wrap-around is needed
        offset = step * batch_size
        batch_images = test_images[offset:(offset + batch_size)]
        batch_labels = test_labels[offset:(offset + batch_size)]
        feed_dict = {input: batch_images, labels: batch_labels}
        c, acc = session.run([cost, accuracy], feed_dict=feed_dict)
        total_cost = total_cost + c
        total_accuracy = total_accuracy + acc

    print("Test Cost: ", total_cost / num_test_batches)
    print("Test accuracy: ", total_accuracy * 100.0 / num_test_batches, "%")