Part of the series Learn TensorFlow Now
Over the last nine posts, we built a reasonably effective digit classifier. Now we’re ready to enter the big leagues and try out our VGGNet on a more challenging image recognition task. CIFAR-10 (Canadian Institute For Advanced Research) is a collection of 60,000 cropped images of planes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks.
- 50,000 images in the training set
- 10,000 images in the test set
- Size: 32×32 (1024 pixels)
- 3 Channels (RGB)
- 10 output classes
CIFAR-10 is a natural next step due to its similarities to the MNIST dataset. For starters, we have the same number of training images, testing images, and output classes. CIFAR-10's images are of size 32x32, which is convenient as we were padding MNIST's images to achieve that same size. These similarities make it easy to reuse our previous VGGNet architecture to classify these images.
Despite the similarities, there are some differences that make CIFAR-10 a more challenging image recognition problem. For starters, our images are RGB and therefore have 3 channels. Detecting lines might not be so easy when they can be drawn in any color. Another challenge is that our images are now 2-D depictions of 3-D objects. In the above image, the center two images represent the “truck” class, but are shown at different angles. This means our network has to learn enough about “trucks” to recognize them at angles it has never seen before.
The CIFAR-10 dataset is hosted at: https://www.cs.toronto.edu/~kriz/cifar.html
In order to make it easier to work with, I’ve prepared a small script that downloads, shuffles and caches the dataset locally. You can find it on GitHub here.
After saving this file locally, we can use it to prepare our datasets:
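The original snippet isn't reproduced here, but as a sketch of how the loader might be used: assuming the script exposes a `load_data()` function returning `(train_images, train_labels, test_images, test_labels, image_shape)` (the function name and return signature are my assumptions), usage could look like the following. The loader is stubbed with zero-filled arrays matching the shapes the real download produces, so the snippet runs without network access:

```python
import numpy as np

# Hypothetical stand-in for the download script's load_data() -- the real
# script fetches cifar-10-python.tar.gz from cs.toronto.edu. The shapes
# below match the output shown later in the post.
def load_data():
    train_images = np.zeros((50000, 32, 32, 3), dtype=np.uint8)
    train_labels = np.zeros(50000, dtype=np.int64)
    test_images = np.zeros((10000, 32, 32, 3), dtype=np.uint8)
    test_labels = np.zeros(10000, dtype=np.int64)
    image_shape = (32, 32, 3)
    return train_images, train_labels, test_images, test_labels, image_shape

train_images, train_labels, test_images, test_labels, image_shape = load_data()
print(train_images.shape)  # (50000, 32, 32, 3)
print(train_labels.shape)  # (50000,)
print(test_images.shape)   # (10000, 32, 32, 3)
print(test_labels.shape)   # (10000,)
print(image_shape)         # (32, 32, 3)
```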
Running this locally produces the following output:
```
Attempting to download: https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
0%....5%....10%....15%....20%....25%....30%....35%....40%....45%....50%....55%....60%....65%....70%....75%....80%....85%....90%....95%....100%
Download Complete!
(50000, 32, 32, 3)
(50000,)
(10000, 32, 32, 3)
(10000,)
(32, 32, 3)
```
The above output shows that we've downloaded the dataset and created a training set of size 50,000 and a test set of size 10,000. Note: Unlike MNIST, these labels are not 1-hot encoded (otherwise they'd be of shape 50,000x10 and 10,000x10 respectively); each label is a single integer from 0 to 9. We have to account for this difference in shape when we build VGGNet for this dataset.
Let's start by adjusting `labels` to fit the CIFAR-10 dataset:
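To make the shape difference concrete, here is a small numpy illustration (not the post's TensorFlow code) of integer labels versus their 1-hot equivalent:

```python
import numpy as np

labels = np.array([3, 0, 9])  # CIFAR-10-style integer class ids, shape (3,)

# The equivalent 1-hot encoding has shape (3, 10): one column per class,
# with a 1 in the column matching each integer label.
one_hot = np.zeros((labels.size, 10), dtype=np.float32)
one_hot[np.arange(labels.size), labels] = 1.0

print(one_hot[0])  # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
# argmax along the class axis recovers the original integer labels:
assert np.array_equal(one_hot.argmax(axis=1), labels)
```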
Next we have to adjust the first layer of our network. Recall from the post on convolutions that each convolutional filter must match the depth of the layer against which it is convolved. Previously we had defined our convolutional filters to be of shape `[3, 3, 1, 64]`. That is, 64 3x3 convolutional filters, each with a depth of 1, matching the depth of our grayscale input image. Now that we're using RGB images, we must define them to be of shape `[3, 3, 3, 64]`:
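To see why the filter's third dimension must equal the input depth, here is a numpy sketch (an illustration of the arithmetic, not TensorFlow's implementation) of a single filter position over an RGB patch:

```python
import numpy as np

rgb_patch = np.random.rand(3, 3, 3)    # one 3x3 patch of an RGB image (H, W, C)
filters = np.random.rand(3, 3, 3, 64)  # 64 filters of shape [3, 3, 3]

# Each filter spans the full input depth, so one output value per filter is
# an elementwise product summed over height, width, and all 3 channels:
out = np.tensordot(rgb_patch, filters, axes=([0, 1, 2], [0, 1, 2]))
print(out.shape)  # (64,) -- the layer's output depth equals the filter count
```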
Another change we must make is the calculation of `cost`. Previously we were using `tf.nn.softmax_cross_entropy_with_logits()`, which is suitable only when our labels are 1-hot encoded. When we represent the labels as single integers, we can instead use `tf.nn.sparse_softmax_cross_entropy_with_logits()`. It is otherwise identical to our original softmax cross entropy function.
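The equivalence between the two losses can be checked with a numpy sketch: implementing both by hand (a simplified illustration of what the TensorFlow ops compute, ignoring their numerical-stability internals) and confirming that the sparse version on integer labels matches the dense version on the 1-hot encoding of those same labels:

```python
import numpy as np

def softmax_cross_entropy(logits, one_hot_labels):
    # Log-softmax with a max-shift for numerical stability, then the usual
    # cross entropy against 1-hot labels (dense version).
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(one_hot_labels * log_probs).sum(axis=1)

def sparse_softmax_cross_entropy(logits, int_labels):
    # Sparse version: pick out the log-probability of each integer label
    # directly instead of multiplying by a 1-hot vector.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(int_labels)), int_labels]

logits = np.random.rand(4, 10)
int_labels = np.array([2, 7, 0, 9])
one_hot = np.eye(10)[int_labels]

# Both formulations produce the same per-example loss:
assert np.allclose(softmax_cross_entropy(logits, one_hot),
                   sparse_softmax_cross_entropy(logits, int_labels))
```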
Finally, we must also modify our calculation of `correct_prediction` (used to calculate `accuracy`) to account for the change in label shape. We no longer have to take the `tf.argmax` of our labels because they're already represented as a single number:
Note: We have to specify our labels as `tf.int64` because `tf.argmax` returns `tf.int64` by default.
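The same comparison can be sketched in numpy (an illustration of the logic, not the post's TensorFlow snippet): only the logits need an argmax, and the result is compared directly against the integer labels:

```python
import numpy as np

logits = np.array([[0.1, 2.5, 0.3],
                   [1.9, 0.2, 0.1],
                   [0.0, 0.1, 3.2]])
labels = np.array([1, 0, 0], dtype=np.int64)  # integer labels, not 1-hot

# With 1-hot labels we would argmax both sides; with integer labels only
# the logits need an argmax (which yields an integer index per row).
predictions = logits.argmax(axis=1)           # [1, 0, 2]
correct_prediction = predictions == labels    # [True, True, False]
accuracy = correct_prediction.mean() * 100
print(accuracy)  # ~66.67
```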
With that, we’ve got everything we need to test our VGGNet on CIFAR-10. The complete code is presented at the end of this post.
After running our network for 10,000 steps, we're greeted with the following output:
```
Cost: 470.996 Accuracy: 9.00000035763 %
Cost: 2.00049 Accuracy: 25.0 %
...
Cost: 0.553867 Accuracy: 82.9999983311 %
Cost: 0.393799 Accuracy: 87.0000004768 %
Test Cost: 0.895597087741 Test accuracy: 70.9400003552 %
```
Our final test accuracy is approximately 71%, which isn't great. On one hand this is disappointing: it means our VGGNet architecture (or the way we're training it) doesn't generalize very well to CIFAR-10. On the other hand, CIFAR-10 gives us an opportunity to try out different neural network components and architectures. In the next few posts we'll explore some of these approaches to build a network that can handle this more complex dataset.
If you look carefully at the previous results you may have noticed something interesting. For the first time, our test accuracy (71%) is much lower than our training accuracy (~82-87%). This is a problem we’ll discuss in next week’s post on bias and variance in deep learning.