LTFN 6: Weight Initialization

Part of the series Learn TensorFlow Now

At the conclusion of the previous post, we realized that our first convolutional net wasn’t performing very well. It had a comparatively high cost (something we hadn’t seen before) and was performing slightly worse than a fully-connected network with the same number of layers.

Test results from 4-layer fully connected network:

Test Cost:  107.98408660641905
Test accuracy:  85.74999994039536 %

Test results from 4-layer Conv Net:

Test Cost: 15083.0833307
Test accuracy: 81.8799999356 %


As a refresher, here’s a visualization of the 4-layer ConvNet we built in the last post:

Visualization of layers from input through pool2.

So how do we figure out what’s broken?

When writing any typical program we might fire up a debugger or even just use something like printf() to figure out what’s going on. Unfortunately neural networks make this very difficult for us. We can’t really step through thousands of multiplication, addition and ReLU operations and expect to glean much insight. One common debugging technique is to visualize all of the intermediate outputs and try to see if there are any obvious problems.

Let’s take a look at a histogram of the outputs of each layer before they’re passed through the ReLU non-linearity. (Remember, the ReLU operation simply chops off all negative values).

If you look closely at the above plots you’ll notice that the variance increases substantially at each layer (TensorBoard doesn’t let me adjust the scales of each plot so it’s not immediately obvious). The majority of outputs at layer1_conv are within the range [-1,1], but by the time we get to layer4_conv the outputs vary between [-20,000, 20,000]. If we continue adding layers to our network this trend will continue and eventually our network will run into problems with overflow. In general we’d prefer our intermediate outputs to remain within some fixed range.

How does this relate to our high cost? Let’s take a look at the values of our logits and  predictions. Recall that these values are calculated via:


The first thing to notice is that like the previous layers, the values of logits have a large variance with some values in the hundreds of thousands. The second thing to notice is that once we take the softmax of logits to create predictions all of our values are reduced to either 1 or 0. Recall that tf.nn.softmax takes logits and ensures that the ten values add up to 1 and that each value represents the probability a given image is represented by each digit. When some of our logits are tens of thousands of times bigger than the others, these values end up dominating the probabilities.
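To make the saturation concrete, here is a small pure-Python sketch of softmax applied to logits with a very large spread. The logit values are made up for illustration and the `softmax` helper is a stand-in for tf.nn.softmax, not its actual implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits with values in the tens and hundreds of thousands.
large_logits = [150000.0, -30000.0, 2000.0, 0.0, -500.0,
                90000.0, 10.0, -10.0, 5.0, 1.0]
predictions = softmax(large_logits)

# The largest logit takes all of the probability mass; the rest underflow to 0.
print(predictions)
```

Even the second-largest logit (90000.0) ends up with a probability of zero, because the gap to the largest logit is far too big for the exponential to represent.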

The visualization of predictions tells us that our network is super confident about the predictions it’s making. Essentially our network is claiming that it is 99% sure of its predictions. Whenever our network makes a mistake it is making a huge mistake and receives a large cost penalty for it.
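The size of that penalty can be sketched with a few lines of Python, assuming the cost is a softmax cross-entropy computed from the logits. The logit values below are invented for illustration.

```python
import math

def cross_entropy_from_logits(logits, label):
    """Softmax cross-entropy for one example: log-sum-exp minus the true logit."""
    largest = max(logits)
    log_sum_exp = largest + math.log(sum(math.exp(x - largest) for x in logits))
    return log_sum_exp - logits[label]

# Hypothetical network outputs: both favour class 0, but the true class is 3.
confident_logits = [20000.0] + [0.0] * 9
unsure_logits = [1.0] + [0.0] * 9

loss_confident_wrong = cross_entropy_from_logits(confident_logits, 3)
loss_unsure_wrong = cross_entropy_from_logits(unsure_logits, 3)
loss_confident_right = cross_entropy_from_logits(confident_logits, 0)

print(loss_confident_wrong, loss_unsure_wrong, loss_confident_right)
```

A confident mistake costs roughly as much as the gap between the winning logit and the correct one, so a handful of such mistakes in a batch is enough to produce costs in the thousands.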

The problem with increasing (magnitude) intermediate outputs translates directly into an increased cost. So how do we fix this? We want to restrict the magnitude of the intermediate outputs of our network so they don’t increase so drastically at each layer.


Smaller Initial Weights

Recall that each convolution operation takes the dot product of our weights with a portion of the input. Basically, we’re multiplying and adding up a bunch of numbers similar to the following:

w0*i0 + w1*i1 + w2*i2 + … wn*in = output


  • wx – Represents a single weight
  • ix – Represents a single input (eg. pixel)
  • n – The number of weights

One way to reduce the magnitude of this expression is to reduce the magnitude of all of our weights by some factor:

0.01*w0*i0 + 0.01*w1*i1 + 0.01*w2*i2 + … 0.01*wn*in = 0.01*output

Let’s try it and see if it works! We’ll modify the creation of our weights by multiplying them all by 0.01. Therefore layer1_weights would now be defined as:

After changing all five sets of weights (don’t forget about the fully-connected layer at the end), we can run our network and see the following test cost and accuracies:

Test Cost:  2.3025865221
Test accuracy:  5.01999998465 %

Yikes! The cost has decreased quite a bit, but that accuracy is abysmal… What’s going on this time? Let’s take a look at the intermediate outputs of the network:

If you look closely at the scales, you’ll see that this time the intermediate outputs are decreasing! The first layer’s outputs lie largely within the interval [-0.02, 0.02] while the fourth layer generates outputs that lie within [-0.0002, 0.0002]. This is essentially the opposite of the problem we saw before.

Let’s also examine the logits and predictions as we did before:


This time the logits vary over a very small interval [-0.003, 0.003] and predictions are completely uniform. The predictions appear to be centered around 0.10 which seems to indicate that our network is simply predicting each of the ten digits with 10% probability. In other words, our network is learning nothing at all and we’re in an even worse state than before!
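A quick sanity check supports this reading. Assuming the reported cost is the mean softmax cross-entropy, a network that assigns 10% to every digit should produce a cost of exactly -ln(0.1):

```python
import math

num_classes = 10

# A completely uniform prediction assigns the same probability to every digit.
uniform_probability = 1.0 / num_classes

# Cross-entropy for one example is -log(probability assigned to the true class).
cost_per_example = -math.log(uniform_probability)

print(cost_per_example)
```

This comes out at about 2.3026, which matches the test cost of 2.3025865 reported above.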


Choosing the Perfect Initial Weights

What we’ve learned so far:

  1. Large initial weights lead to very large output in intermediate layers and an over-confident network.
  2. Small initial weights lead to very small output in intermediate layers and a network that doesn’t learn anything.

So how do we choose initial weights that are not too small and not too large? In 2010, Xavier Glorot and Yoshua Bengio published Understanding the difficulty of training deep feedforward neural networks in which they proposed initializing a set of weights based on how many input and output neurons are present for a given weight. This initialization scheme is called Xavier Initialization. For more on it, see An Explanation of Xavier Initialization.

It turns out that Xavier Initialization does not work well for layers using the asymmetric ReLU activation function. So while we can use it on our fully connected layer, we shouldn’t use it for our intermediate layers. However, in 2015 Microsoft Research (Kaiming He et al.) published Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In this paper they introduced a modified version of Xavier Initialization, which TensorFlow calls Variance Scaling Initialization.

The math behind these initialization schemes is out of scope for this post, but TensorFlow makes them easy to use. I recommend simply remembering:

  1. Use Xavier Initialization in the fully-connected layers of your network. (Or layers that use softmax/tanh activation functions)
  2. Use Variance Scaling Initialization in the intermediate layers of your network that use ReLU activation functions.
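For a rough sense of scale, the sketch below computes the standard deviations suggested by the two papers: 2 / (fan_in + fan_out) as the variance for Xavier and 2 / fan_in for the ReLU variant. The helper names are hypothetical, and the distributions TensorFlow actually samples from may differ in details (for example uniform or truncated-normal sampling).

```python
import math

def fans(shape):
    """Return (fan_in, fan_out) for a dense [in, out] shape or a
    convolutional [height, width, in_depth, out_depth] shape."""
    receptive_field = 1
    for dim in shape[:-2]:
        receptive_field *= dim
    return shape[-2] * receptive_field, shape[-1] * receptive_field

def xavier_stddev(shape):
    """Glorot and Bengio: variance = 2 / (fan_in + fan_out)."""
    fan_in, fan_out = fans(shape)
    return math.sqrt(2.0 / (fan_in + fan_out))

def he_stddev(shape):
    """He et al.: variance = 2 / fan_in, intended for ReLU layers."""
    fan_in, _ = fans(shape)
    return math.sqrt(2.0 / fan_in)

conv_std = he_stddev([3, 3, 1, 64])   # first convolutional layer
fc_std = xavier_stddev([6272, 10])    # final fully-connected layer

print(conv_std, fc_std)
```

Both values sit between the two extremes tried earlier: far smaller than a standard deviation of 1, but not as small as blindly multiplying every weight by 0.01 regardless of the layer's size.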

We can modify the initialization of layer1_weights from tf.random.normal to use tf.contrib.layers.variance_scaling_initializer() as follows:

We can also modify the fully connected layer’s weights to use tf.contrib.layers.xavier_initializer as follows:

There are a few small changes to note here. First, we use tf.get_variable instead of calling tf.Variable directly. This allows us to pass in a custom initializer for our weights. Second, we have to provide a unique name for our variable. Typically I just use the same name as my variable name.

If we continue changing all the weights in our network and run it, we can see the following output:

Cost: 2.49579
Accuracy: 9.00000035763 %
Cost: 1.05762
Accuracy: 77.999997139 %
Cost: 0.110656
Accuracy: 94.9999988079 %
Test Cost: 0.0945288215741
Test accuracy: 97.2900004387 %


Much better! This is a big improvement over our previous results and we can see that both cost and accuracy have improved substantially. For the sake of curiosity, let’s look at the intermediate outputs of our network:



This looks much better. The variance of the intermediate values appears to increase only slightly as we move through the layers and all values are within about an order of magnitude of one another. While we can’t make any claims about the intermediate outputs being “perfect” or even “good”, we can at least rest assured that there are no glaringly obvious problems with them. (Sidenote: This seems to be a common theme in deep learning: We usually can’t prove we’ve done things correctly, we can only look for signs that we’ve done them incorrectly).
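The three regimes seen in this post can be reproduced with a toy simulation. This is only a sketch: it uses fully-connected ReLU layers of an arbitrary width (64) rather than our convolutional layers, and the function name is made up for illustration.

```python
import math
import random

def relu_stack_output_scale(weight_std, width=64, depth=4, seed=0):
    """Push a random input through `depth` dense ReLU layers and return the
    root-mean-square of the last layer's outputs before the ReLU."""
    rng = random.Random(seed)
    activations = [rng.gauss(0.0, 1.0) for _ in range(width)]
    pre_activations = activations
    for _ in range(depth):
        # Each output is a dot product of freshly sampled weights with the input.
        pre_activations = [
            sum(rng.gauss(0.0, weight_std) * a for a in activations)
            for _ in range(width)
        ]
        activations = [max(0.0, z) for z in pre_activations]
    return math.sqrt(sum(z * z for z in pre_activations) / width)

too_large = relu_stack_output_scale(1.0)
too_small = relu_stack_output_scale(0.01)
scaled = relu_stack_output_scale(math.sqrt(2.0 / 64))  # variance = 2 / fan_in

print(too_large, too_small, scaled)
```

With a standard deviation of 1 the outputs grow by orders of magnitude, with 0.01 they collapse towards zero, and with the fan-in based scaling they stay close to the scale of the input.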


Thoughts on Weights

Hopefully I’ve managed to convince you of the importance of choosing good initial weights for a neural network. Fortunately when it comes to image recognition, there are well-known initialization schemes that pretty much solve this problem for us.

The problems with weight initialization should highlight the fragility of deep neural networks. After all, we would hope that even if we choose poor initial weights, after enough time our gradient descent optimizer would manage to correct them and settle on good values for our weights. Unfortunately that doesn’t seem to be the case, and our optimizer instead settles into a relatively poor local minimum.


Complete Code


LTFN 5: Building a ConvNet

Part of the series Learn TensorFlow Now

In the last post we looked at the building blocks of a convolutional neural net. The convolution operation works by sliding a filter along the input and taking the dot product at each location to generate an output volume.

The parameters we need to consider when building a convolutional layer are:

1. Padding – Should we pad the input with zeroes?
2. Stride – Should we move the filter more than one pixel at a time?
3. Input depth – Each convolutional filter must have a depth that matches the input depth.
4. Number of filters – We can stack multiple filters to increase the depth of the output.

With this knowledge we can construct our first convolutional neural network. We’ll start by creating a single convolutional layer that operates on a batch of input images of size 28x28x1.

Visualization of layer1 with the corresponding dimensions marked.

We start by creating a 4-D Tensor for layer1_weights. This Tensor represents the weights of the various filters that will be used in our convolution and then trained via gradient descent. By default, TensorFlow uses the format [filter_height, filter_width, in_depth, out_depth] for convolutional filters. In this example, we’re defining 64 filters each of which has a height of 3, width of 3, and an input depth of 1.


It’s important to remember that in_depth must always match the depth of the input we’re convolving. If our images were RGB, we would have had to create filters with a depth of 3.

On the other hand, we can increase or decrease output depth simply by changing the value we specify for out_depth. This represents how many independent filters we’ll create and therefore the depth of the output. In our example, we’ve specified 64 filters and we can see layer1_conv has a corresponding depth of 64.


Stride represents how fast we move the filter along each dimension. By default, TensorFlow expects stride to be defined in terms of [batch_stride, height_stride, width_stride, depth_stride]. Typically, batch_stride and depth_stride are always 1 as we don’t want to skip over examples in a batch or entire slices of volume. In the above example, we’re using strides=[1,1,1,1] to specify that we’ll be moving the filters across the image one pixel at a time.


TensorFlow allows us to specify either SAME or VALID padding. VALID padding does not pad the image with zeroes. Specifying SAME pads the image with enough zeroes such that the output will have the same height and width dimensions as the input, assuming we’re using a stride of 1. Most of the time we use SAME padding so as not to have the output shrink at each layer of our network. To dig into the specifics of how padding is calculated, see TensorFlow’s documentation on convolutions.
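The effect of the two padding modes on output size can be sketched with the formulas from TensorFlow’s documentation. The helper name `conv_output_size` is hypothetical and only handles a single spatial dimension.

```python
import math

def conv_output_size(input_size, filter_size, stride, padding):
    """Output height (or width) of a convolution along one dimension."""
    if padding == "SAME":
        # Zero-padding is added so only the stride shrinks the output.
        return math.ceil(input_size / stride)
    if padding == "VALID":
        # No padding: the filter must fit entirely inside the input.
        return math.ceil((input_size - filter_size + 1) / stride)
    raise ValueError("padding must be 'SAME' or 'VALID'")

same_size = conv_output_size(28, 3, 1, "SAME")     # 28
valid_size = conv_output_size(28, 3, 1, "VALID")   # 26
strided_size = conv_output_size(28, 3, 2, "SAME")  # 14

print(same_size, valid_size, strided_size)
```

With VALID padding and a 3x3 filter, every layer trims two pixels from each dimension, which is why SAME padding is the usual choice for deeper networks.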


Finally, we have to remember to include a bias term for each filter. Since we’ve created 64 filters, we’ll have to create a bias term of size 64. We apply bias after performing the convolution operation, but before passing the result to our ReLU non-linearity.


Max Pooling

As the above shows, as the input flows through our network, intermediate representations (eg. layer1_out) keep the same width and height while increasing in depth. However, if we continue making deeper and deeper representations we’ll find that the number of operations we need to perform will explode. Each of the filters has to be dragged across a 28x28 input and take the dot product. As our filters get deeper this results in larger and larger groups of multiplications and additions.

Periodically we would like to downsample and compress our intermediate representations to have smaller height and width dimensions. The most common way to do this is by using a max pooling operation.

Max pooling is relatively simple. We slide a window (also called a kernel) along the input and simply take the max value at each point. As with convolutions, we can control the size of the sliding window, the stride of the window and choose whether or not to pad the input with zeroes.

Below is a simple example demonstrating max pooling on an unpadded input of 4x4 with a kernel size of 2x2 and a stride of 2:
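The same operation can be written out in a few lines of plain Python. The input values here are made up for illustration and are not taken from the original figure.

```python
def max_pool_2d(grid, kernel=2, stride=2):
    """Max pooling over an unpadded 2-D grid (a list of equal-length rows)."""
    row_starts = range(0, len(grid) - kernel + 1, stride)
    col_starts = range(0, len(grid[0]) - kernel + 1, stride)
    return [
        [
            max(grid[r + i][c + j] for i in range(kernel) for j in range(kernel))
            for c in col_starts
        ]
        for r in row_starts
    ]

# Hypothetical 4x4 input.
example_input = [
    [1, 3, 2, 1],
    [4, 6, 5, 0],
    [7, 2, 9, 8],
    [1, 0, 3, 4],
]
pooled = max_pool_2d(example_input)

print(pooled)  # [[6, 5], [7, 9]]
```

Each 2x2 window contributes only its largest value, so the 4x4 input shrinks to 2x2.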

Max pooling is the most popular way to downsample, but it’s certainly not the only way. Alternatives include average pooling, which takes the average value at each point, and vanilla convolutions with a stride of 2. For more on the latter approach see: The All Convolutional Net.

The most common form of max pooling uses a 2x2 kernel (ksize=[1,2,2,1]) and a stride of 2 in the width and height dimensions (strides=[1,2,2,1]).


Putting it all together

Finally we have all the pieces to build our first convolutional neural network. Below is a network with four convolutional layers and two max pooling layers (You can find the complete code at the end of this post).


Before diving into the code, let’s take a look at a visualization of our network from input through pool2 to get a sense of what’s going on:

Visualization of layers from input through pool2.


There are a few things worth noticing here. First, notice that in_depth of each set of convolutional filters matches the depth of the previous layers. Also note that the depth of each intermediate layer is determined by the number of filters (out_depth) at each layer.

We should also notice that every pooling layer we’ve used is a 2x2 max pooling operation using strides=[1,2,2,1]. Recall the default format for stride is [batch_stride, height_stride, width_stride, depth_stride]. This means that we slide through the height and width dimensions twice as fast as depth. This results in a shrinkage of height and width by a factor of 2. As data moves through our network, the representations become deeper with smaller width and height dimensions.

Finally, the last six lines are a little bit tricky. At the conclusion of our network we need to make predictions about which number we’re seeing. The way we do that is by adding a fully connected layer at the very end of our network. We reshape pool2 from a 7x7x128 3-D volume to a single vector with 6,272 values. Finally, we connect this vector to 10 output logits from which we can extract our predictions.
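The shape arithmetic can be traced by hand. The sketch below assumes SAME padding throughout, 64 filters in layer2 and 128 filters in layer3; only the depth of layer1 (64) and the shape of pool2 (7x7x128) are stated in this post, so the other depths are assumptions.

```python
def conv_same(shape, out_depth):
    """SAME-padded, stride-1 convolution: height and width are unchanged."""
    height, width, _ = shape
    return (height, width, out_depth)

def pool_2x2(shape):
    """2x2 max pooling with a stride of 2: height and width are halved."""
    height, width, depth = shape
    return ((height + 1) // 2, (width + 1) // 2, depth)

shape = (28, 28, 1)            # input image
shape = conv_same(shape, 64)   # layer1
shape = conv_same(shape, 64)   # layer2 (depth assumed)
shape = pool_2x2(shape)        # pool1 -> 14x14
shape = conv_same(shape, 128)  # layer3 (depth assumed)
shape = conv_same(shape, 128)  # layer4
shape = pool_2x2(shape)        # pool2 -> 7x7

flattened_size = shape[0] * shape[1] * shape[2]
fully_connected_weights_shape = (flattened_size, 10)

print(shape, flattened_size, fully_connected_weights_shape)
```

The flattened vector has 7 * 7 * 128 = 6,272 values, so the fully connected layer needs a weight matrix of shape [6272, 10].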

With everything in place, we can run our network and take a look at how well it performs:

Cost: 979579.0
Accuracy: 7.0000000298 %
Cost: 174063.0
Accuracy: 23.9999994636 %
Cost: 95255.1
Accuracy: 47.9999989271 %


Cost: 10001.9
Accuracy: 87.9999995232 %
Cost: 16117.2
Accuracy: 77.999997139 %
Test Cost: 15083.0833307
Test accuracy: 81.8799999356 %


Yikes. There are two things that jump out at me when I look at these numbers:

  1. The cost seems very high despite achieving a reasonable result.
  2. The test accuracy has decreased when compared to our fully-connected network which achieved an accuracy of ~86%


So are convolutional nets broken? Was all this effort for nothing? Not quite. Next time we’ll look at an underlying problem with how we’re choosing our initial random weight values and an improved strategy that should improve our results beyond that of our fully-connected network.


Complete Code


LTFN 4: Intro to Convolutional Neural Networks

Part of the series Learn TensorFlow Now

The neural networks we’ve built so far have had a relatively simple structure. The input to each layer is fully connected to the output of the previous layer. For this reason, these layers are commonly called fully connected layers.

Two fully connected layers in a neural network.

This has been mathematically convenient because we’ve been able to represent each layer’s output as a matrix multiplication of the previous layer’s output (a vector) with the current layer’s weights.

However, as we build more complex networks for image recognition, there are certain properties we want that are difficult to get from fully connected layers. Some of these properties include:

  1. Translational Invariance – A fancy phrase for “A network trained to recognize cats should recognize cats equally well if they’re in the top left of the picture or the bottom right of the picture”. If we move the cat around the image, we should still expect to recognize it.

    Translational invariance suggests we should recognize objects regardless of where they’re located in the image.
  2. Local Connectivity – This means that we should take advantage of features within a certain area of the image. Remember that in previous posts we treated the input as a single row of pixels. This meant that local features (e.g. edges, curves, loops) are very hard for our networks to identify and pick out. Ideally our network should try to identify patterns that occur within local regions of the image and use these patterns to influence its predictions.


Today we’re going to look at one of the most successful classes of neural networks: Convolutional Neural Networks. Convolutional Neural Networks have been shown to give us both translational invariance and local connectivity.

The building block of a convolutional neural network is a convolutional filter. It is a square (typically 3x3) set of weights. The convolutional filter looks at pieces of the input of the same shape. As it does, it takes the dot product of the weights with the input and saves the result in the output. The convolutional filter is dragged along the entire input until the entire input has been covered. Below is a simple example with a (random) 5x5 input and a (random) 3x3 filter.


So why is this useful? Consider the following examples with a vertical line in the input and a 3×3 filter with weights chosen specifically to detect vertical edges.

Vertical edge detection from light-to-dark.
Vertical edge detection from dark-to-light.

We can see that with hand-picked weights, we’re able to generate patterns in the output. In this example, light-to-dark transitions produce large positive values while dark-to-light transitions produce large negative values. Where there is no change at all, the filter will simply produce zeroes.
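The same behaviour can be reproduced in plain Python. The filter weights and pixel values below are assumptions chosen for illustration (the figures' exact values are not reproduced here), and the convolution is unpadded with a stride of 1.

```python
def conv2d_valid(image, kernel):
    """Slide a square kernel over a 2-D image (no padding, stride 1) and
    take the dot product at each location."""
    k = len(kernel)
    out_height = len(image) - k + 1
    out_width = len(image[0]) - k + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(k) for j in range(k))
            for c in range(out_width)
        ]
        for r in range(out_height)
    ]

# Hypothetical vertical edge filter: positive on the left, negative on the right.
vertical_edge_filter = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]

light_to_dark = [[10, 10, 10, 0, 0] for _ in range(5)]
dark_to_light = [[0, 0, 10, 10, 10] for _ in range(5)]
no_edge = [[10] * 5 for _ in range(5)]

print(conv2d_valid(light_to_dark, vertical_edge_filter))
print(conv2d_valid(dark_to_light, vertical_edge_filter))
print(conv2d_valid(no_edge, vertical_edge_filter))
```

With these weights the light-to-dark image yields positive values wherever the filter straddles the edge, the dark-to-light image yields negative values, and the uniform image yields zeroes everywhere.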

While we’ve chosen the above filter’s weights manually, it turns out that training our network via gradient descent ends up selecting very good weights for these filters. As we add more convolutional layers to our network they begin to be able to recognize more abstract concepts such as faces, whiskers, wheels etc.



You may have noticed that the output above has a smaller width and height than the original input. If we pass this output to another convolutional layer it will continue to shrink. Without dealing with this shrinkage, we’ll find that this puts an upper bound on how many convolutional layers we can have in our network.

SAME Padding

The most common way to deal with this shrinkage is to pad the entire image with enough zeroes such that the output shape will have the same width and height as the input. This is called SAME padding and allows us to continue passing the output to more and more convolutional layers without worrying about shrinking width and height dimensions. Below we take our first example (5×5 input) and pad it with zeroes to make sure the output is still 5×5.


A 5x5 input padded with zeroes to generate a 5x5 output.

VALID Padding

VALID padding does not pad the input with anything. It probably would have made more sense to call it NO padding or NONE padding.

VALID padding results in shrinkage in width and height.



So far we’ve been moving the convolutional filter across the input one pixel at a time. In other words, we’ve been using a stride=1. Stride refers to the number of pixels we move the filter in the width and height dimension every time we compute a dot-product.  The most common stride value is stride=1, but certain algorithms require larger stride values. Below is an example using stride=2.

Notice that larger stride values result in larger decreases in output height and width. Occasionally this is desirable near the start of a network when working with larger images. Smaller input width and height can make the calculations more manageable in deeper layers of the network.


Input Depth

In our previous examples we’ve been working with inputs that have variable height and width dimensions, but no depth dimension. However, some images (e.g. RGB) have depth, and we need some way to account for it. The key is to extend our filter’s depth dimension to match the depth dimension of the input.

Unfortunately, I lack the animation skills to properly show an animated example of this, but the following image may help:

Convolution over an input with a depth of 2 using a single filter with a depth of 2.

Above we have an input of size 5x5x2 and a single filter of size 3x3x2. The filter is dragged across the input and once again the dot product is taken at each point. The difference here is that there are 18 values being added up at each point (9 from each depth of the input image). The result is an output with a single depth dimension.


Output Depth

We can also control the output depth by stacking up multiple convolutional filters. Each filter acts independently of the others while computing its results and then all of the results are stacked together to create the output. This means we can control output depth simply by adding or removing convolutional filters.

Two convolutional filters result in an output depth of two.

It’s very important to note that there are two distinct convolutional filters above. The weights of each convolutional filter are distinct from the weights of the other convolutional filter. Each of these two filters has a shape of 3x3x2. If we wanted to get a deeper output, we could continue stacking more of these 3x3x2 filters on top of one another.

Imagine for a moment that we stacked four convolutional filters on top of one another, each with a set of weights trained to recognize different patterns. One might recognize horizontal edges, one might recognize vertical edges, one might recognize diagonal edges from top-left to bottom-right and one might recognize diagonal edges from bottom-left to top-right. Each of these filters would produce one depth layer of the output with values where their respective edges were detected. Later layers of our network would be able to act on this information and build up even more complex representations of the input.


Next up

There is a lot to process in this post. We’ve seen a brand new building block for our neural networks called the convolutional filter and a myriad of ways to customize it. In the next post we’ll implement our first convolutional neural network in TensorFlow and try to better understand practical ways to use this building block to build a better digit recognizer.

LTFN 3: Deeper Networks

Part of the series Learn TensorFlow Now

In the last post, we saw our network achieve about 60% accuracy. One common way to improve a neural network’s performance is to make it deeper. Before we start adding layers to our network, it’s worth taking a moment to explore one of the key advantages of deep neural networks.

Historically, a lot of effort was invested in crafting hand-engineered features that could be fed to shallow networks (or other learning algorithms). In image detection we might modify the input to highlight horizontal or vertical edges. In voice recognition we might filter out noise or various frequencies not typically found in human speech. Unfortunately, hand-engineering features often required years of expertise and lots of time.

Below is a network created with TensorFlow Playground that demonstrates this point. By feeding modified versions of the input to a shallow network, we are able to train it to recognize a non-linear spiral pattern.

A shallow network requires various modifications to the input features to classify the “Swiss Roll” problem.

A shallow network is capable of learning complex patterns only when fed modified versions of the input. A key idea behind deep learning is to do away with hand-engineered features whenever possible. Instead, by making the network deeper, we can convince the network to learn the features it really needs to solve the problem. In image recognition, the first few layers of the network learn to recognize simple features (eg. edge detection), while deeper layers respond to more complex features (eg. human faces). Below, we’ve made the network deeper and removed all dependencies on additional features.

A deep network is capable of classifying the points in a “Swiss Roll” using only the original input.


Making our network deeper

Let’s try making our network deeper by adding two more layers. We’ll replace layer1_weights and layer1_bias with the following:

Note: When discussing the network’s shapes, I ignore the batch dimension. For example, where a shape is [None, 784] I will refer to it as a vector with 784 elements. I find it helps to imagine a batch size of 1 to avoid having to think about more complex shapes.

The first thing to notice is the change in shape. layer1 now accepts an input of 784 values and produces an intermediate vector layer1_output with 500 elements. We then take these 500 values through layer2 which also produces an intermediate vector layer2_output with 500 elements. Finally, we take these 500 values through layer3 and produce our logit vector with 10 elements.

Why did I choose 500 elements? No reason, it was just an arbitrary value that seemed to work. If you’re following along at home, you could try adding more layers or making them wider (ie. use a size larger than 500).


Another important change is the addition of tf.nn.relu() in layer1 and layer2. Note that it is applied to the result of the matrix multiplication of the previous layer’s output with the current layer’s weights.

So what is a ReLU? ReLU stands for “Rectified Linear Unit” and is an activation function. An activation function is applied to the output of each layer of a neural network. It turns out that if we don’t include activation functions, it can be mathematically shown (by people much smarter than me) that our three layer network is equivalent to a single layer network. This is obviously a BadThing™ as it means we lose all the advantages of building a deep neural network.
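A tiny numeric example illustrates the collapse. This is a sketch with made-up 2x2 weights and no bias terms: two layers without an activation give exactly the same result as one layer whose weights are the product of the two, while inserting a ReLU breaks that equivalence.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def relu(matrix):
    """Chop off all negative values."""
    return [[max(0.0, value) for value in row] for row in matrix]

# Hypothetical input and weights.
x = [[1.0, -2.0]]
w1 = [[1.0, -1.0], [0.5, 2.0]]
w2 = [[2.0, 0.0], [-1.0, 1.0]]

two_linear_layers = matmul(matmul(x, w1), w2)
single_layer = matmul(x, matmul(w1, w2))
with_relu = matmul(relu(matmul(x, w1)), w2)

print(two_linear_layers, single_layer, with_relu)
```

Here the two linear layers and the single combined layer both produce [[5.0, -5.0]], while the version with a ReLU in between produces [[0.0, 0.0]].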

I’m (very obviously) glossing over the details here, so if you’re new to neural networks and want to learn more see: Why do you need non-linear activation functions?

Other historical activation functions include sigmoid and tanh. These days, ReLU is almost always the right choice of activation function and we’ll be using it exclusively for our networks.

Graphs for ReLU, sigmoid and tanh functions


Learning Rate

Finally, one other small change needs to be made: The learning rate needs to be changed from 0.01 to 0.0001. Learning rate is one of the most important, but most finicky hyperparameters to choose when training your network. Too small and the network takes a very long time to train, too large and your network doesn’t converge. In later posts we’ll look at methods that can help with this, but for now I’ve just used the ol’ fashioned “Guess and Check” method until I found a learning rate that worked well.
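The trade-off is easy to see on a toy problem. The sketch below runs plain gradient descent on f(w) = w^2, whose minimum is at w = 0; the learning rates are chosen to illustrate the three regimes and have nothing to do with the values used for our network.

```python
def gradient_descent(learning_rate, steps=100, start=1.0):
    """Minimize f(w) = w**2 with plain gradient descent and return the final w."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * w  # derivative of w**2
        w -= learning_rate * gradient
    return w

too_small = gradient_descent(0.0001)  # barely moves away from the start
reasonable = gradient_descent(0.1)    # converges towards the minimum at 0
too_large = gradient_descent(1.1)     # overshoots further on every step

print(too_small, reasonable, too_large)
```

On a real network the cost surface is far more complicated, but the same three outcomes show up: painfully slow progress, convergence, or divergence.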


Alchemy of Hyperparameters

We’ve started to see a few hyperparameters that we must choose when building a neural network:

  • Number of layers
  • Width of layers
  • Learning rate

It’s an uncomfortable reality that we have no good way to choose values for these hyperparameters. What’s worse is that we typically can’t explain why a certain hyperparameter value works well and others do not. The only reassurance I can offer is:

  1. Other people think this is a problem
  2. As you build more networks, you’ll develop a rough intuition for choosing hyperparameter values


Putting it all together

Now that we’ve chosen a learning rate and created more intermediate layers, let’s put it all together and see how our network performs.




After running this code you should see output similar to:

Cost:  4596.864
Accuracy:  7.999999821186066 %
Cost:  882.4881
Accuracy:  30.000001192092896 %
Cost:  609.4177
Accuracy:  51.99999809265137 %
Cost:  494.5303
Accuracy:  56.00000023841858 %


Cost:  57.793114
Accuracy:  89.99999761581421 %
Cost:  148.92995
Accuracy:  81.00000023841858 %
Cost:  67.42319
Accuracy:  89.99999761581421 %
Test Cost:  107.98408660641905
Test accuracy:  85.74999994039536 %


Our network has improved from 60% accuracy to 85% accuracy. This is great progress; clearly things are moving in the right direction! Next week we’ll look at a more complicated neural network structure called a “Convolutional Neural Network” which is one of the basic building blocks of today’s top image classifiers.

For the sake of completeness, I’ve included a TensorBoard visualization of the network we’ve created below:

Visualization of our three-layer network with `layer1` expanded. Notice the addition of `layer1_output` following the addition with `layer1_bias`. This represents the ReLU activation function.





LTFN 2: Graphs and Shapes

Part of the series Learn TensorFlow Now

TensorFlow Graphs

Before we improve our network, we have to take a moment to chat about TensorFlow graphs. As we saw in the previous post, we follow two steps when using TensorFlow:

  1. Create a computational graph
  2. Run data through the graph using session.run()

Let’s take a look at what’s actually happening when we call session.run(). Consider our graph and session code from last time:

When we pass optimizer and cost to session.run(), TensorFlow looks at the dependencies for these two nodes. For example, we can see above that optimizer depends on:

  • cost
  • layer1_weights
  • layer1_bias
  • input

We can also see that cost depends on:

  • logits
  • labels

When we wish to evaluate optimizer and cost, TensorFlow first runs all the operations defined by the previous nodes, then calculates the required results and returns them. Since every node ends up being a dependency of optimizer and cost, this means that every operation in our TensorFlow graph is executed with every call to session.run().

But what if we don’t want to run every operation? If we want to pass test data to our network, we don’t want to run the operations defined by optimizer. (After all, we don’t want to train our network on our test set!) Instead, we’d just want to extract predictions from logits. In that case, we could instead run our network as follows:

This would execute only the subset of nodes required to compute the values of logits, highlighted below:

Our computational graph with only dependencies of logits highlighted in orange.

Note: As labels is not one of the dependencies of logits we don’t need to provide it.
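The idea of running only what is needed can be modelled with a toy dependency walk. This is not how TensorFlow is implemented; it is a sketch in which the graph is a plain dictionary, and the dependencies listed for logits are an assumption based on the network from the previous post.

```python
def required_nodes(graph, fetches):
    """Return every node that must be evaluated to compute the fetched nodes."""
    needed = set()
    stack = list(fetches)
    while stack:
        node = stack.pop()
        if node not in needed:
            needed.add(node)
            stack.extend(graph.get(node, []))
    return needed

# Toy version of our graph: each node maps to the nodes it depends on.
graph = {
    "optimizer": ["cost", "layer1_weights", "layer1_bias", "input"],
    "cost": ["logits", "labels"],
    "logits": ["input", "layer1_weights", "layer1_bias"],  # assumed
}

training_nodes = required_nodes(graph, ["optimizer", "cost"])
inference_nodes = required_nodes(graph, ["logits"])

print(sorted(training_nodes))
print(sorted(inference_nodes))
```

Fetching logits never touches labels, cost or optimizer, which is why we can evaluate test data without providing labels and without training on it.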

Understanding the dependencies of the computational graphs we create is important. We should always try to be aware of exactly what operations will be running when we call session.run() to avoid accidentally running the wrong operations.



Another important topic to understand is how TensorFlow shapes work. In our previous post all our shapes were completely defined. Consider the following tf.placeholder definitions for input and labels:

We have defined these tensors to have a 2-D shape of precisely (100, 784) and (100, 10). This restricts us to a computational graph that always expects 100 images at a time. What if we have a training set that isn’t divisible by 100? What if we want to test on single images?

The answer is to use dynamic shapes. In places where we’re not sure what shape we would like to support, we just substitute in None. For example, if we want to allow variable batch sizes, we simply write:

Now we can pass in batch sizes of 1, 10, 283 or any other size we’d like. From this point on, we’ll be defining all of our placeholders in this fashion.
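One way to think about None in a shape is as a wildcard for that dimension. The sketch below models that rule in plain Python; shape_is_compatible is a hypothetical helper written for illustration and is not part of TensorFlow.

```python
def shape_is_compatible(declared, actual):
    """Check an actual shape against a declared one; None matches any size."""
    if len(declared) != len(actual):
        return False
    return all(d is None or d == a for d, a in zip(declared, actual))


fixed_shape = (100, 784)     # always expects exactly 100 images
dynamic_shape = (None, 784)  # any batch size, 784 values per image

for batch_size in (1, 10, 100, 283):
    print(batch_size,
          shape_is_compatible(fixed_shape, (batch_size, 784)),
          shape_is_compatible(dynamic_shape, (batch_size, 784)))
```

Only the batch dimension is left open here. The 784 values per image still have to match exactly.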



One important question remains: “How well is our network doing?” In the previous post, we saw cost decreasing, but we had no concrete metric against which we could compare our network. We’ll keep things simple and use accuracy as our metric. We just want to measure the fraction of correct predictions:

In the first line, we convert logits to a set of predictions using tf.nn.softmax. Remember that our labels are 1-hot encoded, meaning each one contains 10 numbers, one of which is 1. logits is the same shape, but the values in logits can be almost anything (e.g. values in logits could be -4, 234, 0.5 and so on). We want our predictions to have a few qualities that logits does not possess:

  1. The sum of the values in predictions for a given image should be 1
  2. No values in predictions should be greater than 1
  3. No values in predictions should be negative
  4. The highest value in predictions will be our prediction for a given image. (We can use argmax to find this)

Applying tf.nn.softmax() to logits gives us these desired properties. For more details on softmax, watch this video by Andrew Ng.

The second line takes the argmax of our predictions and of our labels. Then tf.equal creates a vector that contains True where the values match and False where they don’t.

Finally, we use tf.reduce_mean to calculate the fraction of predictions we got correct for this batch. We store this result in accuracy.
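The three steps above (softmax, argmax comparison, mean) can be sketched without TensorFlow. This is a plain-Python illustration under simplifying assumptions: it uses 3 classes instead of 10, and the helper names are made up for this example.

```python
import math


def softmax(logits):
    """Turn arbitrary scores into non-negative values that sum to 1."""
    largest = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


def batch_accuracy(batch_logits, batch_labels):
    """Fraction of examples whose predicted class matches the label."""
    matches = [argmax(softmax(logits)) == argmax(label)
               for logits, label in zip(batch_logits, batch_labels)]
    return sum(matches) / len(matches)


example_logits = [[-4.0, 234.0, 0.5], [2.0, 1.0, 0.1]]
example_labels = [[0, 1, 0], [0, 0, 1]]  # 1-hot encoded

predictions = [softmax(row) for row in example_logits]
accuracy = batch_accuracy(example_logits, example_labels)
print(predictions)
print("Accuracy:", accuracy * 100, "%")
```

The first example is classified correctly and the second is not, so the accuracy for this tiny batch is 50%.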

Putting it all together

Now that we better understand TensorFlow graphs and shapes, and have a metric with which to judge our algorithm, let’s put it all together to evaluate our performance on the test set after training has finished.

Note that almost all of the new code relates to running the test set.

One question you might ask is: Why not just predict all the test images at once, in one big batch of 10,000? The problem is that when we train larger networks on our GPU, we won’t be able to fit all 10,000 images and the required operations in our GPU’s memory. Instead we have to process the test set in batches similar to how we train the network.
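Here is a rough sketch of evaluating in batches, assuming we only need the overall accuracy. count_correct is a hypothetical stand-in for running one batch through the network, and the toy data is made up so that the example is self-contained.

```python
def iterate_batches(examples, batch_size):
    """Yield consecutive slices of at most `batch_size` examples."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


def count_correct(batch):
    """Hypothetical stand-in for running one batch through a network."""
    return sum(1 for predicted, actual in batch if predicted == actual)


def evaluate_in_batches(examples, batch_size):
    """Accumulate correct predictions batch by batch, then normalise once."""
    total_correct = 0
    for batch in iterate_batches(examples, batch_size):
        total_correct += count_correct(batch)
    return total_correct / len(examples)


# Toy data: (predicted, actual) pairs where every fourth prediction is wrong.
toy_examples = [(0, 1) if i % 4 == 0 else (1, 1) for i in range(250)]
batch_sizes = [len(b) for b in iterate_batches(toy_examples, 100)]
test_accuracy = evaluate_in_batches(toy_examples, 100)
print(batch_sizes, test_accuracy)
```

Summing the number of correct predictions, rather than averaging per-batch accuracies, keeps the result right even when the last batch is smaller than the others.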

Finally, let’s run it and look at the output. When I run it on my local machine I receive the following:

Cost:  20.207457
Accuracy:  7.999999821186066 %
Cost:  10.040323
Accuracy:  14.000000059604645 %
Cost:  8.528659
Accuracy:  14.000000059604645 %
Cost:  6.8867884
Accuracy:  23.999999463558197 %
Cost:  7.1556334
Accuracy:  21.99999988079071 %
Cost:  6.312024
Accuracy:  28.00000011920929 %
Cost:  4.679361
Accuracy:  34.00000035762787 %
Cost:  5.220028
Accuracy:  34.00000035762787 %
Cost:  5.167577
Accuracy:  23.999999463558197 %
Cost:  3.5488296
Accuracy:  40.99999964237213 %
Cost:  3.2974648
Accuracy:  43.00000071525574 %
Cost:  3.532155
Accuracy:  46.99999988079071 %
Cost:  2.9645846
Accuracy:  56.00000023841858 %
Cost:  3.0816755
Accuracy:  46.99999988079071 %
Cost:  3.0201495
Accuracy:  50.999999046325684 %
Cost:  2.7738256
Accuracy:  60.00000238418579 %
Cost:  2.4169116
Accuracy:  55.000001192092896 %
Cost:  1.944017
Accuracy:  60.00000238418579 %
Cost:  3.5998762
Accuracy:  50.0 %
Cost:  2.8526196
Accuracy:  55.000001192092896 %
Test Cost:  2.392377197146416
Test accuracy:  59.48999986052513 %

So we’re getting a test accuracy of ~60%. This is better than chance, but it’s not as good as we’d like it to be. In the next post, we’ll look at different ways of improving the network.

LTFN 1: Intro to TensorFlow

Part of the series Learn TensorFlow Now

Over the next few posts, we’ll build a neural network that accurately reads handwritten digits. We’ll go step-by-step, starting with the basics of TensorFlow and ending up with one of the best networks in the ILSVRC 2013 image recognition competition.

MNIST Dataset

The MNIST dataset is one of the simplest image datasets and makes for a perfect starting point. It consists of 70,000 images of handwritten digits. Our goal is to build a neural network that can identify the digit in a given image.

  • 60,000 images in the training set
  • 10,000 images in the test set
  • Size: 28×28 (784 pixels)
  • 1 Channel (i.e. not RGB)

Sample images from MNIST

To start, we’ll import TensorFlow and our dataset:

TensorFlow makes it easy for us to download the MNIST dataset and save it locally. Our data has been split into a training set on which our network will learn and a test set against which we’ll check how well we’ve done.

Note: The labels are represented using one-hot encoding which means:

0 is represented by 1 0 0 0 0 0 0 0 0 0
1 is represented by 0 1 0 0 0 0 0 0 0 0

9 is represented by 0 0 0 0 0 0 0 0 0 1
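As a small illustration, one-hot encoding and decoding can be written in a few lines of plain Python. The one_hot and decode helpers are hypothetical names for this sketch; the dataset loader already provides labels in this form.

```python
def one_hot(digit, num_classes=10):
    """Encode a digit as a list of zeros with a single 1 at index `digit`."""
    if not 0 <= digit < num_classes:
        raise ValueError("digit must be between 0 and num_classes - 1")
    encoding = [0] * num_classes
    encoding[digit] = 1
    return encoding


def decode(encoding):
    """Recover the digit from its one-hot encoding."""
    return encoding.index(1)


for digit in (0, 1, 9):
    print(digit, "is represented by", *one_hot(digit))
```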

Note: By default, the images are represented as arrays of 784 values. Below is a sample of what this might look like for a given image:

TensorFlow Graphs

There are two steps to follow when training our own neural networks with TensorFlow:

  1. Create a computational graph
  2. Run data through the graph so our network can learn or make predictions

Creating a Computational Graph

We’ll start by creating the simplest possible computational graph. Notice in the following code that there is nothing that touches the actual MNIST data. We are simply creating a computational graph so that we may later feed our data to it.

For first-time TensorFlow users there’s a lot to unpack in the next few lines, so we’ll take it slow.

Before explaining anything, let’s take a quick look at the network we’ve created. Below are two different visualizations of this network at different granularities that tell slightly different stories about what we’ve created.

Left: A functional visualization of our single layer network. The 784 input values are each multiplied by a weight which feeds into our ten logits.
Right: The graph created by TensorFlow, including nodes that represent our optimizer and cost.

The first two lines of our code simply define a TensorFlow graph and tell TensorFlow that all the following operations we define should be included in this graph.

Next, we use tf.placeholder to create two “Placeholder” nodes in our graph. These are nodes for which we’ll provide values every time we run our network. Our placeholders are:

  • input which will contain batches of 100 images, each with 784 values
  • labels which will contain batches of 100 labels, each with 10 values

Next we use tf.Variable to create two new nodes, layer1_weights and layer1_bias. These represent parameters that the network will adjust as we show it more and more examples. To start, we’ve made layer1_weights completely random, and layer1_bias all zero. As we learn more about neural networks, we’ll see that these aren’t the greatest choice, but they’ll work for now.

After creating our weights, we use tf.matmul to matrix multiply our input against them and + to add our bias to the result. Note that + is simply a convenient shorthand for tf.add. We store the result of this operation in logits and will consider the output node with the highest value to be our network’s prediction for a given example.

Now that we’ve got our predictions, we want to compare them to the labels and determine how far off we were. We’ll do this by taking the softmax of our output and then use cross entropy as our measure of “loss” or cost. We can perform both of these steps using tf.nn.softmax_cross_entropy_with_logits. Now we’ve got a measure of loss for all the examples in our batch, so we’ll just take the mean of these as our final cost.
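To see what “softmax followed by cross entropy, then the mean” computes, here is a plain-Python sketch. It is an illustration of the math only, not TensorFlow’s implementation, and the function names are made up for this example.

```python
import math


def log_softmax(logits):
    """Log of the softmax, computed in a numerically stable way."""
    largest = max(logits)
    log_total = largest + math.log(sum(math.exp(x - largest) for x in logits))
    return [x - log_total for x in logits]


def cross_entropy(logits, one_hot_label):
    """Loss for one example: minus the log-probability of the true class."""
    log_probs = log_softmax(logits)
    return -sum(y * lp for y, lp in zip(one_hot_label, log_probs))


def mean_cost(batch_logits, batch_labels):
    """Average the per-example losses into a single cost for the batch."""
    losses = [cross_entropy(l, y) for l, y in zip(batch_logits, batch_labels)]
    return sum(losses) / len(losses)


confident_and_right = cross_entropy([0.0, 5.0, 0.0], [0, 1, 0])
confident_and_wrong = cross_entropy([0.0, 5.0, 0.0], [1, 0, 0])
undecided = cross_entropy([0.0, 0.0, 0.0], [0, 1, 0])
print(confident_and_right, undecided, confident_and_wrong)
```

A confident correct prediction gives a loss close to zero, an undecided one gives log(3) for three classes, and a confident wrong prediction is penalised most heavily.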

The final step is to define an optimizer. This creates a node that is responsible for automatically updating the tf.Variables (weights and biases) of our network in an effort to minimize cost. We’re going to use the most vanilla of optimizers: tf.train.GradientDescentOptimizer. Note that we have to provide a learning_rate to our optimizer. Choosing an appropriate learning rate is one of the difficult parts of training any new network. For now we’ll arbitrarily use 0.01 because it seems to work reasonably well.
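The update rule behind plain gradient descent is short enough to sketch by hand. In the example below the cost function and its gradients are toy assumptions written out manually; in TensorFlow the optimizer works out the gradients of cost for us.

```python
def gradient_descent_step(weights, gradients, learning_rate):
    """Move each weight a small step against its gradient."""
    return [w - learning_rate * g for w, g in zip(weights, gradients)]


def toy_cost(weights):
    """Toy cost for illustration: the sum of squared weights."""
    return sum(w * w for w in weights)


def toy_gradients(weights):
    """Gradient of toy_cost with respect to each weight."""
    return [2 * w for w in weights]


weights = [3.0, -2.0]
costs = []
for step in range(5):
    costs.append(toy_cost(weights))
    weights = gradient_descent_step(weights, toy_gradients(weights),
                                    learning_rate=0.01)
print(costs)
```

With a learning rate of 0.01 the cost shrinks a little on every step. A much larger rate could overshoot the minimum, which is part of why choosing it is difficult.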


Running our Neural Network

Now that we’ve created the network it’s time to actually run it. We’ll pass 100 images and labels to our network and watch as the cost decreases.

The first line creates a TensorFlow Session for our graph. The session is used to actually run the operations defined in our graph and produce results for us.

The second line initializes all of our tf.Variables. In our example, this means choosing random values for layer1_weights and setting layer1_bias to all zeros.

Next, we create a loop that will run for 1,000 training steps with a batch_size of 100. The first three lines of the loop simply select out 100 images and labels at a time. We store batch_images and batch_labels in feed_dict. Note that the keys of this dictionary, input and labels, correspond to the tf.placeholder nodes we defined when creating our graph. These names must match, and all placeholders must have a corresponding entry in feed_dict.

Finally, we run the network using session.run(), where we pass in feed_dict. Notice that we also pass in optimizer and cost. This tells TensorFlow to evaluate these nodes and to store the results from the current run in o and c. In the next post, we’ll touch more on this method, and how TensorFlow executes operations based on the nodes we supply to it here.


Now that we’ve put it all together, let’s look at the (truncated) output:

 Cost: 12.673884
 Cost: 11.534428
 Cost: 8.510129
 Cost: 9.842179
 Cost: 11.445622
 Cost: 8.554568
 Cost: 9.342157
 Cost: 4.811098
 Cost: 4.2431364
 Cost: 3.4888883
 Cost: 3.8150232
 Cost: 4.206609
 Cost: 3.2540445

Clearly the cost is going down, but we still have many unanswered questions:

  • What is the accuracy of our trained network?
  • How do we know when to stop training? Was 1,000 steps enough?
  • How can we improve our network?
  • How can we see what its predictions actually were?

We’ll explore these questions in the next few posts as we seek to improve our performance.


2017: A Retrospective

2017 saw a lot of change for me. I left Microsoft, returned to Toronto and shifted my focus from developer tools to machine learning. Following in patio11’s footsteps, I wanted to take a moment to reflect on the year and clarify my thoughts in written form.


July 2016 – July 2017

In July 2016, my company Code Connect joined Microsoft with the stated goal of integrating our Alive extension into Visual Studio 2017. Unfortunately our visas were delayed and we weren’t able to begin work until early September. It didn’t make sense to rush Alive into Visual Studio 2017 and risk introducing stability problems so we held off for a few months.

In the meantime, we conducted a number of experiments to try to quantify demand for Alive so we could compare it to other potential features. When the experiments concluded, the data didn’t demonstrate an immediate need for a product like Alive. Alive was put on ice and we worked on features that were immediately pressing to Visual Studio’s success. At the time this meant focusing on accessibility, a top-down directive from Satya Nadella himself.

After Alive was sunset, I did some reflection on where I wanted to take my career and what I wanted to focus on. For a long time I’ve worried that the knowledge I’ve accumulated writing Visual Studio extensions is not immediately applicable outside of Visual Studio. Despite working on developer tools for almost five years, I would be on near-equal footing with a junior developer when it comes to developing extensions for VS Code or Eclipse. Much of my expertise comes in the form of random bits of trivia about Visual Studio. The problems I was faced with were challenging, but not very interesting.

The more I complained, the more I realized something had to change. In July 2017 I left Microsoft and applied for a position at a company for which I’d previously been an intern. However I didn’t have the C++ experience required for the roles they were staffing so we concluded it probably wouldn’t be a great fit.

While at Microsoft I began to learn about machine learning in my free time. In 2015 I’d watched in awe from the sidelines as AlphaGo crushed its human counterparts and I wanted in. I decided that now was the time for me to focus exclusively on machine learning and deep learning in particular.

Would I do it again?

Yup. Alive’s best chance for the long term was a home at Microsoft. We had a modest number of paying customers but required an order of magnitude more in order to grow and continue full-time work on Alive. We had grand dreams of bringing Alive to other languages and we wouldn’t have been able to do so without hiring more developers. Microsoft came to us at the perfect time and gave Alive one last shot at success. I’m sad it didn’t work out, but I’m eternally grateful to everyone involved in getting us to Microsoft and to the Visual Studio editor team for being my home for the year.

Machine Learning

August 2017 – Ongoing

The hardest thing about starting something brand new is figuring out where to start. I settled on a mix of linear algebra, Coursera, Kaggle and open-source work.

In September Andrew Ng launched a new deep learning course that covered the following:

I’ve completed the first four and am waiting for the final course on Sequence Models to launch in the coming month. These courses paired nicely with Andrej Karpathy’s CS231n videos. This course will be my go-to answer for the question “I want to learn about neural networks, where do I start?”


I wanted to apply the lessons I learned in Andrew’s videos to my own neural networks. I set out to compete in the introductory Kaggle competition “MNIST Digit Recognizer”. As I learned more and more about neural networks I would apply these lessons to my network and watch the score improve. Being told “batch normalization will improve your results” is one thing, but watching your score tick higher is something else altogether. As of this writing my best submission puts me in the top 25% of submissions with 99.171% accuracy.

My top submission.



I set a personal goal to contribute at least one pull request to TensorFlow so as to better understand the tool I was using. Coming from a .NET Desktop background, there was a bit of a learning curve when it came to tools like bazel and Docker. However, like most things in software development these tools just require a bit of time and focused energy to understand.

I’ve seen mixed success with my contributions. My first pull request was a correction to TensorFlow’s implementation of the Inception network. The reviewers agreed that the initial model was incorrect, but are hesitant to change the model due to backward compatibility concerns.

My second pull request improved support for various image operations in TensorFlow. In short, it made it easier to augment multiple images at once (e.g. randomly flipping images left-to-right). Unfortunately, I introduced some performance regressions and my changes had to be reverted. 😢

My third pull request is a re-implementation of the last, while avoiding the performance regressions. It remains open, but I’m confident that after some work it will be accepted.

On the whole, I’m pleased with the progress I made with TensorFlow. The API surface is massive and I have a lot to learn, but I’m making real, measurable progress. I’ll continue to contribute back code where appropriate.

ICLR 2018 Reproducibility Challenge

At the end of each course, Andrew Ng took time to interview famous names in machine learning such as Geoffrey Hinton and Ian Goodfellow. One piece of advice they shared for newcomers was “Reproduce papers”. Around the same time, I stumbled upon the ICLR 2018 Reproducibility Challenge where students are challenged to reproduce the results of papers submitted to the ICLR conference.

I signed up and chose the paper “Super-Convergence: Very Fast Training of Residual Networks Using Large Learning Rates”. This paper proposed a method for training certain neural networks an order of magnitude faster than previous methods allowed. Their approach involved varying the learning rate linearly between (what are typically considered) large values throughout training.

This was the hardest portion of my work thus far and forced me to delve into the details of TensorFlow. In December I made my report available in the comments of their paper’s submission. The TensorFlow portion of my work is available on GitHub at:


My only regret during 2017 was that I published zero blog posts. As such, this was the first year that traffic to my blog decreased.

2017 saw a modest decline in blog traffic


I often tell others to start blogging, and this year my actions didn’t match my words. I attribute my poor track record to one part laziness and one part lack of confidence. It’s surprisingly difficult to work up the courage to write about a subject when you’re brand new to it.

Goals for 2018

  • Follow Jeff Atwood’s advice for bloggers and stick to a schedule for blogging. I want to buckle down and write one blog post a week in 2018
  • Read Ian Goodfellow’s Deep Learning Book
  • Contribute to TensorFlow
  • Compete in a more challenging Kaggle competition
  • Work on HackerRank problems to strengthen my interview skills
  • Get a job related to ML/AI (preferably some kind of research role)