LTFN 4: Intro to Convolutional Neural Networks

Part of the series Learn TensorFlow Now

The neural networks we’ve built so far have had a relatively simple structure. The input to each layer is fully connected to the output of the previous layer. For this reason, these layers are commonly called fully connected layers.

Two fully connected layers in a neural network.

This has been mathematically convenient because we’ve been able to represent each layer’s output as a matrix multiplication of the previous layer’s output (a vector) with the current layer’s weights.

However, as we build more complex networks for image recognition, there are certain properties we want that are difficult to get from fully connected layers. Some of these properties include:

  1. Translational Invariance – A fancy phrase for “A network trained to recognize cats should recognize cats equally well if they’re in the top left of the picture or the bottom right of the picture”. If we move the cat around the image, we should still expect to recognize it.

    Translational invariance suggests we should recognize objects regardless of where they’re located in the image.
  2. Local Connectivity – This means that we should take advantage of features within a certain area of the image. Remember that in previous posts we treated the input as a single row of pixels. This meant that local features (e.g. edges, curves, loops) are very hard for our networks to identify and pick out. Ideally our network should try to identify patterns than occur within local regions of the image and use these patterns to influence its predictions.


Today we’re going look at one of the most successful classes of neural networks: Convolutional Neural Networks. Convolutional Neural Networks have been shown to give us both translational invariance and local connectivity.

The building block of a convolutional neural network is a convolutional filter. It is a square (typically 3x3) set of weights. The convolutional filter looks at pieces of the input of the same shape. As it does, it takes the dot product of the weights with the input and saves the result in the output. The convolutional filter is dragged along the entire input until the entire input has been covered. Below is a simple example with a (random) 5x5 input and a (random) 3x3 filter.


So why is this useful? Consider the following examples with a vertical line in the input and a 3×3 filter with weights chosen specifically to detect vertical edges.

Vertical edge detection from light-to-dark.
Vertical edge detection from dark-to-light.

We can see that with hand-picked weights, we’re able to generate patterns in the output. In this example, light-to-dark transitions produce large positive values while dark-to-light transitions produce large negative values. Where there is no change at all, the filter will simply produce zeroes.

While we’ve chosen the above filter’s weights manually, it turns out that training our network via gradient descent ends up selecting very good weights for these filters. As we add more convolutional layers to our network they begin to be able to recognize more abstract concepts such as faces, whiskers, wheels etc.



You may have noticed that the output above has a smaller width and height than the original input. If we pass this output to another convolutional layer it will continue to shrink. Without dealing with this shrinkage, we’ll find that this puts an upper bound on how many convolutional layers we can have in our network.

SAME Padding

The most common way to deal with this shrinkage is to pad the entire image with enough zeroes such that the output shape will have the same width and height as the input. This is called SAME padding and allows us to continue passing the output to more and more convolutional layers without worrying about shrinking width and height dimensions. Below we take our first example (5×5 input) and pad it with zeroes to make sure the output is still 5×5.


A 5x5 input padded with zeroes to generate a 5x5 output.

VALID Padding

VALID padding does not pad the input with anything. It probably would have made more sense to call it NO padding or NONE padding.

VALID padding results in shrinkage in width and height.



So far we’ve been moving the convolutional filter across the input one pixel at a time. In other words, we’ve been using a stride=1. Stride refers to the number of pixels we move the filter in the width and height dimension every time we compute a dot-product.  The most common stride value is stride=1, but certain algorithms require larger stride values. Below is an example using stride=2.

Notice that larger stride values result in larger decreases in output height and width. Occasionally this is desirable near the start of a network when working with larger images. Smaller input width and height can make the calculations more manageable in deeper layers of the network.


Input Depth

In our previous examples we’ve been working with inputs that have variable height and width dimensions, but no depth dimension. However, some images (e.g. RGB) have depth, and we need some way to account for it. The key is to extend our filter’s depth dimension to match the depth dimension of the input.

Unfortunately, I lack the animation skills to properly show an animated example of this, but the following image may help:

Convolution over an input with a depth of 2 using a single filter with a depth of 2.

Above we have an input of size 5x5x2 and a single filter of size 3x3x2. The filter is dragged across the input and once again the dot product is taken at each point. The difference here is that there are 18 values being added up at each point (9 from each depth of the input image). The result is an output with a single depth dimension.


Output Depth

We can also control the output depth by stacking up multiple convolutional filters. Each filter acts independently of one another while computing its results and then all of the results are stacked together to create the ouptut. This means we can control output depth simply by adding or removing convolutional filters.

Two convolutional filters result in a output depth of two.

It’s very important to note that there are two distinct convolutional filters above. The weights of each convolutional filter are distinct from the weights of the other convolutional filter. Each of these two filters has a shape of 3x3x2. If we wanted to get a deeper output, we could continue stacking more of these 3x3x2 filters on top of one another.

Imagine for a moment that we stacked four convolutional filters on top of one another, each with a set of weights trained to recognize different patterns. One might recognize horizontal edges, one might recognize vertical edges, one might recognize diagonal edges from top-left to bottom-right and one might recognize diagonal edges from bottom-left to top-right. Each of these filters would produce one depth layer of the output with values where their respective edges were detected. Later layers of our network would be able to act on this information and build up even more complex representations of the input.


Next up

There is a lot to process in this post. We’ve seen a brand new building block for our neural networks called the convolutional filter and a myriad of ways to customize it. In the next post we’ll implement our first convolutional neural network in TensorFlow and try to better understand practical ways to use this building block to build a better digit recognizer.

LTFN 3: Deeper Networks

Part of the series Learn TensorFlow Now

In the last post, we saw our network achieve about 60% accuracy. One common way to improve a neural network’s performance is to make it deeper. Before we start adding layers to our network, it’s worth taking a moment to explore one of the key advantages of deep neural networks.

Historically, a lot of effort was invested in crafting hand-engineered features that could be fed to shallow networks (or other learning algorithms). In image detection we might modify the input to highlight horizontal or vertical edges. In voice recognition we might filter out noise or various frequencies not typically found in human speech. Unfortunately, hand-engineering features often required years of expertise and lots of time.

Below is a network created with TensorFlow Playground that demonstrates this point. By feeding modified versions of the input to a shallow network, we are able to train it to recognize a non-linear spiral pattern.

A shallow network requires various modifications to the input features to classify the “Swiss Roll” problem.

A shallow network is capable of learning complex patterns only when fed modified versions of the input. A key idea behind deep learning is to do away with hand-engineered features whenever possible. Instead, by making the network deeper, we can convince the network to learn the features it really needs to solve the problem. In image recognition, the first few layers of the network learn to recognize simple features (eg. edge detection), while deeper layers respond to more complex features (eg. human faces). Below, we’ve made the network deeper and removed all dependencies on additional features.

A deep network is capable of classifying the points in a “Swiss Roll” using only the original input.


Making our network deeper

Let’s try making our network deeper by adding two more layers. We’ll replace layer1_weights and layer1_bias with the following:

Note: When discussing the network’s shapes, I ignore the batch dimension. For example, where a shape is [None, 784] I will refer to it as a vector with 784 elements. I find it helps to imagine a batch size of 1 to avoid having to think about more complex shapes.

The first thing to notice is the change in shape. layer1 now accepts an input of 784 values and produces an intermediate vector layer1_output with 500 elements. We then take these 500 values through layer2 which also produces an intermediate vector layer2_output with 500 elements. Finally, we take these 500 values through layer3 and produce our logit vector with 10 elements.

Why did I choose 500 elements? No reason, it was just an arbitrary value that seemed to work. If you’re following along at home, you could try adding more layers or making them wider (ie. use a size larger than 500).


Another important change is the addition of tf.nn.relu() in layer1 and layer2. Note that it is applied to the result of the matrix multiplication of the previous layer’s output with the current layer’s weights.

So what is a ReLU? ReLU stands for “Rectified Linear Unit” and is an activation function. An activation function is applied to the output of each layer of a neural network. It turns out that if we don’t include activation functions, it can be mathematically shown (by people much smarter than me) that our three layer network is equivalent to a single layer network. This is obviously a BadThing™ as it means we lose all the advantages of building a deep neural network.

I’m (very obviously) glossing over the details here, so if you’re new to neural networks and want to learn more see: Why do you need non-linear activation functions?

Other historical activation functions include sigmoid and tanh. These days, ReLU is almost always the right choice of activation function and we’ll be using it exclusively for our networks.

Graphs for ReLU, sigmoid and tanh functions


Learning Rate

Finally, one other small change needs to be made: The learning rate needs to be changed from 0.01 to 0.0001. Learning rate is one of the most important, but most finicky hyperparameters to choose when training your network. Too small and the network takes a very long time to train, too large and your network doesn’t converge. In later posts we’ll look at methods that can help with this, but for now I’ve just used the ol’ fashioned “Guess and Check” method until I found a learning rate that worked well.


Alchemy of Hyperparameters

We’ve started to see a few hyperparameters that we must choose when building a neural network:

  • Number of layers
  • Width of layers
  • Learning rate

It’s an uncomfortable reality that we have no good way to choose values for these hyperparameters. What’s worse is that we typically can’t explain why a certain hyperparameter value works well and others do not. The only reassurance I can offer is:

  1. Other people think this is a problem
  2. As you build more networks, you’ll develop a rough intuition for choosing hyperparameter values


Putting it all together

Now that we’ve chosen a learning rate and created more intermediate layers, let’s put it all together and see how our network performs.




After running this code you should see output similar to:

Cost:  4596.864
Accuracy:  7.999999821186066 %
Cost:  882.4881
Accuracy:  30.000001192092896 %
Cost:  609.4177
Accuracy:  51.99999809265137 %
Cost:  494.5303
Accuracy:  56.00000023841858 %


Cost:  57.793114
Accuracy:  89.99999761581421 %
Cost:  148.92995
Accuracy:  81.00000023841858 %
Cost:  67.42319
Accuracy:  89.99999761581421 %
Test Cost:  107.98408660641905
Test accuracy:  85.74999994039536 %


Our network has improved from 60% accuracy to 85% accuracy. This is great progress, clearly things are moving in the right direction! Next week we’ll look at a more complicated neural network structure called a “Convolutional Neural Network” which is one of the basic building blocks of today’s top image classifiers.

For the sake of completeness, I’ve included a TensorBoard visualization of the network we’ve created below:

Visualization of our three-layer network with `layer1` expanded. Notice the addition of `layer1_output` following the addition with `layer1_bias`. This represents the ReLU activation function.





LTFN 2: Graphs and Shapes

Part of the series Learn TensorFlow Now

TensorFlow Graphs

Before we improve our network, we have to take a moment to chat about TensorFlow graphs. As we saw in the previous post, we follow two steps when using TensorFlow:

  1. Create a computational graph
  2. Run data through the graph using

Let’s take a look at what’s actually happening when we call Consider our graph and session code from last time:

When we pass optimizer and cost to, TensorFlow looks at the dependencies for these two nodes. For example, we can see above that optimizer depends on:

  • cost
  • layer1_weights
  • layer1_bias
  • input

We can also see that cost depends on:

  • logits
  • labels

When we wish to evaluate optimizer and cost, TensorFlow first runs all the operations defined by the previous nodes, then calculates the required results and returns them. Since every node ends up being a dependency of optimizer and cost, this means that every operation in our TensorFlow graph is executed with every call to

But what if we don’t want to run every operation? If we want to pass test data to our network, we don’t want to run the operations defined by optimizer. (After all, we don’t want to train our network on our test set!) Instead, we’d just want to extract predictions from logits. In that case, we could instead run our network as follows:

This would execute only the subset of nodes required to compute the values of logits, highlighted below:

Our computational graph with only dependencies of logits highlighted in orange.

Note: As labels is not one of the dependencies of logits we don’t need to provide it.

Understanding the dependencies of the computational graphs we create is important. We should always try to be aware of exactly what operations will be running when we call to avoid accidentally running the wrong operations.



Another important topic to understand is how TensorFlow shapes work. In our previous post all our shapes were completely defined. Consider the following tf.Placeholders for input and labels:

We have defined these tensors to have a 2-D shape of precisely (100, 784) and (100, 10). This restricts us to a computational graph that always expects 100 images at a time. What if we have a training set that isn’t divisible by 100? What if we want to test on single images?

The answer is to use dynamic shapes. In places where we’re not sure what shape we would like to support, we just substitute in None. For example, if we want to allow variable batch sizes, we simply write:

Now we can pass in batch sizes of 1, 10, 283 or any other size we’d like. From this point on, we’ll be defining all of our tf.Placeholders in this fashion.



One important question remains: “How well is our network doing?“. In the previous post, we saw cost decreasing, but we had no concrete metric against which we could compare our network. We’ll keep things simple and use accuracy as our metric. We just want to measure the average number of correction predictions:

In the first line, we convert logits to a set of predictions using tf.nn.softmax. Remember that our labels are 1-hot encoded, meaning each one contains 10 numbers, one of which is 1. logits is the same shape, but the values in logits can be almost anything. (eg. values in logits could be -4, 234, 0.5 and so on). We want our predictions to have a few qualities that logits does not possess:

  1. The sum of the values in predictions for a given image should be 1
  2. No values in predictions should be greater than 1
  3. No values in predictions should be negative
  4. The highest value in predictions will be our prediction for a given image. (We can use argmax to find this)

Applying tf.nn.softmax() to logits gives us these desired properties. For more details on softmax, watch this video by Andrew Ng.

The second line takes the argmax of our predictions and of our labels. Then tf.equal creates a vector that contains either True (when the values match) and False when the values don’t match.

Finally, we use tf.reduce_mean to calculate the average number of times we get the prediction correct for this batch. We store this result in accuracy.

Putting it all together

Now that we better understand TensorFlow graphs, shape and have a metric with which to judge our algorithm, let’s put it all together to evaluate our performance on the test set, after training has finished.

Note that almost all of the new code relates to running the test set.

One question you might ask is: Why not just predict all the test images at once, in one big batch of 10,000? The problem is that when we train larger networks on our GPU, we won’t be able to fit all 10,000 images and the required operations in our GPU’s memory. Instead we have to process the test set in batches similar to how we train the network.

Finally, let’s run it and look at the output. When I run it on my local machine I receive the following:

Cost:  20.207457
Accuracy:  7.999999821186066 %
Cost:  10.040323
Accuracy:  14.000000059604645 %
Cost:  8.528659
Accuracy:  14.000000059604645 %
Cost:  6.8867884
Accuracy:  23.999999463558197 %
Cost:  7.1556334
Accuracy:  21.99999988079071 %
Cost:  6.312024
Accuracy:  28.00000011920929 %
Cost:  4.679361
Accuracy:  34.00000035762787 %
Cost:  5.220028
Accuracy:  34.00000035762787 %
Cost:  5.167577
Accuracy:  23.999999463558197 %
Cost:  3.5488296
Accuracy:  40.99999964237213 %
Cost:  3.2974648
Accuracy:  43.00000071525574 %
Cost:  3.532155
Accuracy:  46.99999988079071 %
Cost:  2.9645846
Accuracy:  56.00000023841858 %
Cost:  3.0816755
Accuracy:  46.99999988079071 %
Cost:  3.0201495
Accuracy:  50.999999046325684 %
Cost:  2.7738256
Accuracy:  60.00000238418579 %
Cost:  2.4169116
Accuracy:  55.000001192092896 %
Cost:  1.944017
Accuracy:  60.00000238418579 %
Cost:  3.5998762
Accuracy:  50.0 %
Cost:  2.8526196
Accuracy:  55.000001192092896 %
Test Cost:  2.392377197146416
Test accuracy:  59.48999986052513 %
Press any key to continue . . .

So we’re getting a test accuracy of ~60%. This is better than chance, but it’s not as good as we’d like it to be. In the next post, we’ll look at different ways of improving the network.

LTFN 1: Intro to TensorFlow

Part of the series Learn TensorFlow Now

Over the next few posts, we’ll build a neural network that accurately reads handwritten digits. We’ll go step-by-step, starting with the basics of TensorFlow and ending up with one of the best networks in the ILSCRC 2013 image recognition competition.

MNIST Dataset

The MNIST dataset is one of the simplest image datasets and makes for a perfect starting point. It consists of 70,000 images of handwritten digits. Our goal is to build a neural network that can identify the digit in a given image.

  • 60,000 images in the training set
  • 10,000 images in the test set
  • Size: 28×28 (784 pixels)
  • 1 Channel (ie. not RGB)
Sample images from MNIST

To start, we’ll import TensorFlow and our dataset:

TensorFlow makes it easy for us to download the MNIST dataset and save it locally. Our data has been split into a training set on which our network will learn and a test set against which we’ll check how well we’ve done.

Note: The labels are represented using one-hot encoding which means:

0 is represented by 1 0 0 0 0 0 0 0 0 0
1 is represented by 0 1 0 0 0 0 0 0 0 0

9 is represented by 0 0 0 0 0 0 0 0 0 1

Note: By default, the images are represented as arrays of 784 values. Below is a sample of what this might look like for a given image:

TensorFlow Graphs

There are two steps to follow when training our own neural networks with TensorFlow:

  1. Create a computational graph
  2. Run data through the graph so our network can learn or make predictions

Creating a Computational Graph

We’ll start by creating the simplest possible computational graph. Notice in the following code that there is nothing that touches the actual MNIST data. We are simply creating a computational graph so that we may later feed our data to it.

For first-time TensorFlow users there’s a lot to unpack in the next few lines, so we’ll take it slow.

Before explaining anything, let’s take a quick look at the network we’ve created. Below are two different visualizations of this network at different granularities that tell slightly different stories about what we’ve created.

Left: A functional visualization of our single layer network. The 784 input values are each multiplied by a weight which feeds into our ten logits.
Right: The graph created by TensorFlow, including nodes that represent our optimizer and cost.The first two lines of our code simply define a TensorFlow graph and tell TensorFlow that all the following operations we define should be included in this graph.

Next, we use tf.Placeholder to create two “Placeholder” nodes in our graph. These are nodes for which we’ll provide values every time we run our network. Our placeholders are:

  • input which will contain batches of 100 images, each with 784 values
  • labels which will contain batches of 100 labels, each with 10 values

Next we use tf.Variable to create two new nodes, layer1_weights and layer1_biases. These represent parameters that the network will adjust as we show it more and more examples. To start, we’ve made layer1_weights completely random, and layer1_biases all zero. As we learn more about neural networks, we’ll see that these aren’t the greatest choice, but they’ll work for now.

After creating our weights, we’ll combine them using tf.matmul to matrix multiply them against our input and + to add this result to our bias. You should note that + is simply a convenient shorthand for tf.add.  We store the result of this operation in logits and will consider the output node with the highest value to be our network’s prediction for a given example.

Now that we’ve got our predictions, we want to compare them to the labels and determine how far off we were. We’ll do this by taking the softmax of our output and then use cross entropy as our measure of “loss” or cost. We can perform both of these steps using tf.nn.softmax_cross_entropy_with_logits. Now we’ve got a measure of loss for all the examples in our batch, so we’ll just take the mean of these as our final cost.

The final step is to define an optimizer. This creates a node that is responsible for automatically updating the tf.Variables (weights and biases) of our network in an effort to minimize cost. We’re going to use the vanilla of optimizers: tf.train.GradientDescentOptimizer. Note that we have to provide a learning_rate to our optimizer. Choosing an appropriate learning rate is one of the difficult parts of training any new network. For now we’ll arbitrarily use 0.01 because it seems to work reasonably well.


Running our Neural Network

Now that we’ve created the network it’s time to actually run it. We’ll pass 100 images and labels to our network and watch as the cost decreases.

The first line creates a TensorFlow Session for our graph. The session is used to actually run the operations defined in our graph and produce results for us.

The second line initializes all of our tf.Variables. In our example, this means choosing random values for layer1_weights and setting layer1_bias to all zeros.

Next, we create a loop that will run for 1,000 training steps with a batch_size of 100. The first three lines of the loop simply select out 100 images and labels at a time. We store batch_images and batch_labels in feed_dict. Note that the keys of this dictionary input and labels correspond to the tf.Placeholder nodes we defined when creating our graph. These names must match, and all placeholders must have a corresponding entry in feed_dict.

Finally, we run the network using where we pass in feed_dict. Notice that we also pass in optimizer and cost. This tells TensorFlow to evaluate these nodes and to store the results from the current run in o and c. In the next post, we’ll touch more on this method, and how TensorFlow executes operations based on the nodes we supply to it here.


Now that we’ve put it all together, let’s look at the (truncated) output:

 Cost: 12.673884
 Cost: 11.534428
 Cost: 8.510129
 Cost: 9.842179
 Cost: 11.445622
 Cost: 8.554568
 Cost: 9.342157
 Cost: 4.811098
 Cost: 4.2431364
 Cost: 3.4888883
 Cost: 3.8150232
 Cost: 4.206609
 Cost: 3.2540445

Clearly the cost is going down, but we still have many unanswered questions:

  • What is the accuracy of our trained network?
  • How do we know when to stop training? Was 1,000 steps enough?
  • How can we improve our network?
  • How can we see what its predictions actually were?

We’ll explore these questions in the next few posts as we seek to improve our performance.

Complete source


2017: A Retrospective

2017 saw a lot of change for me. I left Microsoft, returned to Toronto and shifted my focus from developer tools to machine learning. Following in patio11’s footsteps, I wanted to take a moment to reflect on the year and clarify my thoughts in written form.


July 2016 – July 2017

In July 2016, my company Code Connect joined Microsoft with the stated goal of integrating our Alive extension into Visual Studio 2017. Unfortunately our visas were delayed and we weren’t able to begin work until early September. It didn’t make sense to rush Alive into Visual Studio 2017 and risk introducing stability problems so we held off for a few months.

In the meantime, we conducted a number of experiments to try to quantify demand for Alive so we could compare it to other potential features. When the experiments concluded, the data didn’t demonstrate an immediate need for a product like Alive. Alive was put on ice and we worked on features that were immediately pressing to Visual Studio’s success. At the time this meant focusing on accessibility, a top-down directive from Satya Nadella himself.

After Alive was sunset, I did some reflection on where I wanted to take my career and what I wanted to focus on. For a long time I’ve worred that the knowledge I’ve accumulated writing Visual Studio extensions is not immediately applicable outside of Visual Studio. Despite working on developer tools for almost five years, I would be on near-equal footing with a junior developer when it comes to developing extensions for VS Code or Eclipse. Much of my expertise comes in the form of random bits of trivia about Visual Studio. The problems I was faced with were challenging, but not very interesting.

The more I complained, the more I realized something had to change. In July 2017 I left Microsoft and applied for a position at a company for which I’d previously been an intern. However I didn’t have the C++ experience required for the roles they were staffing so we concluded it probably wouldn’t be a great fit.

While at Microsoft I began to learn about machine learning in my free time. In 2015 I’d watched in awe from the sidelines as AlphaGo crushed its human counterparts and I wanted in. I decided that now was the time for me to focus exclusively on machine learning and deep learning in particular.

Would I do it again?

Yup. Alive’s best chance for the long-term was a home at Microsoft. We had a modest number of paying customers but required an order of magnitude more in order to grow  and continue full-time work on Alive. We had grand dreams of bringing Alive to other languages and we wouldn’t have been able to do so without hiring more developers. Microsoft came to us at the perfect time and gave Alive one last shot at success. I’m sad it didn’t work out, but I’m eternally grateful to everyone involved in getting us to Microsoft and to the Visual Studio editor team for being my home for the year.

Machine Learning

August 2017 – Ongoing

The hardest thing about starting something brand new is figuring out where to start. I settled on a mix of linear algebra, Coursera, Kaggle and open-source work.

In September Andrew Ng launched a new deep learning course that covered the following:

I’ve completed the first four and am waiting for the final course on Sequence Models to launch in the coming month. These courses paired nicely with Andrej Karpathy’s CS231n videos. This course will be my go-to answer for the question “I want to learn about neural networks, where do I start?”


I wanted to apply the lessons I learned in Andrew’s videos to my own neural networks. I set out to compete in the introductory Kaggle competition “MNIST Digit Recognizer”. As I learned more and more about neural networks I would apply these lessons to my network and watch the score improve. Being told “batch normalization will improve your results” is one thing, but watching your score tick higher is something else altogether. As of this writing my best submission puts me in the top 25% of submissions with 99.171% accuracy.

My top submission.



I set a personal goal to contribute at least one pull request to TensorFlow so as to better understand the tool I was using. Coming from a .NET Desktop background, there was a bit of a learning curve when it came to tools like bazel and Docker. However, like most things in software development these tools just require a bit of time and focused energy to understand.

I’ve seen mixed success with my contributions. My first pull request was a correction to TensorFlow’s implementation of Inception network. The reviewers agreed that the initial model was incorrect, but are hesitant to change the model due to backward compatibility concerns.

My second pull request improved support for various image operations in TensorFlow. In short, it made it easier to augment multiple images at once. (eg. Randomly flipping images left-to-right). Unfortunately, I introduced some performance regressions and my changes had to be reverted. 😢

My third pull request is a re-implementation of the last, while avoiding the performance regressions. It remains open, but I’m confident that after some work it will be accepted.

On the whole, I’m pleased with the progress I made with TensorFlow. The API surface is massive and I have a lot to learn, but I’m making real, measurable progress. I’ll continue to contribute back code where appropriate.

ICLR 2018 Reproducibility Challenge

At the end of each course, Andrew Ng took time to interview famous names in machine learning such as Geoffrey Hinton and Ian Goodfellow. Shared advice they had for newcomers was “Reproduce papers”. Around the same time, I stumbled upon the ICLR 2018 Reproducibility Challenge where students are challenged to reproduce the results of papers submitted to the ICLR conference.

I signed up and chose the paper “Super-Convergence: Very Fast Training of Residual Networks Using Large Learning Rates”. This paper proposed a method for training certain neural networks an order of magnitude faster than previous methods allowed. Their approach involved varying the learning rate linearly between (what are typically considered) large values throughout training.

This was the hardest portion of my work thus far and forced me to delve into the details of TensorFlow. In December I made my report available in the comments of their paper’s submission. The TensorFlow portion of my work is available on GitHub at:


My only regret during 2017 was that I published zero blog posts. As such, this was the first year that traffic to my blog decreased.

2017 saw a modest decline in blog traffic


I often tell others to start blogging, and this year my actions didn’t match my words. I attribute my poor track-record to one-part laziness and one-part lack of confidence. It’s surprisingly difficult to work up the courage to write about a subject when you’re brand new to it.

Goals for 2018

  • Follow Jeff Atwood’s advice for bloggers and stick to a schedule for blogging. I want to buckle down and write one blog post a week in 2018
  • Read Ian Goodfellow’s Deep Learning Book
  • Contribute to TensorFlow
  • Compete in a more challenging Kaggle competition
  • Work on HackerRank problems to strengthen my interview skills
  • Get a job related to ML/AI (preferably some kind of research role)



EnC Part 3 – The CLR

In the last post, we looked at using Roslyn to generate deltas between two compilations. Today we’ll take a look at how we can apply these deltas to a running process.


If you dig through Microsoft’s .NET Reference source occasionally you’ll come across extern methods like FastAllocateString() decorated with a special attribute: [MethodImplAttribute(MethodImplOptions.InternalCall)]. These are entry points to the CLR that can be called from managed code. Calling into the CLR can be done for a number of reasons. In the case of FastAllocateString it’s to implement certain functionality in native code for performance (in this case without even the overhead of P/Invoke). Other entry points are exposed to trigger CLR behavior like garbage collection or to apply deltas to a running process.

When I started this project I wasn’t even aware the CLR had an API. Fortunately Microsoft has recently released internal documentation that explains much of the CLR’s behavior including these APIs. Mscorlib and Calling Into the Runtime documents the differences between FCall, QCall and P/Invoke as entrypoints to the CLR from managed code.

Managing the many methods, classes and interfaces is a huge pain and too much work to do manually when starting out. Luckily Microsoft has released a managed wrapper that makes a lot of this stuff easier to work with. The Managed Debug Sample (mdbg) has everything we’ll need to attach to a process and apply changes to it.

The sample has a few extra projects. For our purposes we’ll need:

  • corapi – The managed API we’ll interact with directly
  • raw – Set of interfaces and COMImports over the ICorDebug API
  • NativeDebugWrappers – Functionality for low level Windows debugging

Game Plan

At a high level, our approach is going to be the following:

  1. Create an instance of CorDebugger, a debugger we can use to create and attach itself to other processes.
  2. Start a remote process
  3. Intercept loading of modules and mark them for Edit and Continue
  4. Apply deltas


Creating an instance of the debugger is fairly involved. We first have to get an instance of the CLR host based on the version of the runtime we’re interested in (in our case anything after v4.0 will work). Working with the managed API is still awkward, certain types are created based on GUIDs that seem to be undocumented outside of sample code. Nonetheless the following code creates an instance of a managed debugger we can use.

In the following we get a list of runtimes available from the currently running process. I can’t offer insight into whether this is “good” or “bad” but it’s something to be aware of.

Starting the process

Once we’ve got a hold of our debugger, we can use it to start a process. While working on this I learned that we (in the .NET world) have been shielded from some of the peculiarities of creating a process on Windows. These peculiarities start to bleed through when creating processes with our custom debugger.

For example, if we want to send the argument 123456 to our new process, it turns our we have to send the process’ filename as the first argument as well. So the call to ICorDebug::CreateProcess(string applicationName, string commandLine) ends up looking something like

For more on this Mike Stall has a post on Conventions for passing the arguments to a process.

We also have to manually pass process flags when creating our process. These flags dictate various properties for our new process (Should a new window be created? Should we debug child processes of this process? etc.). Below we start a process, assuming that the application is in the current directory.

Mark Modules for Edit and Continue

By default the CLR doesn’t expect that EnC will be enabled. In order to enable it, we’ll have to manually set JIT flags on each module we’re interested in. CorDebug exposes an event that signals when a module has been loaded, so we’ll use this to control the flags.

A sample event handler for module loading might look like:

Notice in the above that we’re only setting the flag for the module we’re interested in. If we try to set the JIT flags for all modules we’ll run into exceptions when working with NGen-ed modules. The exception is a little cryptic and complains about “Zap Modules” but this turns out just to be the internal name for NGen modules.

Applying the Deltas

Finally. After three blog posts we’ve arrived at the point: Actually manipulating the running process.

In truth, we don’t apply our changes directly to the process, but to an individual module within it. So our first task is to find the individual module we’re want to change. We can search through all AppDomains, assemblies and modules to find the module with the correct name.

Once we find the module we want to request metadata about the module from it. This turns out to be a weird implementation detail in which the CLR assumes you can’t possible want to apply changes unless you’ve requested this info previously. We put this all together into the following:


I should at least touch on one more aspect of EnC I’ve glossed over thus far: remapping. If you are changing a method that has currently active statements, you will be given an opportunity to remap the current “Instruction Pointer” based on line number. It’s up to you to decide on which line execution should resume. The CorDebugger exposes OnFunctionRemapOpportunity and OnFunctionRemapComplete as events that allow you to guide remapping.

Here’s a sample remapping event handler:

We’ve now got all the pieces necessary to manipulate a running process and a good base to build off of. Complete code for today’s blog post can be found here on GitHub. Leave any questions in the comments and I’ll do my best to answer them or direct you to someone who can at Microsoft.

Edit and Continue Part 2 – Roslyn

Our first task is to coerce Roslyn to emit metadata and IL deltas between between two compilations. I say coerce because we’ll have to do quite a bit of work to get things working. The Compilation.EmitDifference() API is marked as public, but I’m fairly sure it’s yet to be actually used by the public. Getting everything to work requires reflection and manual copying of Roslyn code that doesn’t ship via NuGet.

The first order of business is to figure out what it takes to call Compilation.EmitDifference() in the first place. What parameters are we expected to provide? The signature:

So based on the above, the two input arguments that we need to worry about are EmitBasline and IEnumerable<SemanticEdit>. We’ll approach these one at a time.


An EmitBaseline represents a module created from a previous compilation. Modules live inside of assemblies and for our purposes it’s safe to assume that every module relates one-to-one with an assembly. (In reality multi-module assemblies can exist, but neither Visual Studio nor MSBuild support their creation). For more see this StackOverflow question.

We’ll look at the EmitBaseline as representing an assembly created from a previous compilation. We want to create a baseline to represent the initial compiled assembly before any changes are made to it. Roslyn can compare this baseline to new compilations we create.

An baseline can be created via EmitBaseline.CreateInitialBaseline()

Now we’ve got two more problems: ModuleMetadata and a function that maps between MethodDefinitionHandle and EditAndContinueMethodDebugInformation.

ModuleMetadata simply represents summary information about our module/assembly. Thankfully we can create it easily by passing our initial assembly to either ModuleMetadata.CreateFromFile (for assemblies on disk) or ModuleMetadata.CreateFromStream (for assemblies in memory).

Func<MethodDefinitionHandle, EditAndContinueMethodDebugInformation> proves much harder to work with. This function maps between methods and various debug information including a method’s local variable slots, lambdas and closures. This information can be generated by reading .pdb symbol files. Unfortunately there’s no public API for generating this function. What’s worse is that we’ll have to use test APIs that don’t even ship via NuGet so even Reflection is out of the question.

Instead we’ll have to piece together bits of code from Roslyn’s test utilities. Ultimately this requires that we copy code from the following files:

We’ll also need to include two NuGet packages:

It’s a bit of a pain that we need to bring so much of Roslyn with us just for the sake of one file. It’s sort of like working with a ball of yarn; you pull on one string and the whole thing comes with it.

The SymReaderFactory coupled with the DiaSymReader packages can interpret debug information from Microsoft’s PDB format. Once we’ve copied these files to our project we can use the SymReaderFactory to create a debug information provider by feeding the PDB stream to SymReaderFactory.CreateReader().


SemanticEdits describe the differences between compilations at the symbol level. For example, modifying a method will introduce a SemanticEdit for the corresponding IMethodSymbol marking is as updated. Roslyn will end up converting these SemanticEdits into proper IL and metadata deltas.

It turns out SemanticEdit is a public class. The problem is that they’re difficult to generate properly. We have to diff Documents across different versions of a Solution which means we have to take into account changes in syntax, trivia and semantics. We also have to detect invalid changes which aren’t (to my knowledge) officially or completely documented anywhere. In this Roslyn issue, I propose three potential approaches to generating the edits, but we’ll only take a look at the one I’ve implemented myself: using the internal CSharpEditAndContinueAnalyzer.

The CSharpEditAndContinueAnalyzer and its base class method AnalyzeDocumentAsync will generate a DocumentAnalysisResult with our edits along with some supplementary information about the changes. Were there errors? Were the changes substantial? Were there special areas of interest such as catch or finally blocks?

Since these classes are internal we’ll have to use Reflection to get at them. We’ll also need to keep a copy of the Solution around with which we used to generate our EmitBaseline. I’ve put all of the code together into a complete sample. The reflection based approach for CSharpEditAndContinueAnalyzer is demonstrated in the GetSemanticEdits method below.

We can see that this is quite a bit of work just to build the edits. In the above sample we made a number of simplifying assumptions. We assumed there were no errors in the compilation, that there were no illegal edits and no active statements. It’s important to cover all cases if you plan to consume this API properly.

Our next step will be to apply these deltas to a running process using APIs exposed by the CLR.