Microsoft Malware Prediction

Last month I participated in my first Kaggle competition: Microsoft Malware Prediction. Competitors were given a single train.csv that contained information about millions of people’s PCs. This information included things like screen size, operating system version and various security settings. Competitors were tasked with predicting whether or not a given computer was likely to have been infected with malware.

Exploring the data

Seeing as this was my first Kaggle competition and I didn’t really know what I was doing, I started off by looking at other users’ public kernels. This is where users share the tricks and tips that worked for them. While you’re obviously not going to find a solution that wins you the competition, it’s the perfect place for a beginner like myself to start out.

My personal favorite kernel was “My EDA – I want to see all”. This kernel allowed me to look at individual columns in our dataset and gain some intuition about whether or not they might be useful. For example, here is a plot for the Platform column:


This plot shows that windows2016 is a category with some degree of predictive power. However, it also tells us that there are just 14,000 examples of windows2016 out of a total of ~9,000,000 rows, so it’s not going to be super useful. Being able to look at columns quickly like this feels like it will be very useful in future competitions, and I have incorporated this approach into my KaggleUtils library.
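Here is a rough sketch of that kind of per-column view (I’m assuming the competition’s column names, e.g. the HasDetections target; the real kernel is far more thorough):

import pandas as pd

# For each category in a column: how many rows does it cover, and what fraction
# of those machines were flagged as infected?
train = pd.read_csv('train.csv', usecols=['Platform', 'HasDetections'])

summary = (train.groupby('Platform')['HasDetections']
                .agg(['count', 'mean'])
                .rename(columns={'count': 'examples', 'mean': 'detection_rate'})
                .sort_values('examples', ascending=False))
print(summary)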

Another interesting discussion demonstrated how we could map Windows Defender versions to dates. This feature helped other users uncover that the test set and train set came from different points in time, which made validation difficult. Many of us saw improvements on our local validation sets but little or no improvement when we submitted predictions to the leaderboard. This highlighted the biggest lesson I’ve learned so far:

Creating a good validation set should be my first priority.

In this competition (and a few others I’ve participated in since) this one lesson keeps coming up again and again: you cannot build a good model without a good validation set. It is hard to do and it is tedious to do, but without it you are going into each contest blind.

Cleaning the Data

After exploring the data and getting a sense of what each column meant, it was time to clean the data. This meant dropping columns that wouldn’t help us predict whether a computer had malware. In my case this meant columns that were entirely unique (MachineId) or columns where more than 99.99% of the values were identical.

I also dropped categories within columns if those categories were not present in the test set. My reasoning was that if a category wasn’t present in the test set (say, a particular version of Windows Defender), then learning to depend on it for our predictions could only hurt performance on the test set.

Finally, I grouped together categories that had very few (< 1,000) examples in the training set. I figured categories with so few examples wouldn’t give us much of a predictive edge, so I merged all of these tiny categories into a single OTHER category.
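Roughly, the cleaning looked something like this (a sketch; the threshold values and helper are illustrative rather than my exact code):

import pandas as pd

def clean(train, test, target='HasDetections', rare_threshold=1000):
    features = [c for c in train.columns if c != target]
    for col in features:
        # Drop columns that are entirely unique (eg. machine identifiers)
        # or where a single value covers more than 99.99% of rows.
        top_fraction = train[col].value_counts(normalize=True, dropna=False).iloc[0]
        if train[col].nunique() == len(train) or top_fraction > 0.9999:
            train = train.drop(columns=col)
            test = test.drop(columns=col)
            continue

        # Collapse categories that never appear in the test set, or that have
        # fewer than rare_threshold training examples, into a single OTHER bucket.
        counts = train[col].value_counts()
        keep = counts[counts >= rare_threshold].index.intersection(test[col].unique())
        train[col] = train[col].where(train[col].isin(keep), 'OTHER')
        test[col] = test[col].where(test[col].isin(keep), 'OTHER')
    return train, test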

Next I incorporated dates into my training set. After mapping Windows Defender versions to dates, I broke those dates into categorical features like Month, Day and DayOfWeek. I don’t believe this helped my model, but it’s an approach I would use in future competitions where predictions might depend on these sorts of features (for example, a store’s revenue might depend on whether or not it’s payday).
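In pandas this is only a few lines once the versions have been mapped to a datetime column (a sketch; I’m assuming train is the training DataFrame and the mapped column is called release_date):

import pandas as pd

# 'release_date' is the Defender version mapped to a date via the forum's lookup table.
train['release_date'] = pd.to_datetime(train['release_date'])
train['Month'] = train['release_date'].dt.month
train['Day'] = train['release_date'].dt.day
train['DayOfWeek'] = train['release_date'].dt.dayofweek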

Model

While everyone successful in this competition used an LGBM, I chose a neural network simply because it was what I was most familiar with after recently going through fast.ai’s course. The data for this competition was largely categorical, so LGBM was very well suited to the task. I actually started an LGBM approach of my own but never completed it. I wish I had seen it through, because it would have helped me understand whether my data cleaning steps gave me any kind of edge.
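For reference, a minimal LightGBM baseline for this kind of mostly-categorical data might have looked something like this (a sketch, not the model I actually ran; it also uses a naive random split, which is exactly the validation mistake this post warns about):

import lightgbm as lgb
from sklearn.model_selection import train_test_split

X = train.drop(columns='HasDetections')
y = train['HasDetections']

# LightGBM handles categorical features natively if they use the 'category' dtype.
for col in X.select_dtypes('object').columns:
    X[col] = X[col].astype('category')

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)

model = lgb.LGBMClassifier(n_estimators=1000, learning_rate=0.05)
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], eval_metric='auc')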

Future Improvements

There was a lot of room for improvement in how I approached this problem and ran my experiments.

  • Create a validation set with the same distribution as the test set
    • By far the most important change would be to create a validation set with the proper distribution. Some of the kernels even demonstrated how to do this, but out of personal laziness I ignored them.
  • Reproducible Runs
    • While exploring hyperparameters I would often just make arbitrary changes and hope they improved things. Despite using source control I had very little infrastructure for making runs reproducible. Recording the exact parameters and results of each run would make my changes much more manageable.
  • Manage Memory
    • Running out of memory was an ongoing problem in this competition. Once EDA was complete I should have focused on shrinking the data’s size. Alternatively I should just buy more RAM.
  • Alternative Models
    • Although I’m glad I explored my approach with neural networks it was clear that LGBM models were superior on this task. I should have at least tried them out and submitted some predictions.
  • Utilities
    • Moving forward it would be useful to have a set of common tools for data exploration and cleaning. Many EDA tasks are similar across different competitions so having a semi-standardized approach seems like it would be a huge help. For this reason I’ve started working on a set of Kaggle Utilities.

Conclusion

Ultimately I scored 1,309th out of 2,426 competitors, so it was nothing to write home about. Overall I’m happy with the experience, and I’m confident that in future competitions I can improve my scores and placements by being diligent and grinding through some of the tedious elements of data science.

Kaggle Course: Week 2 Notes

Exploratory Data Analysis

While it’s tempting to just throw all our columns into a model, it’s worth our time to try to understand the data we’re modelling. We should:

  • Look for outliers
  • Look for errors. Consider adding an Is_Incorrect column to mark rows with errors
  • Try to figure out how the data was generated
    • eg. in the Microsoft Malware competition the test set came from a later point in time, which made many version columns mostly useless for training
  • Look at data
    • df.head().T
    • For dates, find min, max and number of days for test and train

Visualizations for Individual Features

  • Histograms:
    • plt.hist(x)
    • df['Column'].hist()
  • Plot Index vs. Values
    • plt.plot(x, target)
    • Look for patterns (data might not be shuffled)
  • Look at statistics for the data
    • df.describe()
    • x.mean()
    • x.var()
    • x.value_counts()
    • x.isnull()

Visualizations for Feature Interactions

  • Compare two features against one another
    • plt.scatter(x1, x2)
    • We can use this to check whether or not the distributions look the same in both the test and train set
    • Can be used to help generate new features
  • Compare multiple feature pairs against one another
    • pd.scatter_matrix(df)
  • Correlation between columns
    • df.corr()
    • plt.matshow(df.corr())
    • Consider running K-Means clustering to order the columns first

Dataset Cleaning

  • Remove duplicate columns or columns with constant values
    • Constant columns: train.nunique() == 1
    • Duplicate columns: train.T.drop_duplicates()
  • Remove duplicate categorical columns by label encoding them first, then dropping duplicates:
    • for f in categorical_features:
          train[f] = train[f].factorize()[0]  # factorize returns (codes, uniques)
      train.T.drop_duplicates()
  • Look for duplicate rows with different target values. Might be a mistake.
  • Look for identical rows in both train and test
  • Check if dataset is shuffled
    • Plot feature vs Row Index

Validation Strategies

There are a few different approaches we can use to create validation sets:

  • Holdout
    • Carve out a chunk of the dataset and test on it
    • Good when we have a large, evenly balanced dataset
    • sklearn.model_selection.ShuffleSplit
  • KFold
    • Split the dataset into chunks, use each chunk as a holdout (training from scratch each time) and average the scores
    • Good for medium-sized datasets
    • Stratification enforces an identical target distribution over each fold
    • sklearn.model_selection.KFold
  • Leave-one-out
    • Choose a single example for validation, train on others, repeat.
    • Good for very small datasets
    • sklearn.model_selection.LeaveOneOut
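A quick sketch of the three strategies with sklearn:

import numpy as np
from sklearn.model_selection import ShuffleSplit, StratifiedKFold, LeaveOneOut

X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Holdout: a single random train/validation split.
holdout = ShuffleSplit(n_splits=1, test_size=0.2, random_state=0)

# KFold: each chunk gets used as the validation set exactly once.
# StratifiedKFold also keeps the target distribution identical in every fold.
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Leave-one-out: KFold taken to the extreme of one example per fold.
loo = LeaveOneOut()

for train_idx, valid_idx in kfold.split(X, y):
    print(len(train_idx), len(valid_idx))  # 8 train / 2 validation per fold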

Choosing Data for Validation

We need to choose how to select values for our validation set. We want to build our validation set the same way as the organizers built their test set.

  • Random, row-wise
    • Simply select random values until our validation set is filled up
  • Timewise
    • Remove the final chunk based on time
    • Used for time-series contests
  • By Id
    • Sometimes the dataset contains multiple observations for multiple IDs and the test set will consist of IDs we’ve never seen before. In this case, we want to carve out a validation set of IDs that we don’t train on.
  • Combination
    • Some combination of the above

Problems with Validation

Sometimes improvements on our validation set don’t improve our test score.

  • If we have high variance in our validation scores, we should do extensive validation
    • Average scores from different KFold splits
    • Tune model on one split, evaluate score on another split
  • If test scores don’t match validation score
    • Check if there is too little data in public test set
    • Check if we’ve overfit
    • Check if we chose proper splitting strategy
    • Check if train/test have different distributions

Kaggle Course: Week 1 Notes

Recently I’ve been playing around with my first Kaggle competition. At the same time I’ve been going through the video material from Coursera’s How to Win a Data Science Competition. While all of the lectures are useful, a few contain specific, actionable pieces of advice that I’d like to catalog here.

Feature Pre-Processing and Generation

Numeric

Frequently we would like to scale numeric inputs to our models in order to make learning easier. This is especially useful for non-tree based models. The most common ways to scale our numeric inputs are:

  • MinMax – Scales values between 0 and 1.
    • sklearn.preprocessing.MinMaxScaler
  • Normalization – Scales values to have mean=0 and std=1
    • Good for neural networks
    • sklearn.preprocessing.StandardScaler
  • Log Transform
    • Good for neural networks
    • np.log(1+x)
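A minimal sketch of all three:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[1.0], [5.0], [10.0], [1000.0]])  # a skewed numeric feature

x_minmax = MinMaxScaler().fit_transform(x)      # squashed into [0, 1]
x_standard = StandardScaler().fit_transform(x)  # mean 0, std 1
x_log = np.log(1 + x)                           # compresses the large values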

We should look for outliers by plotting values. After finding them:

  • Clip our values to a chosen range (eg. the 1st and 99th percentiles)
    • np.clip(x, LOWERBOUND, UPPERBOUND)
  • Rank
    • Simply order the numeric values. Automatically deals with outliers.
    • scipy.stats.rankdata
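For example:

import numpy as np
from scipy.stats import rankdata

x = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])  # 1000.0 is an outlier

# Clip to the 1st and 99th percentiles (lower bound first, then upper bound).
lower, upper = np.percentile(x, [1, 99])
x_clipped = np.clip(x, lower, upper)

# Rank transform: the outlier simply becomes the largest rank.
x_ranked = rankdata(x)  # [1. 2. 3. 4. 5.]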

Keep in mind we can train different models on differently scaled/pre-processed data and then mix the models together. This might help if we’re not 100% sure which pre-processing steps are best.

Categorical

Most models need categorical data to be encoded in some way before the model can work with it. That is, we can’t just feed category strings to a model, we have to convert them to some number or vector first.

  • Label Encoding
    • Simply assign each category a number
    • sklearn.preprocessing.LabelEncoder
    • pandas.factorize
  • Frequency Encoding
    • Give each category a number based on how many times it appears in the combined train and test sets
    • eg. map each category to its frequency as a percentage, then optionally rank those frequencies (see the sketch after this list)
  • One-hot Encoding
    • Make a new column for each category with a single 1 and all other 0
    • Good for neural networks
    • pandas.get_dummies
    • sklearn.preprocessing.OneHotEncoder
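Frequency encoding has no one-line sklearn helper, but it’s easy to do by hand. A sketch:

import pandas as pd

train = pd.DataFrame({'City': ['A', 'A', 'B', 'C']})
test = pd.DataFrame({'City': ['A', 'B', 'B']})

# Map each category to its frequency in the combined train and test data.
freq = pd.concat([train['City'], test['City']]).value_counts(normalize=True)

train['City_freq'] = train['City'].map(freq)
test['City_freq'] = test['City'].map(freq)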

DateTime

Dates let us add lots of useful relationships that most (all?) models struggle to capture on their own.

  • Capture Periodicity
    • Break apart dates into Year, Day in Week, Day in Year etc.
  • Time-Since a particular event
    • Seconds passed since Jan 1, 1970
    • Days since last holiday
  • Difference between dates
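A sketch of each idea with pandas:

import pandas as pd

df = pd.DataFrame({'purchase_date': pd.to_datetime(['2019-01-01', '2019-02-14'])})
last_holiday = pd.Timestamp('2018-12-25')

# Periodicity: break the date apart.
df['year'] = df['purchase_date'].dt.year
df['day_of_week'] = df['purchase_date'].dt.dayofweek
df['day_of_year'] = df['purchase_date'].dt.dayofyear

# Time since a fixed event (seconds since Jan 1, 1970).
df['seconds_since_epoch'] = (df['purchase_date'] - pd.Timestamp('1970-01-01')) // pd.Timedelta('1s')

# Difference between two dates.
df['days_since_holiday'] = (df['purchase_date'] - last_holiday).dt.days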

Missing Values

  • Typically replace missing numeric values with an extreme value (eg. -999), the mean or the median
  • Missing values might already have been replaced with a placeholder in the dataset
  • Adding IsNan feature can be useful
  • Replace nans after feature generation
    • We don’t want the replaced nans to have any impact on means or other features we create
  • Some frameworks/algorithms can handle nans
    • fastai
    • xgboost
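For example, adding the indicator column before any imputation and only filling afterwards:

import numpy as np
import pandas as pd

df = pd.DataFrame({'income': [50000.0, np.nan, 62000.0]})

# Mark which rows were missing before we touch anything else.
df['income_isnan'] = df['income'].isnull().astype(int)

# Generate features while the NaNs are still NaNs so they don't skew the statistics.
income_mean = df['income'].mean()

# Only now fill the missing values (mean, median or an extreme value like -999).
df['income'] = df['income'].fillna(income_mean)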


Image Classification: Counting Part II

Full notebook on GitHub.

In Part I, we saw a few examples of image classification. In particular counting objects seemed to be difficult for convolutional neural networks. After sharing my work on the fast.ai forums, I received a few suggestions and requests for further investigation.

The most common were:

  • Some transforms seemed unnecessary (eg. crop and zoom)
  • Some transforms might be more useful (eg. vertical flip)
  • Consider training the model from scratch (inputs come from a different distribution)
  • Try with more data
  • Try with different sizes

Sensible Transforms

After regenerating our data we can look at it:

Looks better, all transforms keep the dots centered and identical

Now we can create a learner and train it on this new dataset.
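The exact settings are in the notebook; roughly, the idea is to turn off the crop/zoom-style transforms, keep only the ones that preserve the dots, and train as before. A sketch with fastai v1’s API (the folder name is illustrative):

from fastai.vision import *
from fastai.metrics import error_rate

# Keep flips, drop the transforms that move or distort the dots.
tfms = get_transforms(do_flip=True, flip_vert=True,
                      max_rotate=0., max_zoom=1., max_lighting=0., max_warp=0.)

data = ImageDataBunch.from_folder('data/counting', train='.', valid_pct=0.2,
                                  ds_tfms=tfms, size=224)
learn = create_cnn(data, models.resnet34, metrics=error_rate)
learn.fit_one_cycle(15, max_lr=1e-4)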

Which gives us the following output:

epoch train_loss valid_loss error_rate
1 0.881368 1.027981 0.425400
2 0.522674 3.760669 0.758600
...
14 0.003345 0.000208 0.000000
15 0.002617 0.000035 0.000000

Wow! Look at that, this time we’re getting 100% accuracy. It looks like if we throw enough data at it (and use proper transforms) this is a problem that can actually be trivially solved by convolutional neural networks. I honestly did not expect that at all going into this.

Different Sizes of Objects

One drawback of our previous dataset is that the objects we’re counting are all the same size. Is it possible this is making the task too easy? Let’s try creating a dataset with circles of various sizes.
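The generation code is in the notebook; roughly it does something like this, with the radius drawn at random for each circle (a sketch, not the exact code):

import random
import matplotlib.pyplot as plt

def save_circle_image(count, path, min_radius=0.02, max_radius=0.08):
    """Save an image containing `count` circles of random size and position."""
    fig, ax = plt.subplots(figsize=(2, 2))
    for _ in range(count):
        radius = random.uniform(min_radius, max_radius)
        ax.add_patch(plt.Circle((random.random(), random.random()), radius, color='blue'))
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.axis('off')
    fig.savefig(path)
    plt.close(fig)

save_circle_image(3, 'three_circles.png')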

Which allows us to create images that look something like:

Objects of various sizes

Once again we can create a dataset this way and train a convolutional learner on it.  Complete code on GitHub.

Results:

epoch train_loss valid_loss error_rate
1 1.075099 0.807987 0.381000
2 0.613711 5.742334 0.796600
...
14 0.009446 0.000067 0.000000
15 0.001920 0.000075 0.000000

Still works! Once again I’m surprised. I had very little hope for this problem but these networks seem to have absolutely no issue with solving this.

This runs completely contrary to my expectations. I didn’t think we could count objects by classifying images. I should note that the network isn’t really “counting” anything here; it’s simply putting each image into the class it thinks the image belongs to. For example, if we showed it an image with 10 circles, it would still have to classify it as either “45”, “46”, “47”, “48” or “49”.

More generally, counting would probably make more sense as a regression problem than a classification problem. Still, this could be useful when trying to distinguish between object counts within a fixed and guaranteed range.

Image Classification with fastai

Over the last year I focused on what some call a “bottom-up” approach to studying deep learning. I reviewed linear algebra and calculus. I read Ian Goodfellow’s book “Deep Learning”. I built AlexNet, VGG and Inception architectures with TensorFlow.

While this approach helped me learn the bits and bytes of deep learning, I often felt too caught up in the details to create anything useful. For example, when reproducing a paper on superconvergence, I built my own ResNet from scratch. Instead of spending time running useful experiments, I found myself debugging my implementation and constantly unsure if I’d made some small mistake. It now looks like I did make some sort of implementation error as the paper was successfully reproduced by fast.ai and integrated into fast.ai’s framework for deep learning.

With all of this weighing on my mind I found it interesting that fast.ai advertised a “top-down” approach to deep learning. Instead of starting with the nuts and bolts of deep learning, they instead first seek to answer the question “How can you make the best/most accurate deep learning system?” and structure their course around this question.

The first lesson focuses on image classification via transfer learning. They provide a pre-trained ResNet-34 network that has learned weights using the ImageNet dataset. This has allowed it to learn various things about the natural world such as the existence of edges, corners, patterns and text.

Visualization of things early layers learn to respond to. Taken from Visualizing and Understanding Convolutional Networks

After creating a competent pet classifier they recommend that students go out and try to use the same approach on a dataset of their own creation. For my part I’ve decided to try their approach on three different datasets, each chosen to be slightly more challenging than the last:

  1. Impressionist Paintings vs. Modernist Paintings
  2. Kittens vs. Cats
  3. Counting objects

Paintings

Full notebook on GitHub.

Our first step is simply to import everything that we’ll need from the fastai library:
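With fastai v1 that boils down to a couple of lines (roughly):

from fastai.vision import *
from fastai.metrics import error_rate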

Next we’ll take a look at the data itself. I’ve saved it in data/paintings. We’ll create an ImageDataBunch which automatically knows how to read labels for our data based off the folder structure. It also automatically creates a validation set for us.
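A sketch of what that looks like, assuming one folder per class sitting directly under data/paintings:

path = Path('data/paintings')

# Labels are inferred from the folder names; 20% of the images are held out for validation.
data = ImageDataBunch.from_folder(path, train='.', valid_pct=0.2,
                                  ds_tfms=get_transforms(), size=224).normalize(imagenet_stats)
data.show_batch(rows=3)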

A few sample images from our dataset

Looking at the above images, it’s fairly easy to differentiate the solid lines of modernism from the soft edges and brush strokes of impressionist paintings. My hope is that this task will be just as easy for a pre-trained neural network that can already recognize edges and identify repeated patterns.

Now that we’ve prepped our dataset, we’ll prepare a learner and let it train for five epochs to get a sense of how well it does.
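With fastai v1 that’s roughly the following (at the time the helper was create_cnn; it was later renamed cnn_learner):

# 'data' is the ImageDataBunch created above.
learn = create_cnn(data, models.resnet34, metrics=error_rate)
learn.fit_one_cycle(5)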

epoch train_loss valid_loss error_rate
1 0.976094 0.502022 0.225000
2 0.683104 0.202733 0.100000
3 0.488111 0.158647 0.100000
4 0.383773 0.142937 0.050000
5 0.321568 0.141001 0.050000

Looking good! With virtually no effort at all we have a classifier that reaches 95% accuracy. This task proved to be just as easy as expected. In the notebook we take things a step further by choosing a better learning rate and training a little while longer, ultimately reaching 100% accuracy.

Cats vs. Kittens

Full notebook on GitHub.

The painting task ended up being as easy as we expected. For our second challenge we’re going to look at a dataset of about 180 cats and 180 kittens. Cats and kittens share many features (fur, whiskers, ears etc.) which seems like it would make this task harder. That said, a human can look at pictures of cats and kittens and easily differentiate between them.

This time our data is located in data/kittencat so we’ll go ahead and load it up.

Sample images from our kittens vs. cats dataset

Once again, let’s try a standard fastai CNN learner and run it for about 5 epochs to get a sense for how it’s doing.

epoch train_loss valid_loss error_rate
1 0.887721 0.633843 0.378788
2 0.732651 0.336768 0.136364
3 0.569540 0.282584 0.136364
4 0.492754 0.278653 0.151515
5 0.425181 0.280318 0.136364

So we’re looking at about 86% accuracy. Not quite the 95% we saw when classifying paintings but perhaps we can push it a little higher by choosing a good learning rate and running our model for longer.

Below we are going to use the “Learning Rate Finder” to (surprise, surprise) find a good learning rate. We’re looking for the portion of the plot where the loss steadily decreases.

Results of our learning rate finder

It looks like there is a sweet spot between 1e-5 and 1e-3. We’ll shoot for the ‘middle’ and just use 1e-4. We’ll also run for 15 epochs this time to allow more time for learning.
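With fastai v1 that’s roughly:

learn.lr_find()
learn.recorder.plot()  # the plot shown above

learn.fit_one_cycle(15, max_lr=1e-4)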

epoch train_loss valid_loss error_rate
1 0.216681 0.285061 0.121212
2 0.228469 0.287646 0.121212
...
14 0.148541 0.216946 0.075758
15 0.141137 0.215242 0.075758

Not bad! With a little bit of learning rate tuning, we were able to get a validation accuracy of about 92% which is much better than I expected considering we had less than 200 examples of each class. I imagine if we collected a larger dataset we could do even better.

Counting Objects

Full notebook on GitHub.

For my last task I wanted to see whether or not we could train a ResNet to “count” identical objects. So far we have seen that these networks excel at distinguishing between different objects, but can these networks also identify multiple occurrences of something?

Note: I specifically chose this task because I don’t believe it should be possible for a vanilla ResNet to accomplish this task. A typical convolutional network is set up to differentiate between classes based on the features of those classes, but there is nothing in a convolutional network that suggests to me that it should be able to count objects with identical features.

For this challenge we are going to synthesize our own dataset using matplotlib. We’ll simply generate plots with the correct number of circles in them as shown below:
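A sketch of the generation code (the full version is in the notebook): fixed-size circles at random positions, saved into one folder per count so fastai can infer the labels from the folder structure. The folder name and counts here are illustrative:

import random
import matplotlib.pyplot as plt
from pathlib import Path

def generate_dataset(counts=range(1, 6), per_class=200, out_dir='data/counting', radius=0.05):
    for count in counts:
        class_dir = Path(out_dir) / str(count)
        class_dir.mkdir(parents=True, exist_ok=True)
        for i in range(per_class):
            fig, ax = plt.subplots(figsize=(2, 2))
            for _ in range(count):
                ax.add_patch(plt.Circle((random.random(), random.random()), radius, color='blue'))
            ax.set_xlim(0, 1)
            ax.set_ylim(0, 1)
            ax.axis('off')
            fig.savefig(class_dir / f'{i}.png')
            plt.close(fig)

generate_dataset()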

An example of a generated image

There are some things to note here:

  1. When we create a dataset like this, we’re in uncharted territory as far as the pre-trained weights are concerned. Our network was trained on photographs of the natural world and expects its inputs to come from this distribution. We’re providing inputs from a completely different distribution (not necessarily a harder one!) so I wouldn’t expect transfer learning to work as flawlessly as it did in previous examples.
  2. Our dataset might be trivially easy to learn. For example, if we wrote an algorithm that simply counted the number of “blue” pixels we could very accurately figure out how many circles were present as all circles are the same size.

We don’t need to hypothesize any further, though. We can just create our ImageDataBunch and pass it to a learner to see how well it does. For now we’ll just use a dataset with 1-5 elements.

Samples from our dataset. Notice how fastai automatically performs data augmentation for us!

Let’s create our learner and see how well it does with the defaults after 3 epochs.

epoch train_loss valid_loss error_rate
1 1.350247 0.767537 0.346000
2 0.930266 0.469457 0.165000
3 0.739811 0.415282 0.136000

So without any changes we’re sitting at over 85% accuracy. This surprised me as I thought this task would be harder for our neural network as each object it was counting has identical features. If we run this experiment again with a learning rate of 1e-4 and for 15 cycles things get even better:

epoch train_loss valid_loss error_rate
1 0.657094 0.406908 0.133000
2 0.632255 0.337327 0.100000
...
14 0.236516 0.039613 0.002000
15 0.264761 0.037968 0.002000

Wow! We’ve pushed the accuracy up to 99%!

Ugh. This seems wrong to me…

I am not a deep learning pro, but every fiber of my being screams out against convolutional networks being THIS GOOD at this task. I specifically chose this task to try to find a failure case! My understanding is that these networks can identify composite features that occur in an image, but there is nothing in them that says they should be able to count (or have any notion of what counting means!).

What I would guess is happening here is that there are certain visual patterns that can only occur for a given number of circles (for example, one circle can never create a line) and that our network uses these features to uniquely identify each class. I’m not sure how to prove this but I have an idea of how we might break it. Maybe we can put so many circles on the screen that the unique patterns will become very hard to find. For example, instead of trying 1-5 circles, let’s try counting images that have 45-50 circles.

After re-generating our data (see Notebook for details) we can visualize it below:

Good luck finding visual patterns in this noise!

Now we can run our learner against this and see how it does:

epoch train_loss valid_loss error_rate
1 2.132017 2.023042 0.795833
2 1.861990 1.643421 0.711667
3 1.749233 1.663559 0.748333

Hah! That’s more like it. Now our network can only achieve ~25% accuracy which is slightly better than chance (1 in 5). Playing around with learning rate I was only able to achieve 27% on this task.

This makes more sense to me. There are no “features” in this image that would allow a network to look at it and instantly know how many circles are present. I suspect most humans also cannot glance at one of these images and know whether there are 45 or 46 elements present; we would have to fall back to a different approach and manually count them out.

Update

It turns out that we CAN make this work! We just have to use more sensible transformations. For more info see my next post: Image Classification: Counting Part II.

2018: A retrospective

At the end of last year’s retrospective, I set a number of goals for myself. It feels (really) bad to look back and realize that I did not complete a single one. I think it’s important to reflect on failures and shortcomings in order to understand them and hopefully overcome them going forward.

Goal 1: Write one blog post every week

Result: 13 posts / 52 weeks

In January 2018 I began the blog series Learn TensorFlow Now which walked users through the very basics of TensorFlow. For three months I stuck to my goal of writing one blog post every week and I’m very proud of how my published posts turned out. Unfortunately during April I took on a consulting project and my posts completely halted. Once I missed a single week I basically gave up on blogging altogether. While I don’t regret taking on a consulting project, I do regret that I used it as an excuse to stop blogging.

This year I would like to start over and try once again to write one blog post per week (off to a rough start considering it’s already the end of January!). I don’t really have a new strategy other than I will resolve not to quit entirely if I miss a week.

Goal 2: Read Deep Learning by Ian Goodfellow

Result: 300 pages / 700 pages

When I first started reading this book I was very intimidated by the first few chapters covering the background mathematics of deep learning. While my linear algebra was solid, my calculus was very weak. I put the book away for three months and grinded through Khan Academy’s calculus modules. I say “grinded” because I didn’t enjoy this process at all. Every day felt like a slog and my progress felt painfully slow. Even knowing calculus would ultimately be applicable to deep learning, I struggled to stay focused and interested in the work.

When I came back to the book in the second half of 2018 I realized it was a mistake to stop reading. While the review chapters were mathematically challenging, the actual deep learning portions were much less difficult and most of the insights could be reached without worrying about the math at all. For example, I cannot prove to you that L1 regularization results in sparse weight matrices, but I am aware that such a proof exists (at least in the case of linear regression).

This year I would like to finish this book. I think it might be worth my time to try to implement some of the basic algorithms illustrated in the book without the use of PyTorch or TensorFlow, but that will remain a stretch goal.

Goal 3: Contribute to TensorFlow

Result: 1 Contribution?

In February one of my revised PRs ended up making it into TensorFlow. Since I opened it in December of the previous year I’ve only marked it as half a contribution. Other than this PR I didn’t actively seek out any other places where I could contribute to TensorFlow.

On the plus side, I recently submitted a pull request to PyTorch. It’s a small PR that helps bring the C++ API closer to the Python API. Since it’s not yet merged I guess I should only count this as half a contribution? At least that puts me at one full contribution to deep learning libraries for the year.

Goal 4: Compete in a more Challenging Kaggle competition

Result: 0 attempts

There’s not much to say here other than that I didn’t really seek out or attempt any Kaggle competitions. In the later half of 2018 I began to focus on reinforcement learning so I was interested in other competitive environments such as OpenAI Gym and Halite.io. Unfortunately my RL agents were not very competitive when it came to Halite, but I’m hoping this year I will improve my RL knowledge and be able to submit some results to other competitions.

Goal 5: Work on HackerRank problems to strengthen my interview skills

Result: 3 months / 12 months

While I started off strong and completed lots of problems, I tapered off around the same time I stopped blogging. While I don’t feel super bad about stopping these exercises (I had started working, after all) I am a little sad because it didn’t really feel like I improved at solving questions. This remains an area I want to improve in but I don’t think I’m going to make it an explicit goal in 2019.

Goal 6: Get a job related to ML/AI

Result: 0 jobs

I did not receive (or apply to) any jobs in ML/AI during 2018. After focusing on consulting for most of the year I didn’t feel like I could demonstrate that I was proficient enough to be hired into the field. My understanding is that an end-to-end personal project is probably the best way to demonstrate true proficiency and something I want to pursue during 2019.

 

Goals for 2019

While I’m obviously not thrilled with my progress in 2018, I try not to consider failure a terminal state. I’m going to regroup and try to be more disciplined and consistent when it comes to my work this year. One activity that I’ve found both fun and productive is streaming on Twitch: I spent about 100 hours streaming and kept a pretty consistent schedule during November and December. With that in mind, my goals for 2019 are:

  • Stream programming on Twitch during weekdays
  • Write one blog post every week
  • Finish reading Deep Learning by Ian Goodfellow

LTFN 12: Bias and Variance

Part of the series Learn TensorFlow Now

In the last few posts we noticed a strange phenomenon: our test accuracy was about 10% worse than what we were getting on our training set. Let’s review the results from our last network:

Cost: 131.964
Accuracy: 11.9999997318 %
...
Cost: 0.47334
Accuracy: 83.9999973774 %
Test Cost: 1.04789093912
Test accuracy: 72.5600001812 %

Our neural network is getting ~84% accuracy on the training set but only ~73% on the test set. What’s going on and how do we fix it?

Bias and Variance

Two primary sources of error in any machine learning algorithm come from either underfitting or overfitting your training data. Underfitting occurs when an algorithm is unable to model the underlying trend of the data. Overfitting occurs when the algorithm essentially memorizes the training set but is unable to generalize and performs poorly on the test set.

Bias is error introduced by underfitting a dataset. It is characterized by poor performance on both the training set and the test set.

Variance is error introduced by overfitting a dataset. It is characterized by good performance on the training set but poor performance on the test set.

We can look at bias and variance visually by comparing the performance of our network on the training set and test set. Recall our training accuracy of 84% and test accuracy of 73%:

Visualization of bias and variance from our previous network’s results

The above image roughly demonstrates which portions of our error can be attributed to bias and variance. This visualization assumes that we could theoretically achieve 100% accuracy. In practice this may not always be the case as other sources of error (eg. noise or mislabelled examples) may creep into our dataset. As an aside, the lowest theoretical error rate on a given problem is called the Bayes Error Rate.

Reducing Error

Ideally we would have a high performance on both the test set and training set which would represent low bias and low variance. So what steps can we take to reduce each of these sources of error?

Reducing Bias

  • Create a larger neural network. Recall that high bias is a sign that our neural network is unable to properly capture the underlying trend in our dataset. In general the deeper a network, the more complex the functions it can represent.
  • Train it for a very long time. One sanity check for any neural network is to see whether or not it can memorize the dataset. A sufficiently deep neural network should be able to memorize your dataset given enough training time. Although this won’t fix any problems with variance it can be an assurance that your network isn’t completely broken in some way.
  • Use a different architecture.  Sometimes your chosen architecture may simply be unable to perform well on a given task. It may be worth considering other architectures to see if they perform better. A good place to start with Image Recognition tasks is to try different architectures submitted to previous ImageNet competitions.

Reducing Variance

  • Get more data. One nice property of neural networks is that they typically generalize better and better as you feed them more data. If your model is having problems handling out-of-sample data one obvious solution is to feed it more data.
  • Augment your existing data. While “Get more data” is a simple solution, it’s often not easy in practice. It can take months to curate, clean and verify a large dataset. One workaround is to artificially generate “new” data by augmenting your existing data. For image recognition tasks this might include flipping or rotating existing images, tweaking color settings or taking random crops of images. This is a topic we’ll explore in greater depth in future posts.
  • Regularization. High variance with low bias suggests our network has memorized the training set. Regularization describes a class of modifications we can make to our neural network that either penalizes memorization (eg. L2 regularization) or promotes redundant paths of learning in our network (ie. Dropout). We will dive deeper into various regularization approaches in future posts.
  • Use a different architecture. Like reducing bias, sometimes you get the most bang-for-your-buck when you switch architectures altogether. As the deep learning field grows, people are frequently discovering better architectures for certain tasks. Some recent papers have even suggested that the structure of a neural network is more important than any learned weights for that structure.

There’s a lot to unpack here and we’ve glossed over many of the solutions to the problems of bias and variance. In the next few posts we’re going to revisit some of these ideas and explore different areas of the TensorFlow API that allow us to tackle these problems.