Kaggle Course: Week 2 Notes

Exploratory Data Analysis

While it’s tempting to just throw all our columns into a model, it’s worth our time to try to understand the data we’re modelling. We should:

  • Look for outliers
  • Look for errors. Consider adding an Is_Incorrect column to mark rows with errors
  • Try to figure out how the data was generated
    • eg. In the Microsoft Malware competition, the test set came from a later time period than the training set. This made many version columns mostly useless for training
  • Look at data
    • df.head().T
    • For dates, find min, max and number of days for test and train
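A minimal sketch of the "look at data" and date-range checks above. The train and test frames, the "date" column name and the values are all made up for illustration:

```python
import pandas as pd

# Hypothetical train/test frames; the "date" column name is an assumption.
train = pd.DataFrame({
    "date": pd.to_datetime(["2018-01-01", "2018-01-15", "2018-02-01"]),
    "value": [1.0, 2.0, 3.0],
})
test = pd.DataFrame({
    "date": pd.to_datetime(["2018-02-02", "2018-02-10"]),
    "value": [4.0, 5.0],
})

def date_span(df, col="date"):
    """Return (min, max, number of days covered) for a datetime column."""
    lo, hi = df[col].min(), df[col].max()
    return lo, hi, (hi - lo).days

# Transposing head() shows one column per row, which is easier to scan on wide frames.
print(train.head().T)

train_span = date_span(train)
test_span = date_span(test)
print("train:", train_span)
print("test: ", test_span)

# True when the test period starts after the training period ends,
# which hints that the organizers used a time-based split.
test_is_future = test_span[0] > train_span[1]
```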

Visualizations for Individual Features

  • Histograms:
    • plt.hist(x)
    • df['Column'].hist()
  • Plot Index vs. Values
    • plt.plot(x, '.') (plots each value against its row index)
    • Look for patterns (data might not be shuffled)
  • Look at statistics for the data
    • df.describe()
    • x.mean()
    • x.var()
    • x.value_counts()
    • x.isnull()
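The statistics calls above can be tried on a small series. This is only a sketch: the values, including the missing entry and the outlier, are invented:

```python
import numpy as np
import pandas as pd

# Made-up feature with one missing value and one obvious outlier.
x = pd.Series([1.0, 2.0, 2.0, np.nan, 100.0], name="feature")

print(x.describe())          # count, mean, std, min, quartiles, max

mean = x.mean()              # NaN is skipped by default
variance = x.var()           # sample variance (ddof=1)
counts = x.value_counts()    # NaN is excluded by default
n_missing = x.isnull().sum() # isnull() gives a boolean mask; sum() counts the True entries
```

Comparing the mean with the median reported by describe() is a quick way to notice that the outlier is pulling the mean upward.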

Visualizations for Feature Interactions

  • Compare two features against one another
    • plt.scatter(x1, x2)
    • We can use this to check whether or not the distributions look the same in both the test and train set
    • Can be used to help generate new features
  • Compare multiple feature pairs against one another
    • pd.plotting.scatter_matrix(df)
  • Correlation between columns
    • df.corr()
    • plt.matshow(df.corr())
    • Consider running K-Means clustering to order the columns first
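As a rough, non-graphical companion to the correlation plot, the sketch below lists column pairs by absolute correlation. The data is synthetic and the column names are placeholders:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),  # nearly a linear function of "a"
    "c": rng.normal(size=200),                     # independent noise
})

corr = df.corr()  # Pearson correlation by default

# Keep only the upper triangle so each pair appears once,
# then sort the pairs by absolute correlation.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(key=np.abs, ascending=False)
print(pairs)

top_pair = pairs.index[0]  # the most strongly correlated pair of columns
```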

Dataset Cleaning

  • Remove duplicate columns or columns with constant values
    • train.nunique() == 1 (counts unique values per column; True marks a constant column)
    • train.T.drop_duplicates()
  • Remove duplicate categoricals by label encoding first then dropping:
    • for f in categorical_features:
        train[f] = train[f].factorize()[0]
  • Look for duplicate rows with different target values. Might be a mistake.
  • Look for identical rows in both train and test
  • Check if dataset is shuffled
    • Plot feature vs Row Index
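The column-cleaning steps above can be sketched as follows. The frame and its column names are hypothetical, chosen so that each step has something to remove:

```python
import pandas as pd

# Hypothetical training frame with a constant column, a duplicated numeric
# column, and two categoricals that encode the same grouping under different labels.
train = pd.DataFrame({
    "const": [7, 7, 7, 7],
    "num": [1, 2, 3, 4],
    "num_copy": [1, 2, 3, 4],
    "cat_a": ["x", "y", "x", "z"],
    "cat_b": ["p", "q", "p", "r"],
})

# 1. Drop columns with a single unique value.
constant_cols = train.columns[train.nunique() == 1].tolist()
train = train.drop(columns=constant_cols)

# 2. Label-encode categoricals. factorize() returns (codes, uniques), so keep
#    only the codes. Relabelled duplicates then become identical columns.
categorical_features = ["cat_a", "cat_b"]
for f in categorical_features:
    train[f] = train[f].factorize()[0]

# 3. Drop duplicate columns by transposing, dropping duplicate rows,
#    and transposing back. The first occurrence of each column is kept.
deduped = train.T.drop_duplicates().T
print(deduped.columns.tolist())
```

Transposing copies the whole frame, so on very wide or very large datasets a column-hashing approach may be preferable; the transpose version is shown because it matches the notes.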

Validation Strategies

There are a few different approaches we can use to create validation sets:

  • Holdout
    • Carve out a chunk of the dataset and test on it
    • Good when we have a large, evenly balanced dataset
    • sklearn.model_selection.ShuffleSplit
  • KFold
    • Split the dataset into chunks, use each chunk as a holdout (training from scratch each time) and average the scores
    • Good for medium-sized datasets
    • Stratification keeps the target distribution approximately the same in each fold (sklearn.model_selection.StratifiedKFold)
    • sklearn.model_selection.KFold
  • Leave-one-out
    • Choose a single example for validation, train on others, repeat.
    • Good for very small datasets
    • sklearn.model_selection.LeaveOneOut
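A small sketch of the splitters named above, run on a toy array. The data, split sizes and random seeds are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, ShuffleSplit, StratifiedKFold

# Toy data: 10 rows, balanced binary target.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# Holdout: a single random 80/20 split.
holdout = ShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, valid_idx = next(holdout.split(X))

# KFold: every row lands in exactly one validation fold.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = [len(valid) for _, valid in kf.split(X)]

# Stratified KFold: each fold keeps roughly the same class balance as y.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
positives_per_fold = [int(y[valid].sum()) for _, valid in skf.split(X, y)]

# Leave-one-out: one split per row.
n_loo_splits = LeaveOneOut().get_n_splits(X)

print(len(valid_idx), fold_sizes, positives_per_fold, n_loo_splits)
```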

Choosing Data for Validation

We need to choose how to select values for our validation set. We want to build our validation set the same way as the organizers built their test set.

  • Random, row-wise
    • Simply select random values until our validation set is filled up
  • Timewise
    • Remove the final chunk based on time
    • Used for time-series contests
  • By Id
    • Sometimes the dataset contains multiple observations per ID and the test set will consist of IDs we’ve never seen before. In this case, we want to carve out a validation set of IDs that we don’t train on.
  • Combination
    • Some combination of the above
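A sketch of the timewise and by-Id strategies above. The frame, the "user_id" and "day" column names and the cutoff are assumptions. GroupShuffleSplit is one scikit-learn splitter that keeps each ID entirely on one side of the split; the course notes do not name a specific tool for this:

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: two observations per user, each with a day stamp.
df = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3, 4, 4],
    "day": [1, 2, 1, 3, 2, 4, 3, 5],
    "target": [0, 1, 0, 1, 1, 0, 1, 0],
})

# Timewise: hold out the final chunk of time (the cutoff is an arbitrary choice here).
cutoff = 3
time_train = df[df["day"] <= cutoff]
time_valid = df[df["day"] > cutoff]

# By Id: all rows of a given user go to either train or validation, never both.
gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, valid_idx = next(gss.split(df, groups=df["user_id"]))
train_ids = set(df.loc[train_idx, "user_id"])
valid_ids = set(df.loc[valid_idx, "user_id"])

print("validation users:", valid_ids)
```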

Problems with Validation

Sometimes improvements on our validation set don’t improve our test score.

  • If we have high variance in our validation scores, we should do extensive validation
    • Average scores from different KFold splits
    • Tune model on one split, evaluate score on another split
  • If test scores don’t match validation score
    • Check if there is too little data in public test set
    • Check if we’ve overfit
    • Check if we chose proper splitting strategy
    • Check if train/test have different distributions
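One way to average scores from different KFold splits is to repeat cross-validation with several shuffling seeds. In this sketch the synthetic dataset and the logistic regression model are stand-ins, not anything prescribed by the course:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data and model, chosen only to make the example runnable.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

seed_means = []
for seed in range(5):
    # A different seed gives a different assignment of rows to folds.
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    scores = cross_val_score(model, X, y, cv=cv)  # accuracy by default for classifiers
    seed_means.append(scores.mean())

overall_mean = float(np.mean(seed_means))
spread = float(np.std(seed_means))
print(f"mean={overall_mean:.3f} std across seeds={spread:.3f}")
```

If the spread across seeds is comparable to the improvement being measured, the improvement is probably not trustworthy.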

Kaggle Course: Week 1 Notes

Recently I’ve been playing around with my first Kaggle competition. At the same time I’ve been going through the video material from Coursera’s How to Win a Data Science Competition. While all of the lectures are useful, a few contain specific, actionable pieces of advice that I’d like to catalog here.

Feature Pre-Processing and Generation


Frequently we would like to scale numeric inputs to our models in order to make learning easier. This is especially useful for non-tree based models. The most common ways to scale our numeric inputs are:

  • MinMax – Scales values between 0 and 1.
    • sklearn.preprocessing.MinMaxScaler
  • Standardization – Scales values to have mean=0 and std=1
    • Good for neural networks
    • sklearn.preprocessing.StandardScaler
  • Log Transform
    • Good for neural networks
    • np.log(1+x)
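The three scalings above, applied to a made-up column that spans several orders of magnitude:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up feature; scikit-learn scalers expect a 2D array (rows x features).
x = np.array([[1.0], [10.0], [100.0], [1000.0]])

minmax = MinMaxScaler().fit_transform(x)      # rescaled to the range [0, 1]
standard = StandardScaler().fit_transform(x)  # mean 0, standard deviation 1
logged = np.log1p(x)                          # same as np.log(1 + x)

print(minmax.ravel())
print(standard.ravel())
print(logged.ravel())
```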

We should look for outliers by plotting values. After finding them:

  • Clip our values between a chosen range. (eg. 1st and 99th percentile)
    • np.clip(x, LOWERBOUND, UPPERBOUND)
  • Rank
    • Simply order the numeric values. Automatically deals with outliers.
    • scipy.stats.rankdata
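A short sketch of both outlier treatments, using the 1st and 99th percentiles as the clipping range. The values are invented:

```python
import numpy as np
from scipy.stats import rankdata

# Made-up feature with one extreme value.
x = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])

# Clip to the 1st and 99th percentiles. Note the argument order: lower bound first.
lower, upper = np.percentile(x, [1, 99])
clipped = np.clip(x, lower, upper)

# Rank transform: the outlier simply becomes the largest rank.
ranks = rankdata(x)

print(clipped)
print(ranks)
```

On a sample this small the percentiles sit close to the extremes, so clipping changes little. On real data the bounds would usually be computed from the training set.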

Keep in mind we can train different models on differently scaled/pre-processed data and then mix the models together. This might help if we’re not 100% sure which pre-processing steps are best.


Most models need categorical data to be encoded in some way before the model can work with it. That is, we can’t just feed category strings to a model, we have to convert them to some number or vector first.

  • Label Encoding
    • Simply assign each category a number
    • sklearn.preprocessing.LabelEncoder
    • pandas.factorize
  • Frequency Encoding
    • Give each category a number based on how many times it appears in the combined train and test sets
    • eg. Map to a percentage then optionally rank
  • One-hot Encoding
    • Make a new column for each category with a single 1 and all other 0
    • Good for neural networks
    • pandas.get_dummies
    • sklearn.preprocessing.OneHotEncoder
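The three encodings above, sketched with pandas on a hypothetical "color" column. The frequency encoding is computed over the combined train and test rows, as described in the list:

```python
import pandas as pd

# Hypothetical train/test frames sharing one categorical column.
train = pd.DataFrame({"color": ["red", "blue", "red", "green"]})
test = pd.DataFrame({"color": ["red", "blue"]})
combined = pd.concat([train, test], ignore_index=True)

# Label encoding: integer codes assigned in order of first appearance.
codes, uniques = pd.factorize(combined["color"])

# Frequency encoding: map each category to its share of the combined rows.
freq = combined["color"].value_counts(normalize=True)
combined["color_freq"] = combined["color"].map(freq)

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(combined["color"], prefix="color")

print(codes)
print(combined)
print(one_hot)
```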


For date and time columns, we can add lots of useful relationships that most (all?) models struggle to capture.

  • Capture Periodicity
    • Break apart dates into Year, Day in Week, Day in Year etc.
  • Time-Since a particular event
    • Seconds passed since Jan 1, 1970
    • Days since last holiday
  • Difference between dates
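The date features above can be derived with the pandas .dt accessor. In this sketch the dates and the reference event are made up:

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(["2018-07-04", "2018-12-31"])})

# Periodicity: break the date into calendar parts.
df["year"] = df["date"].dt.year
df["day_of_week"] = df["date"].dt.dayofweek  # Monday=0 ... Sunday=6
df["day_of_year"] = df["date"].dt.dayofyear

# Time since a fixed origin: seconds passed since Jan 1, 1970.
df["epoch_seconds"] = (df["date"] - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)

# Difference between dates: days since a hypothetical reference event.
reference_event = pd.Timestamp("2018-07-04")
df["days_since_event"] = (df["date"] - reference_event).dt.days

print(df)
```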

Missing Values

  • Typically replace missing numerics with extreme values (-999), mean or median
  • Might already be replaced in the dataset
  • Adding IsNan feature can be useful
  • Replace NaNs after feature generation
    • We don’t want the replaced NaNs to have any impact on means or other features we create
  • Some frameworks/algorithms can handle NaNs
    • fastai
    • xgboost
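A minimal sketch of the missing-value handling above, on an invented column. It also shows why filling should come after feature generation: an extreme fill value drags the mean far from the observed data.

```python
import numpy as np
import pandas as pd

# Made-up numeric column with two missing entries.
df = pd.DataFrame({"x": [1.0, np.nan, 3.0, np.nan, 5.0]})

# Indicator feature marking which rows were missing.
df["x_isnan"] = df["x"].isnull().astype(int)

# Statistics computed before filling ignore the missing entries.
x_mean_before_fill = df["x"].mean()

# Two common fills: the median, or an extreme sentinel value.
df["x_median"] = df["x"].fillna(df["x"].median())
df["x_extreme"] = df["x"].fillna(-999)

# After the sentinel fill the mean no longer reflects the observed values.
mean_after_extreme_fill = df["x_extreme"].mean()

print(df)
print(x_mean_before_fill, mean_after_extreme_fill)
```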