Recently I’ve been playing around with my first Kaggle competition. At the same time I’ve been going through the video material from Coursera’s How to Win a Data Science Competition. While all of the lectures are useful, a few contain specific, actionable pieces of advice that I’d like to catalog here.
Feature Pre-Processing and Generation
Numeric
Frequently we would like to scale numeric inputs to our models in order to make learning easier. This is especially useful for non-tree-based models. The most common ways to scale our numeric inputs (sketched in code after this list) are:
- MinMax – Scales values between 0 and 1.
  - `sklearn.preprocessing.MinMaxScaler`
- Standardization – Scales values to have mean=0 and std=1.
  - Good for neural networks
  - `sklearn.preprocessing.StandardScaler`
- Log Transform
  - Good for neural networks
  - `np.log(1 + x)`
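A minimal sketch of all three transforms, assuming a 1-D numeric feature (the data here is made up):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy numeric feature; sklearn scalers expect a 2-D array.
x = np.array([1.0, 5.0, 10.0, 100.0]).reshape(-1, 1)

# MinMax: rescale to [0, 1].
minmax_scaled = MinMaxScaler().fit_transform(x)

# Standardization: rescale to mean=0, std=1.
standard_scaled = StandardScaler().fit_transform(x)

# Log transform: compresses large values. log1p(x) == log(1 + x)
# and is numerically safer for values near zero.
log_scaled = np.log1p(x)
```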
We should look for outliers by plotting values. After finding them, we can (see the sketch after this list):
- Clip our values to a chosen range (e.g. between the 1st and 99th percentiles).
  - `np.clip(x, LOWERBOUND, UPPERBOUND)` (note the order: lower bound first)
- Rank
  - Simply replace each numeric value with its rank. Automatically deals with outliers.
  - `scipy.stats.rankdata`
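Both ideas in a short sketch (the array is made up):

```python
import numpy as np
from scipy.stats import rankdata

x = np.array([-500.0, 1.0, 2.0, 3.0, 4.0, 900.0])

# Clipping: compute the 1st and 99th percentiles, then clip.
# np.clip takes the lower bound before the upper bound.
lower, upper = np.percentile(x, [1, 99])
clipped = np.clip(x, lower, upper)

# Rank transform: distances between values are discarded, so the
# outliers -500 and 900 no longer dominate.
ranked = rankdata(x)  # array([1., 2., 3., 4., 5., 6.])
```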
Keep in mind we can train different models on differently scaled/pre-processed data and then mix the models together. This might help if we’re not 100% sure which pre-processing steps are best.
Categorical
Most models need categorical data to be encoded in some way before they can work with it. That is, we can't just feed category strings to a model; we have to convert them to some number or vector first. The main options (sketched in code after this list) are:
- Label Encoding
  - Simply assign each category a number
  - `sklearn.preprocessing.LabelEncoder`
  - `pandas.factorize`
- Frequency Encoding
  - Give each category a number based on how many times it appears in the combined train and test sets
  - e.g. map each category to its fraction of rows, then optionally rank
- One-hot Encoding
  - Make a new column for each category, containing a single 1 with all other values 0
  - Good for neural networks
  - `pandas.get_dummies`
  - `sklearn.preprocessing.OneHotEncoder`
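All three encodings in one sketch, on a made-up `city` column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"city": ["NY", "SF", "NY", "LA", "SF", "NY"]})

# Label encoding: each category becomes an arbitrary integer.
df["city_label"] = LabelEncoder().fit_transform(df["city"])
# pandas alternative: codes, uniques = pd.factorize(df["city"])

# Frequency encoding: each category becomes its fraction of rows.
# (In a competition, compute this over train + test combined.)
freq = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(freq)

# One-hot encoding: one 0/1 column per category.
df = pd.concat([df, pd.get_dummies(df["city"], prefix="city")], axis=1)
```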
DateTime
From dates and times we can derive lots of useful features that most (all?) models struggle to capture on their own (see the sketch after this list).
- Capture Periodicity
- Break dates apart into Year, Day of Week, Day of Year, etc.
- Time-Since a particular event
- Seconds passed since Jan 1, 1970
- Days since last holiday
- Difference between dates
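A sketch of each idea; the column names (`purchase_date`, `last_holiday`) are made up:

```python
import pandas as pd

df = pd.DataFrame({
    "purchase_date": pd.to_datetime(["2021-01-05", "2021-03-14"]),
    "last_holiday": pd.to_datetime(["2021-01-01", "2021-03-08"]),
})

# Periodicity: break the date into calendar parts.
df["year"] = df["purchase_date"].dt.year
df["day_of_week"] = df["purchase_date"].dt.dayofweek
df["day_of_year"] = df["purchase_date"].dt.dayofyear

# Time since a fixed event: seconds since Jan 1, 1970 (the Unix epoch).
df["unix_seconds"] = df["purchase_date"].astype("int64") // 10**9

# Difference between two dates: days since the last holiday.
df["days_since_holiday"] = (df["purchase_date"] - df["last_holiday"]).dt.days
```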
Missing Values
- Typically replace missing numerics with an extreme value (e.g. -999), the mean, or the median
  - They might already be replaced in the dataset
- Adding an `IsNan` indicator feature can be useful
- Replace NaNs after feature generation (sketched below)
  - We don't want the replaced NaNs to have any impact on means or other features we create
- Some frameworks/algorithms can handle NaNs directly
  - fastai
  - xgboost
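A minimal sketch of the order of operations (column names made up): create the `IsNan` flag and any statistics first, fill the NaNs last:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [50_000.0, np.nan, 72_000.0, np.nan]})

# 1. Record which rows were missing before touching anything.
df["income_isnan"] = df["income"].isna().astype(int)

# 2. Generate features while NaNs are still NaNs; pandas skips
#    them, so the mean reflects only the observed values.
income_mean = df["income"].mean()

# 3. Only now fill the missing values (mean here; -999 or the
#    median are the other common choices).
df["income"] = df["income"].fillna(income_mean)
```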