Recently I’ve been playing around with my first Kaggle competition. At the same time I’ve been going through the video material from Coursera’s How to Win a Data Science Competition. While all of the lectures are useful, a few contain specific, actionable pieces of advice that I’d like to catalog here.
Feature Pre-Processing and Generation
Numeric
Frequently we would like to scale numeric inputs to our models in order to make learning easier. This is especially useful for non-tree-based models. The most common ways to scale our numeric inputs (sketched in code after this list) are:
- MinMax – Scales values between 0 and 1.
  - `sklearn.preprocessing.MinMaxScaler`
- Standardization – Scales values to have mean=0 and std=1.
  - Good for neural networks
  - `sklearn.preprocessing.StandardScaler`
- Log Transform
  - Good for neural networks
  - `np.log(1 + x)`
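A minimal sketch of all three transforms, assuming a 1-D numeric feature (the data here is made up):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy numeric feature; sklearn scalers expect a 2-D array.
x = np.array([1.0, 5.0, 10.0, 100.0]).reshape(-1, 1)

# MinMax: rescale to [0, 1].
minmax_scaled = MinMaxScaler().fit_transform(x)

# Standardization: rescale to mean=0, std=1.
standard_scaled = StandardScaler().fit_transform(x)

# Log transform: compresses large values. log1p(x) == log(1 + x)
# and is numerically safer for values near zero.
log_scaled = np.log1p(x)
```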
We should look for outliers by plotting values. After finding them, we can (see the sketch after this list):
- Clip our values to a chosen range (e.g. between the 1st and 99th percentiles).
  - `np.clip(x, LOWERBOUND, UPPERBOUND)` (note the order: lower bound first)
- Rank
  - Simply replace each numeric value with its rank. Automatically deals with outliers.
  - `scipy.stats.rankdata`
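Both ideas in a short sketch (the array is made up):

```python
import numpy as np
from scipy.stats import rankdata

x = np.array([-500.0, 1.0, 2.0, 3.0, 4.0, 900.0])

# Clipping: compute the 1st and 99th percentiles, then clip.
# np.clip takes the lower bound before the upper bound.
lower, upper = np.percentile(x, [1, 99])
clipped = np.clip(x, lower, upper)

# Rank transform: distances between values are discarded, so the
# outliers -500 and 900 no longer dominate.
ranked = rankdata(x)  # array([1., 2., 3., 4., 5., 6.])
```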
Keep in mind we can train different models on differently scaled/pre-processed data and then mix the models together. This might help if we’re not 100% sure which pre-processing steps are best.
Categorical
Most models need categorical data to be encoded in some way before they can work with it. That is, we can't just feed category strings to a model; we have to convert them to some number or vector first. The main options (sketched in code after this list) are:
- Label Encoding
  - Simply assign each category a number
  - `sklearn.preprocessing.LabelEncoder`
  - `pandas.factorize`
- Frequency Encoding
  - Give each category a number based on how many times it appears in the combined train and test sets
  - e.g. map each category to its fraction of rows, then optionally rank
- One-hot Encoding
  - Make a new column for each category, containing a single 1 with all other values 0
  - Good for neural networks
  - `pandas.get_dummies`
  - `sklearn.preprocessing.OneHotEncoder`
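All three encodings in one sketch, on a made-up `city` column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"city": ["NY", "SF", "NY", "LA", "SF", "NY"]})

# Label encoding: each category becomes an arbitrary integer.
df["city_label"] = LabelEncoder().fit_transform(df["city"])
# pandas alternative: codes, uniques = pd.factorize(df["city"])

# Frequency encoding: each category becomes its fraction of rows.
# (In a competition, compute this over train + test combined.)
freq = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(freq)

# One-hot encoding: one 0/1 column per category.
df = pd.concat([df, pd.get_dummies(df["city"], prefix="city")], axis=1)
```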
DateTime
From dates and times we can derive lots of useful features that most (all?) models struggle to capture on their own (see the sketch after this list).
- Capture Periodicity
- Break dates apart into Year, Day of Week, Day of Year, etc.
- Time-Since a particular event
- Seconds passed since Jan 1, 1970
- Days since last holiday
- Difference between dates
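A sketch of each idea; the column names (`purchase_date`, `last_holiday`) are made up:

```python
import pandas as pd

df = pd.DataFrame({
    "purchase_date": pd.to_datetime(["2021-01-05", "2021-03-14"]),
    "last_holiday": pd.to_datetime(["2021-01-01", "2021-03-08"]),
})

# Periodicity: break the date into calendar parts.
df["year"] = df["purchase_date"].dt.year
df["day_of_week"] = df["purchase_date"].dt.dayofweek
df["day_of_year"] = df["purchase_date"].dt.dayofyear

# Time since a fixed event: seconds since Jan 1, 1970 (the Unix epoch).
df["unix_seconds"] = df["purchase_date"].astype("int64") // 10**9

# Difference between two dates: days since the last holiday.
df["days_since_holiday"] = (df["purchase_date"] - df["last_holiday"]).dt.days
```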
Missing Values
- Typically replace missing numerics with an extreme value (e.g. -999), the mean, or the median
  - They might already be replaced in the dataset
- Adding an `IsNan` indicator feature can be useful
- Replace NaNs after feature generation (sketched below)
  - We don't want the replaced NaNs to have any impact on means or other features we create
- Some frameworks/algorithms can handle NaNs directly
  - fastai
  - xgboost
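A minimal sketch of the order of operations (column names made up): create the `IsNan` flag and any statistics first, fill the NaNs last:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [50_000.0, np.nan, 72_000.0, np.nan]})

# 1. Record which rows were missing before touching anything.
df["income_isnan"] = df["income"].isna().astype(int)

# 2. Generate features while NaNs are still NaNs; pandas skips
#    them, so the mean reflects only the observed values.
income_mean = df["income"].mean()

# 3. Only now fill the missing values (mean here; -999 or the
#    median are the other common choices).
df["income"] = df["income"].fillna(income_mean)
```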