# Kaggle Course: Week 1 Notes

Recently I’ve been playing around with my first Kaggle competition. At the same time I’ve been going through the video material from Coursera’s How to Win a Data Science Competition. While all of the lectures are useful, a few contain specific, actionable pieces of advice that I’d like to catalog here.

## Feature Pre-Processing and Generation

### Numeric

Frequently we would like to scale numeric inputs to our models in order to make learning easier. This is especially useful for non-tree-based models. The most common ways to scale our numeric inputs are:

- MinMax – scales values into the range [0, 1].
  - `sklearn.preprocessing.MinMaxScaler`
- Standardization – scales values to have mean 0 and standard deviation 1.
  - Good for neural networks
  - `sklearn.preprocessing.StandardScaler`
- Log transform
  - Good for neural networks
  - `np.log(1 + x)` (or equivalently `np.log1p(x)`)
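A minimal sketch of the three scalings above, applied to a made-up `prices` column (the array and variable names here are just for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy numeric column with values spanning several orders of magnitude.
prices = np.array([[1.0], [10.0], [100.0], [1000.0]])

# MinMax: squashes values into [0, 1].
minmax = MinMaxScaler().fit_transform(prices)

# Standardization: mean 0, standard deviation 1.
standard = StandardScaler().fit_transform(prices)

# Log transform: compresses large values; log1p also handles x == 0 safely.
logged = np.log1p(prices)
```

Note how the log transform pulls the `1000.0` much closer to the rest of the values, which is exactly why it tends to help neural networks.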

We should look for outliers by plotting values. After finding them:

- Clip values to a chosen range (e.g. the `1st` and `99th` percentiles).
  - `np.clip(x, LOWER_BOUND, UPPER_BOUND)` (lower bound comes first)
- Rank
  - Simply order the numeric values. Automatically deals with outliers.
  - `scipy.stats.rankdata`
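A quick sketch of both approaches on a made-up column with one obvious outlier:

```python
import numpy as np
from scipy.stats import rankdata

# Toy column: the last value is an outlier.
x = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])

# Clip to the 1st and 99th percentiles (lower bound first in np.clip).
lo, hi = np.percentile(x, [1, 99])
clipped = np.clip(x, lo, hi)

# Rank transform: the outlier simply becomes "largest", nothing more.
ranks = rankdata(x)  # -> array([1., 2., 3., 4., 5.])
```

With only five points the 99th percentile still sits close to the outlier, so in practice clipping works best on reasonably large columns.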

Keep in mind we can train different models on differently scaled/pre-processed data and then mix the models together. This might help if we’re not 100% sure which pre-processing steps are best.

### Categorical

Most models need categorical data to be encoded in some way before the model can work with it. That is, we can’t just feed category strings to a model, we have to convert them to some number or vector first.

- Label encoding
  - Simply assign each category a number.
  - `sklearn.preprocessing.LabelEncoder`
  - `pandas.factorize`
- Frequency encoding
  - Give each category a number based on how many times it appears in the combined train and test sets.
  - e.g. map each category to a percentage, then optionally rank.
- One-hot encoding
  - Make a new column for each category, holding a single `1` with all others `0`.
  - Good for neural networks
  - `pandas.get_dummies`
  - `sklearn.preprocessing.OneHotEncoder`
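The three encodings above, sketched on a hypothetical categorical column (the series is invented for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical column spanning train + test.
s = pd.Series(["cat", "dog", "cat", "bird", "dog", "cat"])

# Label encoding: one integer per category (alphabetical order here).
labels = LabelEncoder().fit_transform(s)

# Frequency encoding: map each category to its share of rows.
freq = s.map(s.value_counts(normalize=True))

# One-hot encoding: one 0/1 column per category.
onehot = pd.get_dummies(s)
```

Note that frequency encoding collapses categories that appear equally often into the same value, which can be a feature or a bug depending on the model.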

### DateTime

From datetime columns we can derive lots of useful relationships that most (all?) models struggle to capture on their own.

- Capture periodicity
  - Break apart dates into year, day of week, day of year, etc.
- Time since a particular event
  - Seconds passed since Jan 1, 1970
  - Days since last holiday
- Difference between dates
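A sketch of these three ideas using pandas' `.dt` accessor; the `order_date`/`ship_date` columns are made up for the example:

```python
import pandas as pd

# Hypothetical pair of date columns.
df = pd.DataFrame({
    "order_date": pd.to_datetime(["2020-01-01", "2020-03-15"]),
    "ship_date": pd.to_datetime(["2020-01-05", "2020-03-20"]),
})

# Periodicity: break a date into its cyclic parts.
df["year"] = df["order_date"].dt.year
df["day_of_week"] = df["order_date"].dt.dayofweek  # Monday == 0
df["day_of_year"] = df["order_date"].dt.dayofyear

# Time since a fixed event: seconds since the Unix epoch.
df["epoch_seconds"] = df["order_date"].astype("int64") // 10**9

# Difference between two dates, in days.
df["days_to_ship"] = (df["ship_date"] - df["order_date"]).dt.days
```

A feature like `days_to_ship` is exactly the kind of relationship a tree or linear model can't reconstruct from two raw timestamp columns.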

### Missing Values

- Typically replace missing numerics with an extreme value (e.g. -999), the mean, or the median.
  - They might already be replaced in the dataset.
- Adding an `IsNan` feature can be useful.
- Replace `NaN`s after feature generation.
  - We don't want the replaced values to skew means or other features we create.
- Some frameworks/algorithms can handle `NaN`s:
  - fastai
  - xgboost
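The points above can be sketched in a few lines; the `age` column is a made-up example, and the order matters — we flag and compute statistics *before* filling:

```python
import numpy as np
import pandas as pd

# Toy column with missing entries.
df = pd.DataFrame({"age": [25.0, np.nan, 40.0, np.nan]})

# Flag missingness first, so the signal isn't lost after filling.
df["age_isnan"] = df["age"].isna().astype(int)

# Fill with the median, computed while the NaNs were still NaNs
# so they don't bias the statistic.
df["age_filled"] = df["age"].fillna(df["age"].median())
```

Had we filled with -999 first and computed the median afterwards, the statistic would have been badly skewed — which is the whole point of replacing `NaN`s only after feature generation.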