Kaggle Course: Week 1 Notes

Recently I’ve been playing around with my first Kaggle competition. At the same time I’ve been going through the video material from Coursera’s How to Win a Data Science Competition. While all of the lectures are useful, a few contain specific, actionable pieces of advice that I’d like to catalog here.

Feature Pre-Processing and Generation

Numeric

Frequently we would like to scale numeric inputs to our models in order to make learning easier. This is especially useful for non-tree-based models. The most common ways to scale our numeric inputs are below, with a short sketch after the list:

  • MinMax – Scales values between 0 and 1.
    • sklearn.preprocessing.MinMaxScaler
  • Normalization (standardization) – Scales values to have mean=0 and std=1
    • Good for neural networks
    • sklearn.preprocessing.StandardScaler
  • Log Transform
    • Good for neural networks
    • np.log(1 + x), or the equivalent np.log1p(x)
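A rough sketch of all three transforms (the toy array x here is just for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# A toy numeric column, shape (n_samples, 1); sklearn scalers expect 2-D input
x = np.array([[1.0], [5.0], [10.0], [100.0]])

# MinMax: rescales values into [0, 1]
x_minmax = MinMaxScaler().fit_transform(x)

# Standardization: shifts and scales to mean=0, std=1
x_standard = StandardScaler().fit_transform(x)

# Log transform: np.log1p computes log(1 + x) with better numerical stability
x_log = np.log1p(x)
```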

We should look for outliers by plotting values. After finding them, we can (see the sketch after this list):

  • Clip our values to a chosen range (e.g. between the 1st and 99th percentiles)
    • np.clip(x, LOWERBOUND, UPPERBOUND) (the lower bound comes first)
  • Rank
    • Replace each value with its rank in the sorted order; this automatically limits the influence of outliers.
    • scipy.stats.rankdata
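A minimal sketch of both approaches, using a toy array with a couple of obvious outliers:

```python
import numpy as np
from scipy.stats import rankdata

x = np.array([-1000.0, 1.0, 2.0, 3.0, 4.0, 5000.0])  # two obvious outliers

# Clipping (winsorization): cap values at the 1st and 99th percentiles
lower, upper = np.percentile(x, [1, 99])
x_clipped = np.clip(x, lower, upper)

# Rank transform: each value becomes its position in the sorted order
x_ranked = rankdata(x)  # ties get the average rank by default
```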

Keep in mind we can train different models on differently scaled/pre-processed data and then mix the models together. This might help if we’re not 100% sure which pre-processing steps are best.
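For example, here is a minimal blending sketch for a regression task; X_train, y_train, and X_test are hypothetical arrays, and the two model choices are just placeholders:

```python
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import MinMaxScaler, StandardScaler

def blend_predict(X_train, y_train, X_test):
    # Model A: kNN trained on min-max scaled features
    mm = MinMaxScaler().fit(X_train)
    model_a = KNeighborsRegressor().fit(mm.transform(X_train), y_train)

    # Model B: ridge regression trained on standardized features
    ss = StandardScaler().fit(X_train)
    model_b = Ridge().fit(ss.transform(X_train), y_train)

    # Mix the models with a simple average of their predictions
    preds_a = model_a.predict(mm.transform(X_test))
    preds_b = model_b.predict(ss.transform(X_test))
    return (preds_a + preds_b) / 2
```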

Categorical

Most models need categorical data to be encoded in some way before they can work with it. That is, we can't just feed category strings to a model; we have to convert them to some number or vector first. The common encodings are below, with a combined sketch after the list.

  • Label Encoding
    • Simply assign each category a number
    • sklearn.preprocessing.LabelEncoder
    • pandas.factorize
  • Frequency Encoding
    • Give each category a number based on how many times it appears in the combined train and test sets
    • e.g. map each category to its relative frequency, then optionally rank the result
  • One-hot Encoding
    • Make a new column for each category, with a 1 in the matching column and 0s everywhere else
    • Good for neural networks
    • pandas.get_dummies
    • sklearn.preprocessing.OneHotEncoder
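A combined sketch of the three encodings, using a made-up city column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'city': ['NY', 'SF', 'NY', 'LA', 'NY', 'SF']})

# Label encoding: each category gets an arbitrary integer
df['city_label'] = LabelEncoder().fit_transform(df['city'])

# Frequency encoding: each category mapped to its relative frequency
# (in a competition this would typically be computed on train + test combined)
freq = df['city'].value_counts(normalize=True)
df['city_freq'] = df['city'].map(freq)

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df['city'], prefix='city')
```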

DateTime

We can add lots of useful relationships that most (if not all) models struggle to capture on their own (see the sketch after this list):

  • Capture Periodicity
    • Break dates apart into year, day of week, day of year, etc.
  • Time-Since a particular event
    • Seconds since Jan 1, 1970 (the Unix epoch)
    • Days since last holiday
  • Difference between dates
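A short sketch of these ideas in pandas; the purchase_date column and the holiday date are made up:

```python
import pandas as pd

df = pd.DataFrame({'purchase_date': pd.to_datetime(
    ['2017-01-01', '2017-06-15', '2017-12-24'])})

# Periodicity: break the date into its parts
df['year'] = df['purchase_date'].dt.year
df['day_of_week'] = df['purchase_date'].dt.dayofweek
df['day_of_year'] = df['purchase_date'].dt.dayofyear

# Time since a fixed event: seconds since Jan 1, 1970 (the Unix epoch)
df['unix_seconds'] = df['purchase_date'].astype('int64') // 10**9

# Difference between dates: days since a (made up) holiday
holiday = pd.Timestamp('2016-12-25')
df['days_since_holiday'] = (df['purchase_date'] - holiday).dt.days
```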

Missing Values

  • Typically we replace missing numeric values with an extreme value (e.g. -999), the mean, or the median (see the sketch after this list)
  • Missing values might already have been replaced with a placeholder in the dataset
  • Adding an IsNan indicator feature can be useful
  • Replace NaNs after feature generation
    • We don't want the fill values to distort means or other features we create
  • Some frameworks/algorithms can handle NaNs natively
    • fastai
    • xgboost
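A minimal sketch of these steps, using a made-up age column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25.0, np.nan, 40.0, np.nan, 31.0]})

# IsNan indicator: record where values were missing before filling anything
df['age_isnan'] = df['age'].isna().astype(int)

# Generate value-based features (like the mean) BEFORE filling,
# so the fill value can't distort them
age_mean = df['age'].mean()  # NaNs are skipped by default

# Then fill: an extreme value, or the mean/median
df['age_filled_extreme'] = df['age'].fillna(-999)
df['age_filled_mean'] = df['age'].fillna(age_mean)
```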
