Exploratory Data Analysis
While it’s tempting to just throw all our columns into a model, it’s worth our time to try to understand the data we’re modelling. We should:
- Look for outliers
- Look for errors. Consider adding an `Is_Incorrect` column to mark rows with errors
- Try to figure out how the data was generated
  - e.g. in the Microsoft Malware competition, the test set comes from a time in the future. This made many `version` columns mostly useless for training
- Look at the data: `df.head().T`
- For dates, find the min, max and number of days for `test` and `train` (see the sketch below)
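A minimal sketch of these first-pass checks, assuming hypothetical file names (`train.csv`, `test.csv`) and a hypothetical `date` column:

```python
import pandas as pd

# Hypothetical file and column names; substitute your competition's data.
train = pd.read_csv("train.csv", parse_dates=["date"])
test = pd.read_csv("test.csv", parse_dates=["date"])

# Transposed head makes wide datasets easier to scan column by column.
print(train.head().T)

# Date coverage of train vs. test: min, max, and number of days.
for name, df in [("train", train), ("test", test)]:
    span = df["date"].max() - df["date"].min()
    print(name, df["date"].min(), df["date"].max(), span.days, "days")
```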
Visualizations for Individual Features
- Histograms: `plt.hist(x)` or `df['Column'].hist()`
- Plot index vs. values: `plt.plot(x, target)`
- Look for patterns (data might not be shuffled)
- Look at statistics for the data: `df.describe()`, `x.mean()`, `x.var()`, `x.value_counts()`, `x.isnull()` (combined in the sketch below)
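A minimal sketch of these single-feature checks, assuming the `train` DataFrame from above and a hypothetical feature name:

```python
import matplotlib.pyplot as plt

# Hypothetical feature name; replace with a real column from train.
x = train["some_feature"]

# Distribution of the feature.
plt.hist(x.dropna(), bins=50)
plt.title("some_feature histogram")
plt.show()

# Feature vs. row index: visible bands or trends suggest the rows are not shuffled.
plt.plot(x.values, ".")
plt.title("some_feature vs. row index")
plt.show()

# Basic statistics.
print(train.describe())
print(x.mean(), x.var())
print(x.value_counts().head())
print(x.isnull().sum(), "missing values")
```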
Visualizations for Feature Interactions
- Compare two features against one another: `plt.scatter(x1, x2)`
  - We can use this to check whether or not the distributions look the same in both the `test` and `train` sets
  - Can be used to help generate new features
- Compare multiple feature pairs against one another: `pd.plotting.scatter_matrix(df)`
- Correlation between columns: `plt.matshow(df.corr())` (sketched below)
  - Consider running K-Means clustering to order the columns first
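A minimal sketch of these pairwise checks, assuming hypothetical feature names (`feature_a`, `feature_b`, `feature_c`):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical feature names; replace with real columns.
cols = ["feature_a", "feature_b", "feature_c"]

# Two features against each other, train and test overlaid, to see
# whether the joint distributions match.
plt.scatter(train["feature_a"], train["feature_b"], s=3, label="train")
plt.scatter(test["feature_a"], test["feature_b"], s=3, label="test")
plt.legend()
plt.show()

# All pairwise scatter plots for a handful of columns.
pd.plotting.scatter_matrix(train[cols])
plt.show()

# Correlation matrix as an image.
numeric = train.select_dtypes("number")
plt.matshow(numeric.corr())
plt.colorbar()
plt.show()
```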
Dataset Cleaning
- Remove duplicate columns or columns with constant values: `train.nunique() == 1` flags constant columns; `train.T.drop_duplicates().T` drops duplicate columns
- Remove duplicate categorical columns by label encoding them first, then dropping:

```python
for f in categorical_features:
    train[f] = train[f].factorize()[0]  # factorize() returns (codes, uniques)

train.T.drop_duplicates()
```
- Look for duplicate rows with different target values. Might be a mistake.
- Look for identical rows in both `train` and `test` (see the sketch below)
- Check if the dataset is shuffled
  - Plot a feature vs. row index
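A minimal sketch of the row-level checks, assuming a hypothetical `target` column and that `train` and `test` share the same feature columns:

```python
# Duplicate feature rows that disagree on the target are suspicious.
feature_cols = [c for c in train.columns if c != "target"]
dupes = train[train.duplicated(subset=feature_cols, keep=False)]
conflicts = dupes.groupby(feature_cols)["target"].nunique()
print((conflicts > 1).sum(), "duplicated rows with conflicting targets")

# Identical rows appearing in both train and test.
overlap = train[feature_cols].merge(test[feature_cols].drop_duplicates())
print(len(overlap), "train rows also present in test")
```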
Validation Strategies
There are a few different approaches we can use to create validation sets:
- Holdout: `sklearn.model_selection.ShuffleSplit`
  - Carve out a chunk of the dataset and test on it
  - Good when we have a large, evenly balanced dataset
- KFold: `sklearn.model_selection.KFold`
  - Split the dataset into chunks, use each chunk as a holdout (training from scratch each time) and average the scores
  - Good for medium-sized datasets
  - Stratification enforces an identical target distribution over each fold (`sklearn.model_selection.StratifiedKFold`)
- Leave-one-out: `sklearn.model_selection.LeaveOneOut` (all three splitters are sketched below)
  - Choose a single example for validation, train on the others, repeat
  - Good for very small datasets
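A minimal sketch of the three schemes, assuming `model`, `X`, and `y` are already defined (hypothetical here):

```python
from sklearn.model_selection import KFold, LeaveOneOut, ShuffleSplit, cross_val_score

# model, X, y are assumed to be defined already.
holdout = ShuffleSplit(n_splits=1, test_size=0.2, random_state=0)  # single 80/20 split
kfold = KFold(n_splits=5, shuffle=True, random_state=0)            # 5 folds, each held out once
loo = LeaveOneOut()                                                 # one fold per row; tiny data only

for name, cv in [("holdout", holdout), ("kfold", kfold), ("leave-one-out", loo)]:
    scores = cross_val_score(model, X, y, cv=cv)
    print(name, scores.mean())
```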
Choosing Data for Validation
We need to choose how to select values for our validation set. We want to build our validation set the same way as the organizers built their test set.
- Random, row-wise
  - Simply select random rows until our validation set is filled up
- Timewise
  - Remove the final chunk based on time (see the sketch after this list)
  - Used for time-series contests
- By ID
  - Sometimes the dataset contains multiple observations per ID and the test set consists of IDs we’ve never seen before. In this case, we want to carve out a validation set of IDs that we don’t train on.
- Combination
  - Some combination of the above
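Two of these map directly onto scikit-learn splitters; a minimal sketch, assuming hypothetical `date` and `id` columns in `train`:

```python
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

# Timewise: sort by the (hypothetical) date column and hold out the final 20%.
train_sorted = train.sort_values("date")
cutoff = int(len(train_sorted) * 0.8)
tr, val = train_sorted.iloc[:cutoff], train_sorted.iloc[cutoff:]

# TimeSeriesSplit gives several expanding-window folds instead of one cut.
tscv = TimeSeriesSplit(n_splits=5)

# By ID: GroupKFold keeps every row of an ID in the same fold, so
# validation IDs never appear in the training folds.
gkf = GroupKFold(n_splits=5)
for tr_idx, val_idx in gkf.split(train, groups=train["id"]):
    pass  # fit on train.iloc[tr_idx], validate on train.iloc[val_idx]
```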
Problems with Validation
Sometimes improvements on our validation set don’t improve our test score.
- If we have high variance in our validation scores, we should do extensive validation (see the sketch after this list)
  - Average scores from different KFold splits
  - Tune the model on one split, evaluate the score on another split
- If test scores don’t match validation scores
  - Check if there is too little data in the public test set
  - Check if we’ve overfit
  - Check if we chose a proper splitting strategy
  - Check if train and test have different distributions
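A minimal sketch of averaging scores over several differently seeded KFold splits (again assuming `model`, `X`, and `y` are defined):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score

# Repeat 5-fold CV with different shuffles and average, so that a single
# lucky or unlucky split does not drive model decisions.
seed_means = []
for seed in range(5):
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    seed_means.append(cross_val_score(model, X, y, cv=cv).mean())

print("mean score:", np.mean(seed_means), "+/-", np.std(seed_means))
```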