Exploratory Data Analysis
While it’s tempting to just throw all our columns into a model, it’s worth our time to try to understand the data we’re modelling. We should:
- Look for outliers
- Look for errors. Consider adding an `Is_Incorrect` column to mark rows with errors
- Try to figure out how the data was generated
  - e.g. in the Microsoft Malware competition, the test set comes from a time in the future. This made many `version` columns mostly useless for training
- Look at the data: `df.head().T`
- For dates, find the min, max and number of days for `test` and `train` (see the sketch below)
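A minimal sketch of these first-pass checks, assuming hypothetical file names (`train.csv`, `test.csv`) and a hypothetical `date` column:

```python
import pandas as pd

# Hypothetical file and column names; substitute your competition's data.
train = pd.read_csv("train.csv", parse_dates=["date"])
test = pd.read_csv("test.csv", parse_dates=["date"])

# Transposed head makes wide datasets easier to scan column by column.
print(train.head().T)

# Date coverage of train vs. test: min, max, and number of days.
for name, df in [("train", train), ("test", test)]:
    span = df["date"].max() - df["date"].min()
    print(name, df["date"].min(), df["date"].max(), span.days, "days")
```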
Visualizations for Individual Features
- Histograms: `plt.hist(x)` or `df['Column'].hist()`
- Plot index vs. values: `plt.plot(x, target)`
- Look for patterns (data might not be shuffled)
- Look at statistics for the data: `df.describe()`, `x.mean()`, `x.var()`, `x.value_counts()`, `x.isnull()` (combined in the sketch below)
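A minimal sketch of these single-feature checks, assuming the `train` DataFrame from above and a hypothetical feature name:

```python
import matplotlib.pyplot as plt

# Hypothetical feature name; replace with a real column from train.
x = train["some_feature"]

# Distribution of the feature.
plt.hist(x.dropna(), bins=50)
plt.title("some_feature histogram")
plt.show()

# Feature vs. row index: visible bands or trends suggest the rows are not shuffled.
plt.plot(x.values, ".")
plt.title("some_feature vs. row index")
plt.show()

# Basic statistics.
print(train.describe())
print(x.mean(), x.var())
print(x.value_counts().head())
print(x.isnull().sum(), "missing values")
```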
Visualizations for Feature Interactions
- Compare two features against one another: `plt.scatter(x1, x2)`
  - We can use this to check whether or not the distributions look the same in both the `test` and `train` sets
  - Can be used to help generate new features
- Compare multiple feature pairs against one another: `pd.plotting.scatter_matrix(df)`
- Correlation between columns: `plt.matshow(df.corr())` (sketched below)
  - Consider running K-Means clustering to order the columns first
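A minimal sketch of these pairwise checks, assuming hypothetical feature names (`feature_a`, `feature_b`, `feature_c`):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical feature names; replace with real columns.
cols = ["feature_a", "feature_b", "feature_c"]

# Two features against each other, train and test overlaid, to see
# whether the joint distributions match.
plt.scatter(train["feature_a"], train["feature_b"], s=3, label="train")
plt.scatter(test["feature_a"], test["feature_b"], s=3, label="test")
plt.legend()
plt.show()

# All pairwise scatter plots for a handful of columns.
pd.plotting.scatter_matrix(train[cols])
plt.show()

# Correlation matrix as an image.
numeric = train.select_dtypes("number")
plt.matshow(numeric.corr())
plt.colorbar()
plt.show()
```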
Dataset Cleaning
- Remove duplicate columns or columns with constant values: `train.nunique() == 1` flags constant columns; `train.T.drop_duplicates().T` drops duplicate columns
- Remove duplicate categorical columns by label encoding them first, then dropping:

```python
for f in categorical_features:
    train[f] = train[f].factorize()[0]  # factorize() returns (codes, uniques)

train.T.drop_duplicates()
```
- Look for duplicate rows with different target values. Might be a mistake.
- Look for identical rows in both `train` and `test` (see the sketch below)
- Check if the dataset is shuffled
  - Plot a feature vs. row index
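A minimal sketch of the row-level checks, assuming a hypothetical `target` column and that `train` and `test` share the same feature columns:

```python
# Duplicate feature rows that disagree on the target are suspicious.
feature_cols = [c for c in train.columns if c != "target"]
dupes = train[train.duplicated(subset=feature_cols, keep=False)]
conflicts = dupes.groupby(feature_cols)["target"].nunique()
print((conflicts > 1).sum(), "duplicated rows with conflicting targets")

# Identical rows appearing in both train and test.
overlap = train[feature_cols].merge(test[feature_cols].drop_duplicates())
print(len(overlap), "train rows also present in test")
```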
Validation Strategies
There are a few different approaches we can use to create validation sets:
- Holdout: `sklearn.model_selection.ShuffleSplit`
  - Carve out a chunk of the dataset and test on it
  - Good when we have a large, evenly balanced dataset
- KFold: `sklearn.model_selection.KFold`
  - Split the dataset into chunks, use each chunk as a holdout (training from scratch each time) and average the scores
  - Good for medium-sized datasets
  - Stratification enforces an identical target distribution over each fold (`sklearn.model_selection.StratifiedKFold`)
- Leave-one-out: `sklearn.model_selection.LeaveOneOut` (all three splitters are sketched below)
  - Choose a single example for validation, train on the others, repeat
  - Good for very small datasets
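A minimal sketch of the three schemes, assuming `model`, `X`, and `y` are already defined (hypothetical here):

```python
from sklearn.model_selection import KFold, LeaveOneOut, ShuffleSplit, cross_val_score

# model, X, y are assumed to be defined already.
holdout = ShuffleSplit(n_splits=1, test_size=0.2, random_state=0)  # single 80/20 split
kfold = KFold(n_splits=5, shuffle=True, random_state=0)            # 5 folds, each held out once
loo = LeaveOneOut()                                                 # one fold per row; tiny data only

for name, cv in [("holdout", holdout), ("kfold", kfold), ("leave-one-out", loo)]:
    scores = cross_val_score(model, X, y, cv=cv)
    print(name, scores.mean())
```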
Choosing Data for Validation
We need to choose how to select values for our validation set. We want to build our validation set the same way as the organizers built their test set.
- Random, row-wise
  - Simply select random rows until our validation set is filled up
- Timewise
  - Remove the final chunk based on time (see the sketch after this list)
  - Used for time-series contests
- By ID
  - Sometimes the dataset contains multiple observations per ID and the test set consists of IDs we’ve never seen before. In this case, we want to carve out a validation set of IDs that we don’t train on.
- Combination
  - Some combination of the above
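Two of these map directly onto scikit-learn splitters; a minimal sketch, assuming hypothetical `date` and `id` columns in `train`:

```python
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

# Timewise: sort by the (hypothetical) date column and hold out the final 20%.
train_sorted = train.sort_values("date")
cutoff = int(len(train_sorted) * 0.8)
tr, val = train_sorted.iloc[:cutoff], train_sorted.iloc[cutoff:]

# TimeSeriesSplit gives several expanding-window folds instead of one cut.
tscv = TimeSeriesSplit(n_splits=5)

# By ID: GroupKFold keeps every row of an ID in the same fold, so
# validation IDs never appear in the training folds.
gkf = GroupKFold(n_splits=5)
for tr_idx, val_idx in gkf.split(train, groups=train["id"]):
    pass  # fit on train.iloc[tr_idx], validate on train.iloc[val_idx]
```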
Problems with Validation
Sometimes improvements on our validation set don’t improve our test score.
- If we have high variance in our validation scores, we should do extensive validation (see the sketch after this list)
  - Average scores from different KFold splits
  - Tune the model on one split, evaluate the score on another split
- If test scores don’t match validation scores
  - Check if there is too little data in the public test set
  - Check if we’ve overfit
  - Check if we chose a proper splitting strategy
  - Check if train and test have different distributions
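A minimal sketch of averaging scores over several differently seeded KFold splits (again assuming `model`, `X`, and `y` are defined):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score

# Repeat 5-fold CV with different shuffles and average, so that a single
# lucky or unlucky split does not drive model decisions.
seed_means = []
for seed in range(5):
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    seed_means.append(cross_val_score(model, X, y, cv=cv).mean())

print("mean score:", np.mean(seed_means), "+/-", np.std(seed_means))
```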