Kaggle Course: Week 2 Notes

Exploratory Data Analysis

While it’s tempting to just throw all our columns into a model, it’s worth our time to try to understand the data we’re modelling. We should:

  • Look for outliers
  • Look for errors. Consider adding an Is_Incorrect column to flag rows containing them
  • Try to figure out how the data was generated
    • e.g. in the Microsoft Malware competition, the test set came from a later time period; this made many version columns mostly useless for training
  • Look at data
    • df.head().T
    • For dates, find min, max and number of days for test and train
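The steps above can be sketched as follows; the column names here are hypothetical stand-ins for a real competition dataset:

```python
import pandas as pd

# Toy frame standing in for a competition dataset
df = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-01', '2020-03-15', '2020-06-30']),
    'value': [1, 2, 3],
})

# Transposed head: features become rows, which reads better
# when there are many columns
print(df.head().T)

# Date range covered: min, max, and number of days spanned
span = df['date'].max() - df['date'].min()
print(df['date'].min(), df['date'].max(), span.days)
```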

Visualizations for Individual Features

  • Histograms:
    • plt.hist(x)
    • df['Column'].hist()
  • Plot Index vs. Values
    • plt.plot(x, target)
    • Look for patterns (data might not be shuffled)
  • Look at statistics for the data
    • df.describe()
    • x.mean()
    • x.var()
    • x.value_counts()
    • x.isnull()

Visualizations for Feature Interactions

  • Compare two features against one another
    • plt.scatter(x1, x2)
    • We can use this to check whether the distributions look the same in the train and test sets
    • Can be used to help generate new features
  • Compare multiple feature pairs against one another
    • pd.plotting.scatter_matrix(df)
  • Correlation between columns
    • df.corr()
    • plt.matshow(df.corr())
    • Consider running K-Means clustering to order the columns first
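A small sketch of the correlation check, with fabricated columns where 'b' is deliberately built from 'a':

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({'a': rng.normal(size=200)})
df['b'] = df['a'] * 2 + rng.normal(scale=0.1, size=200)  # correlated with 'a'
df['c'] = rng.normal(size=200)                           # independent

corr = df.corr()
print(corr.round(2))
# plt.matshow(corr) would render this matrix as an image
```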

Dataset Cleaning

  • Remove duplicate columns or columns with constant values
    • train.nunique() == 1
    • train.T.drop_duplicates()
  • Remove duplicate categoricals by label encoding first, then dropping (factorize() returns a (codes, uniques) tuple, so keep only the codes):
    • for f in categorical_features:
          traintest[f] = traintest[f].factorize()[0]
      traintest.T.drop_duplicates()
  • Look for duplicate rows with different target values. Might be a mistake.
  • Look for identical rows in both train and test
  • Check if dataset is shuffled
    • Plot feature vs Row Index
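The cleaning steps above can be sketched end to end; the column names and tiny frame are illustrative only:

```python
import pandas as pd

train = pd.DataFrame({
    'const':  [1, 1, 1],        # constant column
    'f1':     [1, 2, 3],
    'f1_dup': [1, 2, 3],        # exact duplicate of f1
    'cat':    ['a', 'b', 'a'],  # categorical
})

# Drop constant columns (a single unique value carries no signal)
constant_cols = train.columns[train.nunique() == 1]
train = train.drop(columns=constant_cols)

# Label-encode categoricals so duplicate columns compare equal
for f in ['cat']:
    train[f] = train[f].factorize()[0]

# Drop duplicate columns: transpose, deduplicate rows, transpose back
train = train.T.drop_duplicates().T
print(train.columns.tolist())
```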

Validation Strategies

There are a few different approaches we can use to create validation sets:

  • Holdout
    • Carve out a chunk of the dataset and test on it
    • Good when we have a large, evenly balanced dataset
    • sklearn.model_selection.ShuffleSplit
  • KFold
    • Split the dataset into chunks, use each chunk as a holdout (training from scratch each time) and average the scores
    • Good for medium-sized datasets
    • Stratification enforces identical target distribution over each fold
    • sklearn.model_selection.KFold
  • Leave-one-out
    • Choose a single example for validation, train on others, repeat.
    • Good for very small datasets
    • sklearn.model_selection.LeaveOneOut
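A minimal KFold sketch on toy data, showing that every row lands in exactly one validation fold:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 rows, 2 features
y = np.arange(10)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
folds = list(kf.split(X))
for fold, (train_idx, val_idx) in enumerate(folds):
    # train a model from scratch on train_idx, score on val_idx,
    # then average the five scores
    print(fold, val_idx)
```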

Choosing Data for Validation

We need to choose how to select values for our validation set. We want to build our validation set the same way as the organizers built their test set.

  • Random, row-wise
    • Simply select random values until our validation set is filled up
  • Timewise
    • Remove the final chunk based on time
    • Used for time-series contests
  • By Id
    • Sometimes the dataset contains multiple observations for multiple IDs and the test set will consist of IDs we’ve never seen before. In this case, we want to carve out a validation set of IDs that we don’t train on.
  • Combination
    • Some combination of the above
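A timewise split can be sketched like this, holding out the final 20% of rows by date (the frame and cutoff fraction are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.date_range('2021-01-01', periods=10, freq='D'),
    'y': range(10),
})

# Sort by time, then hold out the final chunk to mimic a
# test set drawn from the future
df = df.sort_values('date')
n_valid = int(len(df) * 0.2)
train, valid = df.iloc[:-n_valid], df.iloc[-n_valid:]
print(len(train), len(valid))
```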

Problems with Validation

Sometimes improvements on our validation set don’t improve our test score.

  • If we have high variance in our validation scores, we should do extensive validation
    • Average scores from different KFold splits
    • Tune model on one split, evaluate score on another split
  • If test scores don’t match validation score
    • Check if there is too little data in public test set
    • Check if we’ve overfit
    • Check if we chose proper splitting strategy
    • Check if train/test have different distributions
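Averaging scores over several differently-seeded KFold splits, as suggested above, might look like this (synthetic data and a logistic regression stand in for a real competition setup):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression(max_iter=1000)

# Re-running KFold with different seeds and averaging reduces the
# variance of the validation estimate
scores = []
for seed in range(3):
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    scores.append(cross_val_score(model, X, y, cv=kf).mean())

print(np.mean(scores), np.std(scores))
```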
