
Validation Checklist for Kaggle Competitions

Data Splitting Strategies

  • Random (row-wise)
  • Time-based (validation rows come after training rows)
  • By ID (the grouping ID may be hidden)
  • Combined (e.g., by ID within time windows); see the sketch after this list
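
A minimal sketch of how the first three strategies map onto scikit-learn splitters (X and groups here are hypothetical stand-ins, and rows are assumed sorted by time for the time-based split):

    import numpy as np
    from sklearn.model_selection import GroupKFold, KFold, TimeSeriesSplit

    X = np.random.rand(100, 5)                   # stand-in feature matrix
    groups = np.random.randint(0, 20, size=100)  # stand-in entity IDs

    # Random: rows are shuffled independently of time or ID.
    random_folds = KFold(n_splits=5, shuffle=True, random_state=0).split(X)

    # Time-based: each validation fold lies strictly after its training fold.
    time_folds = TimeSeriesSplit(n_splits=5).split(X)

    # By ID: all rows sharing an ID land on the same side of the split.
    group_folds = GroupKFold(n_splits=5).split(X, groups=groups)

    # Combined strategies chain these ideas, e.g. grouping by ID within
    # consecutive time windows.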

Notes

  • Make sure the train/validation split follows the same strategy as the organizers' train/test split.
  • Models trained and evaluated under different splitting strategies can show large performance gaps.
  • The logic of feature generation depends on the splitting strategy (e.g., time-based features must not peek into the future under a time-based split); see the target-encoding sketch after this list.
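
As an example of split-dependent feature logic, here is a minimal sketch of out-of-fold mean target encoding: the encoding for each validation fold is computed only from that fold's training rows, so the feature follows the same split used for validation (the column names "category" and "target" are hypothetical):

    import pandas as pd
    from sklearn.model_selection import KFold

    df = pd.DataFrame({
        "category": list("ababcacbcb") * 10,  # toy categorical feature
        "target": [0, 1] * 50,                # toy binary target
    })

    df["cat_te"] = float("nan")
    for tr, va in KFold(n_splits=5, shuffle=True, random_state=0).split(df):
        # Category means come from the training rows only, then get mapped
        # onto the validation rows: no target leakage across the split.
        means = df.iloc[tr].groupby("category")["target"].mean()
        df.loc[df.index[va], "cat_te"] = df.loc[df.index[va], "category"].map(means)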

Validation Problems

  • Validation stage
    • Causes of differing scores and optimal parameters across splits
      • Too little data
      • Too diverse and inconsistent data
    • Solutions
      • Average scores from different K-Fold splits (see the averaging sketch at the end of this section)
      • Tune the model on one split and evaluate its score on another
  • Submission stage
    • We can observe that
      • LB score is consistently higher/lower than validation score
      • LB score is not correlated with validation score at all
    • Causes
      • Scores may already differ widely across K-Fold splits
        • Make sure the train/validation split is correct
      • Too little data in the public LB
        • Just trust your validation scores
      • Train and test data come from different distributions
        • Classes appear in the test set that do not appear in the train set
          • Shift your predictions by the difference in target means (mean of train minus mean of test, estimated via LB probing); see the shift sketch at the end of this section
        • The class ratios differ between train and test
          • Make the validation class ratio match the test class ratio; see the resampling sketch at the end of this section
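
For the "average scores from different K-Fold splits" advice above, a minimal sketch: repeat cross-validation with several seeds and look at the mean and spread rather than at any single split (the model and dataset are placeholders):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold, cross_val_score

    X, y = make_classification(n_samples=500, random_state=0)
    model = LogisticRegression(max_iter=1000)

    scores = []
    for seed in range(5):  # several differently seeded splits
        cv = KFold(n_splits=5, shuffle=True, random_state=seed)
        scores.extend(cross_val_score(model, X, y, cv=cv, scoring="roc_auc"))

    # A wide std here signals the "too little / too diverse data" problem.
    print(f"mean={np.mean(scores):.4f}  std={np.std(scores):.4f}")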
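
And a minimal sketch of the two distribution-shift fixes from the submission-stage list; the train/test means and the test class ratio are assumed values you would have to estimate (e.g., by LB probing):

    import numpy as np
    import pandas as pd

    # 1) Shift predictions by subtracting (mean of train minus mean of test).
    preds = np.random.rand(1000)        # stand-in model predictions
    train_mean, test_mean = 0.42, 0.35  # test_mean assumed from LB probing
    preds_shifted = preds - (train_mean - test_mean)

    # 2) Downsample the positive class so the validation set's positive
    #    ratio matches the assumed test ratio.
    val = pd.DataFrame({"y": np.random.randint(0, 2, size=1000)})
    test_pos_ratio = 0.10               # assumed test class ratio
    pos, neg = val[val["y"] == 1], val[val["y"] == 0]
    n_pos = int(test_pos_ratio * len(neg) / (1 - test_pos_ratio))
    val_matched = pd.concat([pos.sample(n=n_pos, random_state=0), neg])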

Expect an LB shuffle because of

  • Randomness
  • Too little data
  • Different public/private distributions