Leakage

Leakage = your model sees information during training that it wouldn’t have at prediction time.

It is often subtle, and it makes validation/test results look “too good”: offline scores are inflated and then drop in production.

Common leakage examples

(A) Preprocessing leakage

  • You compute mean/std for normalization using the entire dataset (including val/test).
  • Or you fit PCA on the whole dataset.

Fix:

  • Fit preprocessing only on train → apply those same parameters to val/test.
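A minimal sketch of the fix above, using only the Python standard library. The helper names (`fit_standardizer`, `apply_standardizer`) and the toy numbers are made up for illustration; the point is that mean/std come from train alone and are then reused unchanged on val/test.

```python
from statistics import mean, pstdev

def fit_standardizer(train_values):
    """Compute normalization parameters from the training split only."""
    mu = mean(train_values)
    sigma = pstdev(train_values)
    # Guard against a constant feature to avoid division by zero.
    if sigma == 0:
        sigma = 1.0
    return mu, sigma

def apply_standardizer(values, mu, sigma):
    """Apply already-fitted parameters; nothing is re-estimated here."""
    return [(v - mu) / sigma for v in values]

# Toy data (illustrative only): val/test come from a shifted range.
train = [1.0, 2.0, 3.0, 4.0]
val = [10.0, 12.0]
test = [100.0]

# Correct: fit on train, then reuse the same parameters everywhere.
mu, sigma = fit_standardizer(train)
train_scaled = apply_standardizer(train, mu, sigma)
val_scaled = apply_standardizer(val, mu, sigma)
test_scaled = apply_standardizer(test, mu, sigma)

# Leaky: statistics computed over train + val + test.
leaky_mu = mean(train + val + test)

print("train-only mean:", mu)
print("leaky mean:", leaky_mu)
```

The same fit-on-train, apply-everywhere pattern holds for any fitted transform, PCA included. With cross-validation, the fit has to be repeated inside each fold on that fold's training portion.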

(B) Feature leakage

  • A feature directly or indirectly contains future info or the label.
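    • e.g., “account status after 7 days” used to predict churn at day 7 (the value is only known after the prediction is made)
    • e.g., “final price” used to predict “will it go up” at decision time (the final price already reveals the answer)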
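One possible guard against feature leakage is to record when each feature's value becomes known and drop anything that is not available by prediction time. The sketch below assumes such metadata exists; the feature names and day offsets are hypothetical, loosely based on the churn example above.

```python
# Hypothetical metadata: the day (counted from signup) on which each
# feature's value becomes known. These names and numbers are illustrative.
feature_known_on_day = {
    "signup_plan": 0,
    "logins_days_1_to_7": 7,
    "account_status_after_7_days": 8,  # only known after the prediction
}

prediction_day = 7  # we predict churn at day 7

def usable_features(known_on_day, cutoff_day):
    """Keep only features whose value is known on or before the cutoff."""
    return sorted(
        name for name, day in known_on_day.items() if day <= cutoff_day
    )

safe = usable_features(feature_known_on_day, prediction_day)
dropped = sorted(set(feature_known_on_day) - set(safe))

print("safe:", safe)
print("dropped:", dropped)
```

A check like this only catches leakage that the availability metadata describes. A feature that is a disguised copy of the label still needs a manual review, for example by asking why a single feature is suspiciously predictive.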

(C) Time leakage

  • Randomly shuffling time-series data before splitting puts future observations in train, so the model is evaluated on periods it has effectively already seen.

Fix:

  • Use a time-based split (train on earlier data, validate on later data) or walk-forward validation.
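A minimal sketch of walk-forward validation with an expanding training window, assuming the samples are already sorted by time. The function name and its parameters (`n_folds`, `min_train`) are made up for this illustration.

```python
def walk_forward_splits(n_samples, n_folds, min_train):
    """Yield (train_idx, val_idx) index lists for time-ordered data.

    Each fold trains on everything before a cutoff and validates on the
    block right after it, so validation always lies in the future
    relative to training.
    """
    fold_size = (n_samples - min_train) // n_folds
    if fold_size < 1:
        raise ValueError("not enough samples for the requested folds")
    for k in range(n_folds):
        train_end = min_train + k * fold_size
        val_end = train_end + fold_size
        yield list(range(train_end)), list(range(train_end, val_end))

splits = list(walk_forward_splits(n_samples=10, n_folds=3, min_train=4))

for train_idx, val_idx in splits:
    # No validation index ever precedes a training index.
    assert max(train_idx) < min(val_idx)
    print("train:", train_idx, "val:", val_idx)
```

With these toy settings the first fold trains on indices 0 to 3 and validates on 4 and 5, and each later fold extends the training window by one block. scikit-learn's `TimeSeriesSplit` offers a similar expanding-window scheme if a library implementation is preferred.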

Interview line:
“Leakage is any path where information that won’t be available at prediction time influences training, whether via features, preprocessing, or time ordering. I fit all transforms on train only and use time-aware splits when needed.”