Leakage
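Leakage = your model gets access, during training or evaluation, to information it wouldn’t have at prediction time.
It is often subtle, and it makes validation/test scores look better than real-world performance (“too good to be true”).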
Common leakage examples
(A) Preprocessing leakage
- You compute mean/std for normalization using the entire dataset (including val/test).
- Or you do PCA on the whole dataset.
Fix:
- Fit preprocessing only on train → apply those same parameters to val/test.
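A minimal sketch of the fit-on-train, apply-everywhere pattern, using only the standard library. The helper names `fit_standardizer` and `apply_standardizer` and the toy numbers are made up for illustration; they are not from any particular library.

```python
from statistics import mean, pstdev

def fit_standardizer(train_values):
    """Learn normalization parameters from the training split only."""
    mu = mean(train_values)
    sigma = pstdev(train_values)
    # Guard against a constant feature to avoid dividing by zero.
    return (mu, sigma if sigma > 0 else 1.0)

def apply_standardizer(values, params):
    """Apply previously fitted parameters; nothing is re-estimated here."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]

# Toy data, invented for this example.
train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [10.0, 12.0]

params = fit_standardizer(train)                  # fitted on train only
train_scaled = apply_standardizer(train, params)
test_scaled = apply_standardizer(test, params)    # reuses train's mean/std

# Leaky version, for contrast: statistics computed on train + test.
leaky_params = fit_standardizer(train + test)
print("train-only params:", params)
print("leaky params:     ", leaky_params)
```

The two parameter sets differ, which is exactly the information that would leak from test into training. In scikit-learn the same idea is usually expressed as calling `fit` on the train split and `transform` on val/test.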
(B) Feature leakage
- A feature directly or indirectly contains future info or the label.
- e.g., “account status after 7 days” used to predict churn at day 7
- e.g., “final price” used to predict “will it go up” at decision time
Fix:
- Keep only features whose values are known at prediction time.
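One way to guard against feature leakage is to record when each feature becomes known and drop anything that arrives after the prediction is made. The sketch below assumes such a catalog exists; the feature names, the day numbers, and the helper `partition_features` are all hypothetical.

```python
PREDICTION_DAY = 7  # day on which the churn prediction is made

# Hypothetical feature catalog: name -> first day the value is actually known.
FEATURE_AVAILABLE_ON_DAY = {
    "logins_days_1_to_7": 7,
    "support_tickets_days_1_to_7": 7,
    "account_status_day_14": 14,  # only known after the prediction: a leak
}

def partition_features(available_on_day, prediction_day):
    """Split feature names into (usable, leaky) by availability day."""
    usable = sorted(
        name for name, day in available_on_day.items() if day <= prediction_day
    )
    leaky = sorted(
        name for name, day in available_on_day.items() if day > prediction_day
    )
    return usable, leaky

usable, leaky = partition_features(FEATURE_AVAILABLE_ON_DAY, PREDICTION_DAY)
print("usable:", usable)
print("leaky: ", leaky)
```

This only catches leaks that show up in the timestamps. A feature that is a proxy for the label still needs a manual review of how it is produced.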
(C) Time leakage
- Randomly shuffling a time series before splitting puts future observations in train, so the model is tested on a period it has effectively already seen.
Fix:
- Use a time-based split (train on earlier data, test on later data) or walk-forward validation.
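A rough sketch of a walk-forward split with an expanding training window. It assumes the rows are already sorted by time, so indices stand in for timestamps; the function name `walk_forward_splits` and its parameters are illustrative, not a library API.

```python
def walk_forward_splits(n_samples, n_folds, min_train):
    """Yield (train_idx, test_idx) pairs for time-ordered data.

    Every training index precedes every test index, and the training
    window grows by one fold each step.
    """
    fold_size = (n_samples - min_train) // n_folds
    if fold_size < 1:
        raise ValueError("not enough samples for the requested folds")
    for k in range(n_folds):
        train_end = min_train + k * fold_size
        test_end = train_end + fold_size
        yield list(range(train_end)), list(range(train_end, test_end))

splits = list(walk_forward_splits(n_samples=10, n_folds=3, min_train=4))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Each fold tests only on data that comes after everything it trained on, which is the property a random shuffle destroys. scikit-learn's `TimeSeriesSplit` offers a similar expanding-window scheme.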
Interview line:
“Leakage is any path where information that won’t be available at prediction time, including validation/test data, influences training, whether via features, preprocessing, or time. I fit all transforms on train only and use time-aware splits when needed.”