Leakage

Leakage = your model got access to information it wouldn’t have at prediction time.

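It often happens subtly and makes validation/test results look “too good,” so the reported score overstates how the model will perform on genuinely new data.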

Common leakage examples

(A) Preprocessing leakage

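You compute the mean/std for normalization on the entire dataset (including val/test), so statistics from held-out data shape the training inputs.
The same applies to fitting PCA (or any other learned transform) on the whole dataset.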

Fix:

Fit preprocessing only on train → apply those same parameters to val/test.
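
A minimal sketch of the fix, using only the Python standard library. The helper names (fit_standardizer, apply_standardizer) and the toy numbers are made up for illustration; libraries such as scikit-learn express the same idea through separate fit and transform steps.

```python
from statistics import mean, pstdev

def fit_standardizer(train_values):
    """Compute normalization parameters from the training split only."""
    mu = mean(train_values)
    sigma = pstdev(train_values)
    if sigma == 0:
        sigma = 1.0  # constant feature: avoid division by zero
    return mu, sigma

def apply_standardizer(values, params):
    """Apply previously fitted parameters; nothing is re-estimated here."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]

# Hypothetical toy data: one numeric feature, already split.
train = [1.0, 2.0, 3.0, 4.0]
test = [10.0, 12.0]

params = fit_standardizer(train)                # fitted on train only
train_scaled = apply_standardizer(train, params)
test_scaled = apply_standardizer(test, params)  # reuses the train parameters

# Leaky variant, for contrast: the test values shift the mean and std.
leaky_params = fit_standardizer(train + test)
```

Here params and leaky_params differ, which is the leak made visible: in the leaky variant the test values have changed what the model sees during training.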

(B) Feature leakage

A feature directly or indirectly contains future info or the label.
- e.g., “account status after 7 days” used to predict churn at day 7
- “final price” used to predict “will it go up” at decision time
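
One way to guard against this is to record when each feature becomes known and drop anything that is not yet available at prediction time. The sketch below assumes a hand-maintained catalog; the feature names and day offsets are hypothetical.

```python
# Hypothetical catalog: the day (relative to signup) on which each feature
# becomes known. In a real project this would come from the data's documentation.
FEATURE_AVAILABLE_DAY = {
    "signup_plan": 0,
    "logins_days_0_to_6": 6,
    "account_status_day_14": 14,  # recorded after the prediction point
}

def usable_features(available_day, prediction_day):
    """Return the features already known on prediction_day, sorted by name."""
    return sorted(
        name for name, day in available_day.items() if day <= prediction_day
    )

# Predicting churn at day 7: the day-14 status must be excluded.
features = usable_features(FEATURE_AVAILABLE_DAY, prediction_day=7)
```

This only catches timing problems. A feature that encodes the label itself still has to be spotted by reading how the feature is defined.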

(C) Time leakage

Randomly shuffling time-series data before splitting lets future observations land in train, so the model learns patterns it could not have known at prediction time.

Fix:

Use a time-based split (train on earlier data, evaluate on later data) or walk-forward validation.
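
A rough sketch of walk-forward splitting with an expanding training window, assuming the samples are already sorted by time. The function name and parameters are illustrative; scikit-learn's TimeSeriesSplit provides a comparable ready-made splitter.

```python
def walk_forward(n_samples, n_folds, min_train):
    """Yield (train_indices, test_indices) pairs.

    Assumes index order equals time order. Each fold trains on everything
    before its test block, so test data is always later than train data.
    """
    fold_size = (n_samples - min_train) // n_folds
    if fold_size < 1:
        raise ValueError("not enough samples for the requested folds")
    for k in range(n_folds):
        train_end = min_train + k * fold_size
        test_end = train_end + fold_size
        yield list(range(train_end)), list(range(train_end, test_end))

# Hypothetical sizes: 10 time-ordered samples, 3 folds, at least 4 for training.
folds = list(walk_forward(n_samples=10, n_folds=3, min_train=4))

for train_idx, test_idx in folds:
    # No test index ever precedes a train index.
    assert max(train_idx) < min(test_idx)
```

A single time-based split is the special case of one fold: everything before a cutoff is train, everything after is test.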

Interview line:
“Leakage is any path where information that won’t be available at prediction time, including validation/test data, influences training: via features, preprocessing, or time. I fit all transforms on train only and use time-aware splits when needed.”

🛠️ Steven Gong
