Class Imbalance
How do we handle class imbalance?
Class imbalance = your model can get “high accuracy” by mostly predicting the majority class, but still be useless on the minority class. You handle it by changing (1) how you measure, (2) how you train, and sometimes (3) the data.
1) Fix the metric first (so you don’t lie to yourself)
Accuracy is often the wrong metric.
Use:
- Precision / Recall (or F1) for the minority class
- PR-AUC (often better than ROC-AUC when positives are rare)
- Balanced accuracy or macro-F1 (treat classes equally)
- Confusion matrix + per-class recall
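A minimal scikit-learn sketch of these metrics (assuming some fitted binary classifier `clf` and held-out arrays `X_val`, `y_val`; the names are placeholders):

```python
from sklearn.metrics import (average_precision_score, classification_report,
                             confusion_matrix, f1_score)

# clf, X_val, y_val are assumed to exist; y_val uses 1 for the rare (positive) class
proba = clf.predict_proba(X_val)[:, 1]          # score for the minority class
preds = (proba >= 0.5).astype(int)              # default threshold, revisited below

print(classification_report(y_val, preds, digits=3))        # per-class precision/recall/F1
print("PR-AUC:", average_precision_score(y_val, proba))      # average precision ~ PR-AUC
print("macro-F1:", f1_score(y_val, preds, average="macro"))
print(confusion_matrix(y_val, preds))                        # rows = true, cols = predicted
```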
2) Change the decision threshold (almost always needed)
Most classifiers output a probability (or score). The default threshold of 0.5 is arbitrary.
- If you care about catching positives: lower threshold → higher recall (more false positives).
- If you care about avoiding false alarms: raise threshold → higher precision.
In interviews: “I’d tune the threshold on a validation set to hit a target recall or precision, depending on the business cost.”
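A sketch of threshold tuning with `precision_recall_curve` (again assuming validation labels `y_val` and minority-class scores `proba`; the 0.90 recall target is just an example):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_val, proba)

target_recall = 0.90                                 # example business requirement
# thresholds has one fewer entry than precision/recall, so drop the final point
ok = np.where(recall[:-1] >= target_recall)[0]
best = ok[np.argmax(precision[ok])]                  # best precision among thresholds meeting the recall target
print(f"threshold={thresholds[best]:.3f}  "
      f"precision={precision[best]:.3f}  recall={recall[best]:.3f}")
```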
3) Make training “care” about the minority class
Cost-sensitive learning / class weights
- Logistic regression / SVM / neural nets: weight minority class more in the loss.
- In PyTorch: weighted cross-entropy, focal loss, etc.
- In tree models: class_weight="balanced" (or similar).
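A minimal class-weighting sketch (the 1:10 weight below is an illustrative placeholder, not a tuned value):

```python
# scikit-learn: many estimators accept class_weight directly
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(class_weight="balanced", max_iter=1000)   # weights ~ 1 / class frequency

# PyTorch: weighted cross-entropy; the 1:10 ratio is just a placeholder
import torch
import torch.nn as nn
class_weights = torch.tensor([1.0, 10.0])        # up-weight the rare class
criterion = nn.CrossEntropyLoss(weight=class_weights)
```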
Focal loss (common for extreme imbalance)
- Down-weights easy majority examples, focuses on hard/rare ones.
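A short binary focal-loss sketch in PyTorch (alpha and gamma are the commonly cited defaults, not tuned values):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss on raw logits; targets are 0/1 floats."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                                     # model's probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)   # class-balancing term
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()        # (1 - p_t)^gamma down-weights easy examples
```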
4) Resample the data (useful, but can bite you)
Oversampling
- Duplicate minority samples or use SMOTE (Synthetic Minority Over-sampling Technique) to generate synthetic minority examples.
- Risk: overfitting to duplicated minority points; SMOTE can create unrealistic synthetic samples, especially with noisy or high-dimensional features.
Undersampling
- Drop majority samples.
- Risk: throwing away useful information.
Rule of thumb: start with class weights; add resampling if needed.
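If resampling is needed, a sketch using the imbalanced-learn package (assuming `X_train`, `y_train` exist; the sampling ratios are illustrative). Resample only the training split, never the validation/test data:

```python
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Oversample the minority class up to 50% of the majority, then trim the majority a bit
X_over, y_over = SMOTE(sampling_strategy=0.5, random_state=0).fit_resample(X_train, y_train)
X_res, y_res = RandomUnderSampler(sampling_strategy=0.8, random_state=0).fit_resample(X_over, y_over)
```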
5) Use models/algorithms that handle imbalance well
- Tree ensembles (Random Forest / XGBoost / LightGBM) + proper class weights often work well.
- For anomaly-like problems: one-class methods or anomaly detection can be appropriate.
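For gradient-boosted trees, a sketch with XGBoost's scale_pos_weight (assuming binary 0/1 labels in `y_train`; the negatives-to-positives ratio is the usual starting point):

```python
import numpy as np
from xgboost import XGBClassifier

neg, pos = np.bincount(y_train)                    # counts of class 0 and class 1
clf = XGBClassifier(scale_pos_weight=neg / pos,    # up-weight positives in the loss
                    eval_metric="aucpr")           # track PR-AUC instead of accuracy/logloss
clf.fit(X_train, y_train)
```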
6) Use the right validation split (common gotcha)
- Use stratified splits so class proportions are preserved.
- If the data is time-ordered, avoid leakage: split by time rather than randomly, and still monitor minority-class metrics.
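A stratified-split sketch in scikit-learn (`X`, `y` are placeholders for the full feature matrix and labels):

```python
from sklearn.model_selection import StratifiedKFold, train_test_split

# Hold-out split that preserves the class ratio in both parts
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Stratified cross-validation for model selection / threshold tuning
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
```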
7) Calibrate probabilities (if you care about “real” probabilities)
Imbalance handling (class weights, resampling) shifts predicted probabilities away from the true base rates, so raw scores are often poorly calibrated.
- Use Platt scaling / isotonic regression on a validation set.
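A calibration sketch with scikit-learn's CalibratedClassifierCV (the base model is just an example; method="sigmoid" is Platt scaling, "isotonic" needs more data):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier

base = RandomForestClassifier(class_weight="balanced", random_state=0)
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=5)   # Platt scaling via internal CV
calibrated.fit(X_train, y_train)
proba = calibrated.predict_proba(X_val)[:, 1]    # better-calibrated minority-class probabilities
```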
What I’d say in an interview (clean answer)
“I don’t trust accuracy. I pick a metric like PR-AUC / F1 and tune the decision threshold based on costs. During training I use class-weighted loss (or focal loss if it’s extreme). If that’s not enough, I consider resampling (careful with overfitting) and validate with stratified splits.”