Boosting
Boosting combines many weak learners (55% accuracy, barely better than random) into a strong learner (90%+) by iteratively reweighting the dataset.
Why reweight instead of resample like bagging?
Bagging reduces variance by averaging independent-ish learners. Boosting targets bias: each new learner focuses on the mistakes of the previous ones, so the committee as a whole can represent functions none of them could alone.
Unlike bagging, boosting is inherently sequential.
Iterative process:
- Train a classifier on the current weights
- Downweight points it gets right, upweight points it gets wrong
- Train the next classifier on the new weighted dataset
- Final prediction: weighted vote of all learners
Detour: Online Learning with Experts (Hedge)
Before AdaBoost, consider the prediction-with-experts setting. For $t = 1, \dots, T$:
- Algorithm chooses a distribution $p^{(t)}$ over $N$ experts
- Adversary reveals loss vector $\ell^{(t)} \in [0, 1]^N$
- Algorithm incurs loss $\langle p^{(t)}, \ell^{(t)} \rangle$
Goal: minimize total loss $\sum_{t=1}^{T} \langle p^{(t)}, \ell^{(t)} \rangle$, competing with the best single expert in hindsight, $\min_i \sum_{t=1}^{T} \ell_i^{(t)}$.
Hedge Algorithm
Parameter $\beta \in (0, 1)$:
- Init $w_i^{(1)} = 1$ for all $i \in [N]$
- For $t = 1, \dots, T$:
  - Choose $p^{(t)} = w^{(t)} / \sum_i w_i^{(t)}$
  - Suffer loss $\langle p^{(t)}, \ell^{(t)} \rangle$
  - Update $w_i^{(t+1)} = w_i^{(t)} \cdot \beta^{\ell_i^{(t)}}$ (downweight bad experts)
Guarantee (with $\beta$ tuned as a function of $T$ and $N$):

$$\sum_{t=1}^{T} \langle p^{(t)}, \ell^{(t)} \rangle \le \min_i \sum_{t=1}^{T} \ell_i^{(t)} + O\left(\sqrt{T \ln N}\right)$$

where:
- $\beta$ is the learning-rate-like parameter
- $N$ is the number of experts, $T$ the number of rounds
Average regret $O(\sqrt{\ln N / T})$ goes to 0 as $T \to \infty$. Extremely general, used in linear programming, game theory, etc.
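The loop above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the lecture; the "losses as a $T \times N$ matrix" interface is my own framing of the adversary:

```python
import numpy as np

def hedge(losses, beta=0.9):
    """Run Hedge on a T x N matrix of per-round expert losses in [0, 1].

    beta in (0, 1) is the downweighting parameter. Returns the
    algorithm's total loss and the best single expert's total loss.
    """
    T, N = losses.shape
    w = np.ones(N)                      # init: uniform weights
    total = 0.0
    for t in range(T):
        p = w / w.sum()                 # distribution over experts
        total += p @ losses[t]          # suffer expected loss under p
        w *= beta ** losses[t]          # downweight experts with high loss
    best = losses.sum(axis=0).min()     # best expert in hindsight
    return total, best
```

On a toy instance where one expert is always right, Hedge's weight concentrates on it geometrically, so total loss stays bounded no matter how long the game runs.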
AdaBoost: Hedge on Datapoints
The insight: instead of treating classifiers as experts, treat datapoints as experts. Upweight points that haven't been learned yet.
Intuition
Each round, the algorithm stares at which points the current committee still botches and says "next learner, focus here." Easy points (already correct) get turned down; hard points get turned up until some weak learner picks up on them. You end up with a committee where each member specializes in a different slice of the hard cases, and the weighted vote stitches the specialists together.
- Init weights $w_i^{(1)} = 1$ for all $i \in [n]$
- For $t = 1, \dots, T$:
  - Normalize: $p^{(t)} = w^{(t)} / \sum_i w_i^{(t)}$
  - Run WeakLearn on the weighted training set to get classifier $h_t$
  - Compute weighted error $\epsilon_t = \sum_i p_i^{(t)} \, \mathbb{1}[h_t(x_i) \neq y_i]$ (should be $\leq \frac{1}{2} - \gamma$ by the weak learner guarantee)
  - Let $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$ and update $w_i^{(t+1)} = w_i^{(t)} \cdot e^{-\alpha_t y_i h_t(x_i)}$ (upweight points misclassified by $h_t$)
- Final classifier: weighted vote, $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$
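A minimal NumPy sketch of this loop, using decision stumps (single-feature thresholds) as the weak learner. The stump search and function names are illustrative, not from the slides:

```python
import numpy as np

def weak_learn(X, y, p):
    """Return the decision stump (feature, threshold, sign) with the
    lowest weighted error under the distribution p."""
    best_err, best = np.inf, None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] <= thr, 1, -1)
                err = p @ (pred != y)
                if err < best_err:
                    best_err, best = err, (j, thr, sign)
    return best

def stump_predict(stump, X):
    j, thr, sign = stump
    return sign * np.where(X[:, j] <= thr, 1, -1)

def adaboost(X, y, T=20):
    """AdaBoost with stumps; y must be +/- 1. Returns [(alpha_t, h_t)]."""
    w = np.ones(len(y))
    ensemble = []
    for _ in range(T):
        p = w / w.sum()                        # normalize weights
        stump = weak_learn(X, y, p)            # weak learner on weighted data
        pred = stump_predict(stump, X)
        eps = np.clip(p @ (pred != y), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - eps) / eps)
        w *= np.exp(-alpha * y * pred)         # upweight misclassified points
        ensemble.append((alpha, stump))
    return ensemble

def predict(ensemble, X):
    return np.sign(sum(a * stump_predict(s, X) for a, s in ensemble))
```

On a 1-D "interval" dataset (positive labels in the middle of the line), no single stump can be perfect, but a weighted vote of three stumps already classifies every point correctly.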
Training Error Bound
If each weak learner has error $\epsilon_t \leq \frac{1}{2} - \gamma$, then training error decays exponentially:

$$\frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[H(x_i) \neq y_i] \leq e^{-2\gamma^2 T}$$

where:
- $\gamma$ is the weak learner's edge over random guessing
- $T$ is the number of boosting rounds
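As a quick sanity check on this bound: once $e^{-2\gamma^2 T} < 1/n$, the training error must be exactly 0, since it is a multiple of $1/n$. Solving gives $T > \ln(n) / (2\gamma^2)$ rounds. A small helper (illustrative, assuming the bound as stated above):

```python
import math

def adaboost_error_bound(gamma, T):
    """Upper bound exp(-2 * gamma^2 * T) on AdaBoost's training error."""
    return math.exp(-2 * gamma ** 2 * T)

def rounds_for_zero_error(gamma, n):
    """Rounds after which the bound drops below 1/n, forcing 0 train error."""
    return math.ceil(math.log(n) / (2 * gamma ** 2))
```

For example, with edge $\gamma = 0.1$ and $n = 1000$ training points, a few hundred rounds suffice: the dependence on $n$ is only logarithmic, while halving the edge quadruples the rounds needed.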
Alternate View: Gradient Descent on Exponential Loss
AdaBoost can be derived as coordinate descent on the exponential loss $\sum_{i=1}^{n} e^{-y_i F(x_i)}$, where $F(x) = \sum_t \alpha_t h_t(x)$ is the current ensemble. This perspective underlies Gradient Boosted Trees, which generalize to any differentiable loss.
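A one-step sketch of why this recovers AdaBoost's $\alpha_t$ (a standard derivation, not verbatim from the slides). Adding a new learner $h$ with weight $\alpha$ to the current ensemble $F$ splits the loss over correctly and incorrectly classified points:

```latex
L(F + \alpha h) = \sum_i e^{-y_i F(x_i)}\, e^{-\alpha y_i h(x_i)}
                = W_+ e^{-\alpha} + W_- e^{\alpha},
```

where $w_i = e^{-y_i F(x_i)}$ are exactly AdaBoost's unnormalized weights, $W_+ = \sum_{i : h(x_i) = y_i} w_i$, and $W_- = \sum_{i : h(x_i) \neq y_i} w_i$. Setting $\partial L / \partial \alpha = 0$ gives

```latex
\alpha^{\star} = \tfrac{1}{2} \ln \frac{W_+}{W_-} = \tfrac{1}{2} \ln \frac{1 - \epsilon}{\epsilon},
```

which is precisely the $\alpha_t$ used in the AdaBoost update.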
Intuition
Boosting is greedy gradient descent in function space. Each weak learner fits the residual (the negative gradient of the loss at the current ensemble's predictions) and gets added with a small step size. The weight-updating view and the gradient-descent view are the same algorithm seen from two angles: "focus on what we got wrong" is literally "move in the direction of the gradient."
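This function-space view is easiest to see for squared loss, where the negative gradient is just the residual $y - F(x)$. A minimal sketch with regression stumps (helper names are illustrative; real gradient-boosted-tree libraries differ in many details such as shrinkage schedules and tree depth):

```python
import numpy as np

def fit_stump(X, r):
    """Regression stump minimizing squared error against residuals r."""
    best_sse, best = np.inf, None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j])[:-1]:   # keep both sides nonempty
            left = X[:, j] <= thr
            lmean, rmean = r[left].mean(), r[~left].mean()
            sse = ((r[left] - lmean) ** 2).sum() + ((r[~left] - rmean) ** 2).sum()
            if sse < best_sse:
                best_sse, best = sse, (j, thr, lmean, rmean)
    return best

def stump_value(stump, X):
    j, thr, lmean, rmean = stump
    return np.where(X[:, j] <= thr, lmean, rmean)

def gradient_boost(X, y, rounds=50, lr=0.1):
    """Gradient boosting for squared loss: each stump fits the residual."""
    F = np.zeros(len(y))       # current ensemble prediction
    stumps = []
    for _ in range(rounds):
        r = y - F              # negative gradient of (1/2)(y - F)^2
        s = fit_stump(X, r)
        F += lr * stump_value(s, X)   # small step in function space
        stumps.append(s)
    return stumps, F
```

Each round shrinks the residual; the learning rate `lr` is the "small step size" from the paragraph above.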
Overfitting behavior
AdaBoost surprisingly often keeps improving test error even after training error hits 0, as the margin keeps growing. With very expressive base learners, it can and does overfit.
Viola-Jones face detection (2001): start with very simple Haar-like feature classifiers, use boosting to combine them into a cascade. First real-time face detector.
Bagging vs Boosting
| | Bagging | Boosting |
|---|---|---|
| Goal | Reduce variance | Reduce bias |
| Data | Bootstrap (parallel) | Reweighted (sequential) |
| Parallel? | Yes | No, inherently sequential |
| Base learner | Full-depth trees | Weak (decision stumps) |
Both are simple and flexible with any base learner, but both lose some interpretability compared to a single tree.
Slides: http://www.gautamkamath.com/courses/CS480-fa2025-files/lec8.pdf