Boosting
Boosting combines many weak learners (55% accuracy, barely better than random) into a strong learner (90%+) by iteratively reweighting the dataset.
Why reweight instead of resample like bagging?
Bagging reduces variance by averaging independent-ish learners. Boosting targets bias: each new learner focuses on the mistakes of the previous ones, so the committee as a whole can represent functions none of them could alone.
Unlike bagging, boosting is inherently sequential.
Iterative process:
- Train a classifier on the current weights
- Downweight points it gets right, upweight points it gets wrong
- Train the next classifier on the new weighted dataset
- Final prediction: weighted vote of all learners
Detour: Online Learning with Experts (Hedge)
Before AdaBoost, consider the prediction-with-experts setting. For $t = 1, \dots, T$:
- Algorithm chooses a distribution $p^{(t)}$ over $N$ experts
- Adversary reveals loss vector $\ell^{(t)} \in [0, 1]^N$
- Algorithm incurs loss $\langle p^{(t)}, \ell^{(t)} \rangle$
Goal: minimize total loss $\sum_{t=1}^{T} \langle p^{(t)}, \ell^{(t)} \rangle$, competing with the best single expert in hindsight, $\min_i \sum_{t=1}^{T} \ell_i^{(t)}$.
Hedge Algorithm
Parameter $\beta \in (0, 1)$:
- Init $w_i^{(1)} = 1$ for all $i \in [N]$
- For $t = 1, \dots, T$:
  - Choose $p^{(t)} = w^{(t)} / \sum_i w_i^{(t)}$
  - Suffer loss $\langle p^{(t)}, \ell^{(t)} \rangle$
  - Update $w_i^{(t+1)} = w_i^{(t)} \cdot \beta^{\ell_i^{(t)}}$ (downweight bad experts)
Guarantee (with $\beta$ tuned as a function of $T$ and $N$):

$$\sum_{t=1}^{T} \langle p^{(t)}, \ell^{(t)} \rangle \le \min_i \sum_{t=1}^{T} \ell_i^{(t)} + O\left(\sqrt{T \ln N}\right)$$

where:
- $\beta$ is the learning-rate-like parameter
- $N$ is the number of experts, $T$ the number of rounds
Average regret $O(\sqrt{\ln N / T})$ goes to 0 as $T \to \infty$. Extremely general, used in linear programming, game theory, etc.
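The loop above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the lecture; the "losses as a $T \times N$ matrix" interface is my own framing of the adversary:

```python
import numpy as np

def hedge(losses, beta=0.9):
    """Run Hedge on a T x N matrix of per-round expert losses in [0, 1].

    beta in (0, 1) is the downweighting parameter. Returns the
    algorithm's total loss and the best single expert's total loss.
    """
    T, N = losses.shape
    w = np.ones(N)                      # init: uniform weights
    total = 0.0
    for t in range(T):
        p = w / w.sum()                 # distribution over experts
        total += p @ losses[t]          # suffer expected loss under p
        w *= beta ** losses[t]          # downweight experts with high loss
    best = losses.sum(axis=0).min()     # best expert in hindsight
    return total, best
```

On a toy instance where one expert is always right, Hedge's weight concentrates on it geometrically, so total loss stays bounded no matter how long the game runs.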
AdaBoost: Hedge on Datapoints
The insight: instead of treating classifiers as experts, treat datapoints as experts. Upweight points that haven't been learned yet.
Intuition
Each round, the algorithm stares at which points the current committee still botches and says "next learner, focus here." Easy points (already correct) get turned down; hard points get turned up until some weak learner picks up on them. You end up with a committee where each member specializes in a different slice of the hard cases, and the weighted vote stitches the specialists together.
- Init weights $w_i^{(1)} = 1$ for all $i \in [n]$
- For $t = 1, \dots, T$:
  - Normalize: $p^{(t)} = w^{(t)} / \sum_i w_i^{(t)}$
  - Run WeakLearn on the weighted training set to get classifier $h_t$
  - Compute weighted error $\epsilon_t = \sum_i p_i^{(t)} \, \mathbb{1}[h_t(x_i) \neq y_i]$ (should be $\leq \frac{1}{2} - \gamma$ by the weak learner guarantee)
  - Let $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$ and update $w_i^{(t+1)} = w_i^{(t)} \cdot e^{-\alpha_t y_i h_t(x_i)}$ (upweight points misclassified by $h_t$)
- Final classifier: weighted vote, $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$
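A minimal NumPy sketch of this loop, using decision stumps (single-feature thresholds) as the weak learner. The stump search and function names are illustrative, not from the slides:

```python
import numpy as np

def weak_learn(X, y, p):
    """Return the decision stump (feature, threshold, sign) with the
    lowest weighted error under the distribution p."""
    best_err, best = np.inf, None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] <= thr, 1, -1)
                err = p @ (pred != y)
                if err < best_err:
                    best_err, best = err, (j, thr, sign)
    return best

def stump_predict(stump, X):
    j, thr, sign = stump
    return sign * np.where(X[:, j] <= thr, 1, -1)

def adaboost(X, y, T=20):
    """AdaBoost with stumps; y must be +/- 1. Returns [(alpha_t, h_t)]."""
    w = np.ones(len(y))
    ensemble = []
    for _ in range(T):
        p = w / w.sum()                        # normalize weights
        stump = weak_learn(X, y, p)            # weak learner on weighted data
        pred = stump_predict(stump, X)
        eps = np.clip(p @ (pred != y), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - eps) / eps)
        w *= np.exp(-alpha * y * pred)         # upweight misclassified points
        ensemble.append((alpha, stump))
    return ensemble

def predict(ensemble, X):
    return np.sign(sum(a * stump_predict(s, X) for a, s in ensemble))
```

On a 1-D "interval" dataset (positive labels in the middle of the line), no single stump can be perfect, but a weighted vote of three stumps already classifies every point correctly.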
Training Error Bound
If each weak learner has error $\epsilon_t \leq \frac{1}{2} - \gamma$, then training error decays exponentially:

$$\frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[H(x_i) \neq y_i] \leq e^{-2\gamma^2 T}$$

where:
- $\gamma$ is the weak learner's edge over random guessing
- $T$ is the number of boosting rounds
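As a quick sanity check on this bound: once $e^{-2\gamma^2 T} < 1/n$, the training error must be exactly 0, since it is a multiple of $1/n$. Solving gives $T > \ln(n) / (2\gamma^2)$ rounds. A small helper (illustrative, assuming the bound as stated above):

```python
import math

def adaboost_error_bound(gamma, T):
    """Upper bound exp(-2 * gamma^2 * T) on AdaBoost's training error."""
    return math.exp(-2 * gamma ** 2 * T)

def rounds_for_zero_error(gamma, n):
    """Rounds after which the bound drops below 1/n, forcing 0 train error."""
    return math.ceil(math.log(n) / (2 * gamma ** 2))
```

For example, with edge $\gamma = 0.1$ and $n = 1000$ training points, a few hundred rounds suffice: the dependence on $n$ is only logarithmic, while halving the edge quadruples the rounds needed.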
Alternate View: Gradient Descent on Exponential Loss
AdaBoost can be derived as coordinate descent on the exponential loss $\sum_{i=1}^{n} e^{-y_i F(x_i)}$, where $F(x) = \sum_t \alpha_t h_t(x)$ is the current ensemble. This perspective underlies Gradient Boosted Trees, which generalize to any differentiable loss.
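A one-step sketch of why this recovers AdaBoost's $\alpha_t$ (a standard derivation, not verbatim from the slides). Adding a new learner $h$ with weight $\alpha$ to the current ensemble $F$ splits the loss over correctly and incorrectly classified points:

```latex
L(F + \alpha h) = \sum_i e^{-y_i F(x_i)}\, e^{-\alpha y_i h(x_i)}
                = W_+ e^{-\alpha} + W_- e^{\alpha},
```

where $w_i = e^{-y_i F(x_i)}$ are exactly AdaBoost's unnormalized weights, $W_+ = \sum_{i : h(x_i) = y_i} w_i$, and $W_- = \sum_{i : h(x_i) \neq y_i} w_i$. Setting $\partial L / \partial \alpha = 0$ gives

```latex
\alpha^{\star} = \tfrac{1}{2} \ln \frac{W_+}{W_-} = \tfrac{1}{2} \ln \frac{1 - \epsilon}{\epsilon},
```

which is precisely the $\alpha_t$ used in the AdaBoost update.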
Intuition
Boosting is greedy gradient descent in function space. Each weak learner fits the residual (the negative gradient of the loss at the current ensemble's predictions) and gets added with a small step size. The weight-updating view and the gradient-descent view are the same algorithm seen from two angles: "focus on what we got wrong" is literally "move in the direction of the gradient."
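This function-space view is easiest to see for squared loss, where the negative gradient is just the residual $y - F(x)$. A minimal sketch with regression stumps (helper names are illustrative; real gradient-boosted-tree libraries differ in many details such as shrinkage schedules and tree depth):

```python
import numpy as np

def fit_stump(X, r):
    """Regression stump minimizing squared error against residuals r."""
    best_sse, best = np.inf, None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j])[:-1]:   # keep both sides nonempty
            left = X[:, j] <= thr
            lmean, rmean = r[left].mean(), r[~left].mean()
            sse = ((r[left] - lmean) ** 2).sum() + ((r[~left] - rmean) ** 2).sum()
            if sse < best_sse:
                best_sse, best = sse, (j, thr, lmean, rmean)
    return best

def stump_value(stump, X):
    j, thr, lmean, rmean = stump
    return np.where(X[:, j] <= thr, lmean, rmean)

def gradient_boost(X, y, rounds=50, lr=0.1):
    """Gradient boosting for squared loss: each stump fits the residual."""
    F = np.zeros(len(y))       # current ensemble prediction
    stumps = []
    for _ in range(rounds):
        r = y - F              # negative gradient of (1/2)(y - F)^2
        s = fit_stump(X, r)
        F += lr * stump_value(s, X)   # small step in function space
        stumps.append(s)
    return stumps, F
```

Each round shrinks the residual; the learning rate `lr` is the "small step size" from the paragraph above.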
Overfitting behavior
AdaBoost surprisingly often keeps improving test error even after training error hits 0, as the margin keeps growing. With very expressive base learners, it can and does overfit.
Viola-Jones face detection (2001): start with very simple Haar-like feature classifiers, use boosting to combine them into a cascade. First real-time face detector.
Bagging vs Boosting
| | Bagging | Boosting |
|---|---|---|
| Goal | Reduce variance | Reduce bias |
| Data | Bootstrap (parallel) | Reweighted (sequential) |
| Parallel? | Yes | No, inherently sequential |
| Base learner | Full-depth trees | Weak (decision stumps) |
Both are simple and flexible with any base learner, but both lose some interpretability compared to a single tree.
Slides: http://www.gautamkamath.com/courses/CS480-fa2025-files/lec8.pdf