Decision Tree

Boosting

Boosting combines many weak learners (e.g., 55% accuracy, barely better than random guessing) into a strong learner (90%+ accuracy) by iteratively reweighting the dataset.

Why reweight instead of resample like bagging?

Bagging reduces variance by averaging independent-ish learners. Boosting targets bias: each new learner focuses on the mistakes of the previous ones, so the committee as a whole can represent functions none of them could alone.

Unlike bagging, boosting is inherently sequential.

Iterative process:

  1. Train a classifier on the current weights
  2. Downweight points it gets right, upweight points it gets wrong
  3. Train the next classifier on the new weighted dataset
  4. Final prediction: weighted vote of all learners
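
To make the loop concrete, here is a minimal sketch in Python, assuming scikit-learn decision stumps as the base learner and labels in $\{-1, +1\}$; the $\times 2$ / $\div 2$ reweighting multipliers are arbitrary toy choices (AdaBoost, below, picks them in a principled way):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def toy_boost(X, y, rounds=10):
    """Toy boosting loop: train, reweight the mistakes, repeat.

    The x2 / /2 multipliers are arbitrary; AdaBoost derives the right ones."""
    n = len(y)
    w = np.ones(n) / n                        # start with uniform weights
    committee = []
    for _ in range(rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)      # 1. train on current weights
        wrong = stump.predict(X) != y
        w[wrong] *= 2.0                       # 2. upweight the mistakes...
        w[~wrong] /= 2.0                      #    ...downweight correct points
        w /= w.sum()                          # 3. renormalize for next round
        committee.append(stump)
    return committee

def majority_vote(committee, X):
    # 4. final prediction: majority vote (unweighted, for simplicity)
    preds = np.array([c.predict(X) for c in committee])
    return np.sign(preds.sum(axis=0))         # assumes labels are -1 / +1
```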

Detour: Online Learning with Experts (Hedge)

Before AdaBoost, consider the prediction-with-experts setting. For each round $t = 1, \dots, T$:

  1. Algorithm chooses a distribution $p^t$ over the $N$ experts
  2. Adversary reveals a loss vector $\ell^t \in [0, 1]^N$
  3. Algorithm incurs loss $p^t \cdot \ell^t$

Goal: minimize the total loss $\sum_{t=1}^T p^t \cdot \ell^t$, competing with the best single expert in hindsight, $\min_i \sum_{t=1}^T \ell_i^t$.

Hedge Algorithm

Parameter $\beta \in (0, 1)$:

  1. Init $w_i^1 = 1$ for every expert $i$
  2. For $t = 1, \dots, T$:
    • Play $p^t = w^t / \sum_i w_i^t$
    • Suffer loss $p^t \cdot \ell^t$
    • Update $w_i^{t+1} = w_i^t \cdot \beta^{\ell_i^t}$ (downweight bad experts)
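
A minimal sketch of Hedge in Python, assuming the whole loss sequence is handed to us as a $T \times N$ matrix with entries in $[0, 1]$ (in the true online setting each row would only be revealed after we commit to $p^t$):

```python
import numpy as np

def hedge(loss_matrix, beta=0.9):
    """Run Hedge on a (T, N) matrix of expert losses in [0, 1].

    Returns the algorithm's total loss and the best expert's total loss."""
    T, N = loss_matrix.shape
    w = np.ones(N)                        # init: equal weight on every expert
    total = 0.0
    for t in range(T):
        p = w / w.sum()                   # play the normalized weight vector
        total += p @ loss_matrix[t]       # suffer expected loss p^t . l^t
        w *= beta ** loss_matrix[t]       # downweight experts that lost
    best = loss_matrix.sum(axis=0).min()  # best single expert in hindsight
    return total, best
```

The gap `total - best` is the regret, which the guarantee below bounds.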

Guarantee:

$$\sum_{t=1}^T p^t \cdot \ell^t \;\le\; \frac{\ln(1/\beta)\,\min_i \sum_{t=1}^T \ell_i^t + \ln N}{1 - \beta}$$

where:

  • $\beta$ is the learning-rate-like parameter
  • $N$ is the number of experts, $T$ the number of rounds

With a suitable choice of $\beta$, the regret is $O(\sqrt{T \ln N})$, so the average regret goes to 0 as $T \to \infty$. Extremely general: multiplicative-weights updates of this kind are used in linear programming, game theory, etc.
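
Concretely (a standard tuning from the Freund-Schapire analysis, assuming a known upper bound $\tilde{L} \ge \min_i \sum_t \ell_i^t$, e.g. $\tilde{L} = T$): choosing

$$\beta = \frac{1}{1 + \sqrt{2 \ln N / \tilde{L}}} \quad\text{gives}\quad \sum_{t=1}^T p^t \cdot \ell^t \;\le\; \min_i \sum_{t=1}^T \ell_i^t + \sqrt{2 \tilde{L} \ln N} + \ln N.$$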

AdaBoost: Hedge on Datapoints

The insight: instead of treating classifiers as experts, treat datapoints as experts. Upweight points that haven’t been learned yet.

Intuition

Each round, the algorithm stares at which points the current committee still botches and says β€œnext learner, focus here.” Easy points (already correct) get turned down; hard points get turned up until some weak learner picks up on them. You end up with a committee where each member specializes in a different slice of the hard cases, and the weighted vote stitches the specialists together.

  1. Init weights $w_i^1 = 1/n$ over the $n$ training points
  2. For $t = 1, \dots, T$:
    1. Normalize: $p^t = w^t / \sum_i w_i^t$
    2. Run WeakLearn on the weighted training set to get classifier $h_t$
    3. Compute weighted error $\epsilon_t = \sum_i p_i^t\,\mathbb{1}[h_t(x_i) \neq y_i]$ (should be $\le \frac{1}{2} - \gamma$ by the weak learner guarantee)
    4. Let $\alpha_t = \frac{1}{2}\ln\frac{1 - \epsilon_t}{\epsilon_t}$ and update $w_i^{t+1} = p_i^t\, e^{-\alpha_t y_i h_t(x_i)}$ (upweight points misclassified by $h_t$)
  3. Final classifier: weighted vote, $H(x) = \operatorname{sign}\big(\sum_{t=1}^T \alpha_t h_t(x)\big)$, with labels in $\{-1, +1\}$
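
A compact sketch of these steps in Python, assuming scikit-learn decision stumps play the role of WeakLearn and labels are in $\{-1, +1\}$:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=50):
    """AdaBoost with decision stumps; y must take values in {-1, +1}."""
    n = len(y)
    w = np.ones(n) / n                           # 1. init uniform weights
    stumps, alphas = [], []
    for _ in range(T):
        p = w / w.sum()                          # 2.1 normalize
        h = DecisionTreeClassifier(max_depth=1)  # 2.2 run WeakLearn
        h.fit(X, y, sample_weight=p)
        pred = h.predict(X)
        eps = p[pred != y].sum()                 # 2.3 weighted error
        eps = np.clip(eps, 1e-10, 1 - 1e-10)     # guard against eps = 0
        alpha = 0.5 * np.log((1 - eps) / eps)    # 2.4 this learner's vote
        w = p * np.exp(-alpha * y * pred)        # upweight misclassified points
        stumps.append(h)
        alphas.append(alpha)
    return stumps, np.array(alphas)

def ada_predict(stumps, alphas, X):
    # 3. final classifier: sign of the alpha-weighted vote
    votes = np.array([h.predict(X) for h in stumps])
    return np.sign(alphas @ votes)
```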

Training Error Bound

If each weak learner has error $\epsilon_t \le \frac{1}{2} - \gamma$, then the training error decays exponentially:

$$\frac{1}{n}\sum_{i=1}^n \mathbb{1}[H(x_i) \neq y_i] \;\le\; \prod_{t=1}^T 2\sqrt{\epsilon_t(1 - \epsilon_t)} \;\le\; e^{-2\gamma^2 T}$$

where:

  • $\gamma$ is the weak learners' edge over random guessing
  • $T$ is the number of boosting rounds
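
To get a feel for the rate (numbers chosen purely for illustration): with edge $\gamma = 0.1$ (weak learners only 60% accurate) and $T = 300$ rounds,

$$e^{-2\gamma^2 T} = e^{-2 \cdot (0.1)^2 \cdot 300} = e^{-6} \approx 0.0025,$$

so 300 barely-better-than-random learners already drive training error below 0.25%.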

Alternate View: Gradient Descent on Exponential Loss

AdaBoost can be derived as coordinate descent on the exponential loss $\sum_i e^{-y_i f(x_i)}$ of the ensemble $f(x) = \sum_t \alpha_t h_t(x)$. This perspective underlies Gradient Boosted Trees, which generalize the idea to any differentiable loss.
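
A one-step sketch of that derivation: adding a new learner $\alpha h_t$ to the current ensemble $f$ and splitting the loss over the points $h_t$ gets right and wrong (with $\epsilon_t$ the weighted error under the current weights, which are proportional to $e^{-y_i f(x_i)}$),

$$\sum_i e^{-y_i (f(x_i) + \alpha h_t(x_i))} \;\propto\; (1 - \epsilon_t)\, e^{-\alpha} + \epsilon_t\, e^{\alpha},$$

and setting the derivative in $\alpha$ to zero recovers exactly AdaBoost's $\alpha_t = \frac{1}{2}\ln\frac{1 - \epsilon_t}{\epsilon_t}$.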

Intuition

Boosting is greedy gradient descent in function space. Each weak learner fits the residual (the negative gradient of the loss at the current ensemble’s predictions) and gets added with a small step size. The weight-updating view and the gradient-descent view are the same algorithm seen from two angles: β€œfocus on what we got wrong” is literally β€œmove in the direction of the gradient.”
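
To make the function-space view concrete, here is a sketch of gradient boosting for regression under squared loss, where the negative gradient at the current predictions is literally the residual $y - f(x)$; the stump base learner and the step size `lr` are illustrative choices:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, rounds=100, lr=0.1):
    """Gradient boosting for regression with squared loss.

    Each round fits a stump to the residual (the negative gradient)."""
    f = np.zeros(len(y))                  # current ensemble's predictions
    trees = []
    for _ in range(rounds):
        residual = y - f                  # -dL/df for L = (y - f)^2 / 2
        tree = DecisionTreeRegressor(max_depth=1)
        tree.fit(X, residual)             # weak learner fits the gradient
        f += lr * tree.predict(X)         # small step in function space
        trees.append(tree)
    return trees

def gb_predict(trees, X, lr=0.1):
    # sum the small steps taken by every tree
    return lr * sum(tree.predict(X) for tree in trees)
```

Swapping squared loss for any differentiable loss (and residuals for its negative gradient) gives the general recipe behind Gradient Boosted Trees.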

Overfitting behavior

Surprisingly, AdaBoost often keeps improving test error even after training error hits 0; the classical explanation is that the margins of the training points keep growing. With very expressive base learners, though, it can and does overfit.

Viola-Jones face detection (2001): start with very simple Haar-like feature classifiers and use boosting to combine them into a cascade. This was the first real-time face detector.

Bagging vs Boosting

                  Bagging                  Boosting
  Goal            Reduce variance          Reduce bias
  Data            Bootstrap (parallel)     Reweighted (sequential)
  Parallel?       Yes                      No, inherently sequential
  Base learner    Full-depth trees         Weak (decision stumps)

Both are simple and flexible with any base learner, but both lose some interpretability compared to a single tree.

Slides: http://www.gautamkamath.com/courses/CS480-fa2025-files/lec8.pdf