Decision Tree

Bootstrap Aggregating (Bagging)

Bagging is bootstrap sampling plus aggregation: train many copies of a high-variance base learner (such as a decision tree) on resampled datasets and average their predictions to reduce variance.

Bootstrap sampling intuition

Averaging $k$ predictors trained on independent datasets of size $n$ reduces variance by a factor of $k$. We don’t have $k$ independent datasets, so we cheat and resample with replacement from the one we have. Not truly independent, but it works well in practice.

The Variance Argument

Suppose we want to estimate a mean $\mu = \mathbb{E}[x]$ given samples $x_1, \dots, x_n$ with variance $\sigma^2$. Using the empirical mean $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$:

$$\mathbb{E}[\hat{\mu}] = \mu, \qquad \mathrm{Var}(\hat{\mu}) = \frac{\sigma^2}{n}$$

If we had $kn$ points, we could form $k$ disjoint sets of size $n$, compute $\hat{\mu}^{(j)}$ for each, and average:

$$\bar{\mu} = \frac{1}{k}\sum_{j=1}^{k} \hat{\mu}^{(j)}, \qquad \mathrm{Var}(\bar{\mu}) = \frac{\sigma^2}{kn}$$

where:

  • $k$ is the number of disjoint subsets
  • $\hat{\mu}^{(j)}$ is the empirical mean on subset $j$
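
A quick numerical check of this variance reduction (a minimal sketch; the Gaussian data, sample sizes, and trial count are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, trials = 50, 10, 20_000
sigma = 2.0  # true standard deviation of x

# Variance of a single empirical mean over n samples.
single = np.array([rng.normal(0, sigma, n).mean() for _ in range(trials)])

# Variance of the average of k empirical means, each computed on its own
# independent dataset of n samples (k*n points total).
averaged = np.array([
    np.mean([rng.normal(0, sigma, n).mean() for _ in range(k)])
    for _ in range(trials)
])

print(single.var())    # ~ sigma^2 / n      = 0.08
print(averaged.var())  # ~ sigma^2 / (k*n)  = 0.008, i.e. k times smaller
```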

Intuition

Averaging $k$ independent noisy estimates cuts variance by a factor of $k$: the errors in different copies point in random directions and cancel, while the signal adds. The trees don’t need to be good; they need to be diverse and uncorrelated. Bagging’s whole job is to manufacture that diversity from a single fixed dataset.

Variance drops by a factor of $k$, but we needed $k$ times more data.

Bootstrap Sampling

Given a dataset $D$ of size $n$, create $k$ datasets $D_1, \dots, D_k$, each of size $n$, by drawing $n$ samples with replacement from the original.

Example: from $D = \{x_1, x_2, x_3, x_4, x_5\}$, one bootstrap sample might be $(x_2, x_5, x_2, x_1, x_4)$: some points repeat, others are left out entirely.

Each bootstrap sample contains ~63% of the unique original points, since the chance a given point is never drawn is $(1 - 1/n)^n \approx e^{-1} \approx 0.37$. The remaining ~37% “out-of-bag” points give you a free held-out set per tree: you can evaluate tree $j$ on the points it never saw.
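
A small sketch of bootstrap sampling and the out-of-bag fraction (the NumPy usage here is my own illustration, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
data = np.arange(n)                # stand-in for the original dataset D

# One bootstrap sample: n draws with replacement.
idx = rng.integers(0, n, size=n)
bootstrap_sample = data[idx]

# Fraction of unique original points that made it into the sample.
in_bag_fraction = np.unique(idx).size / n
print(in_bag_fraction)             # ≈ 0.632 ≈ 1 - 1/e

# Points never drawn are "out-of-bag": a free held-out set for the
# model trained on this particular bootstrap sample.
oob_points = np.setdiff1d(data, bootstrap_sample)
print(oob_points.size / n)         # ≈ 0.368
```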

Because the bootstrap samples aren’t independent, the variance doesn’t literally drop by a factor of $k$, but empirically it still drops substantially.
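
One way to make this precise (a standard identity for correlated averages, not from the slides): if the $k$ base learners each have variance $\sigma^2$ and pairwise correlation $\rho$, the variance of their average is

$$\rho\,\sigma^2 + \frac{1-\rho}{k}\,\sigma^2$$

so averaging can only shrink the second term; the correlated part $\rho\sigma^2$ remains, which is why decorrelating the learners matters so much.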

Algorithm

  1. Bootstrap-sample $k$ datasets $D_1, \dots, D_k$, each of size $n$
  2. Train a classifier $h_j$ on each $D_j$
  3. Aggregate:
    • Regression: $h(x) = \frac{1}{k}\sum_{j=1}^{k} h_j(x)$
    • Classification: majority vote of $h_1(x), \dots, h_k(x)$
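
A minimal sketch of this procedure for regression, with scikit-learn decision trees as the base learner (the helper names, toy data, and hyperparameters are my own choices, not from the slides); for classification, the aggregation step would instead take a majority vote over the trees’ predicted labels:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_fit(X, y, k=50, seed=0):
    """Train k decision trees, each on its own bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    trees = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)   # n indices drawn with replacement
        tree = DecisionTreeRegressor()     # unpruned tree: high-variance base learner
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def bagging_predict(trees, X):
    """Aggregate by averaging the individual trees' predictions (regression)."""
    return np.mean([tree.predict(X) for tree in trees], axis=0)

# Toy usage: noisy sine curve.
rng = np.random.default_rng(1)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=200)

trees = bagging_fit(X, y, k=50)
print(bagging_predict(trees, np.array([[1.5], [3.0]])))
```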

Where Bagging Helps

Bagging helps most with high-variance base learners. Decision trees are the classic example: tiny data perturbations produce very different trees. Bagging low-variance learners (like a well-regularized linear model) gains little.

Random forests extend bagging on decision trees with feature subsampling at each split, decorrelating the trees further.
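
Both are available off the shelf in scikit-learn; a minimal usage sketch (the synthetic dataset and hyperparameters are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Plain bagging: every tree considers the full feature set at every split.
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)

# Random forest: bagging plus random feature subsampling at each split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)

print(cross_val_score(bagged, X, y).mean())
print(cross_val_score(forest, X, y).mean())
```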

Bagging vs Boosting

  • Bagging: parallel, reduces variance, uses the full (bootstrapped) data on each learner
  • Boosting: sequential, reduces bias, reweights data to focus on mistakes

Slides: http://www.gautamkamath.com/courses/CS480-fa2025-files/lec8.pdf