Adversarial Robustness
Adversarial robustness is the study of making ML models resistant to imperceptible input perturbations that flip predictions, and of mounting such attacks.
ML models are surprisingly brittle: tiny perturbations can flip the prediction while leaving the human-perceived label unchanged.
Intuition
In high-dimensional input space, decision boundaries sit close to almost every data point. A nudge of a few pixels in a carefully chosen direction (the gradient) is enough to cross the boundary even though the image still looks the same to a human. Robustness is about forcing the boundary to back away from the data.
Why study adversaries if inputs look identical?
To be a good defender you have to be a good attacker. The same optimization procedure produces both.
See also Adversarial Machine Learning for the broader field. Reference: Adversarial Robustness, Theory and Practice by Zico Kolter and Aleksander Madry.
Why ML is Not Naturally Robust
Standard ML assumes train and test data are i.i.d. from the same distribution. That breaks under:
- Model misspecification: true process is not in the model family
- Measurement error / dirty data
- Adversarial manipulation: the focus of this note
Toy example (non-robust mean): estimating $\mu$ from samples $x_1, \dots, x_n$ with the sample mean $\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i$. A single outlier at $x_j = c$ can shift $\hat{\mu}$ by $c/n$, i.e. arbitrarily far as $c$ grows. Robust alternatives: prune outliers, or use the median (a robust statistic with bounded influence of any single sample).
One poisoned sample can drag the mean arbitrarily far, so an attacker who owns even a single data point owns the estimator. The median can’t be pulled past the middle of the sorted samples, no matter how large the outlier, which is what “bounded influence” buys you.
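A quick numpy sketch (illustrative data, not from the slides) of the bounded-influence point — one huge outlier drags the mean but not the median:

```python
import numpy as np

# One poisoned sample drags the mean arbitrarily far (shift = c/n),
# while the median stays near the bulk of the data (bounded influence).
rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=99)   # clean samples
poisoned = np.append(x, 1e6)                  # a single adversarial outlier

clean_mean, clean_median = x.mean(), np.median(x)
dirty_mean, dirty_median = poisoned.mean(), np.median(poisoned)

print(dirty_mean)    # roughly 1e6 / 100 = 10000: shifted by c/n
print(dirty_median)  # still close to 0
```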
Adversarial Example
Given a trained model $f$ and a test input $x$ with true label $y$, an adversarial example $x'$ satisfies:
- $\text{label}(x') = \text{label}(x)$: same true label according to a human
- $f(x') \neq y$: the model is fooled
Human perception is hard to encode, so we use $\ell_p$-distance as a proxy: $\|x' - x\|_p \le \epsilon$. Most common are $\ell_0$ (sparse perturbations), $\ell_2$, and $\ell_\infty$ (pixel-level bounded).
Geometric picture
The $\ell_\infty$ ball is a hypercube: every pixel is free to wiggle up to $\epsilon$ independently. The $\ell_2$ ball is a sphere: a total “energy” budget you can concentrate in a few pixels or spread across many. $\ell_0$ says “change at most $k$ pixels, but to anything you want” (sticker attacks).
Other distances: Wasserstein, translations, rotations, resizing.
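The two most common budgets can be made concrete with a small numpy sketch (not from the notes): projecting onto the $\ell_\infty$ ball clips each coordinate independently, while projecting onto the $\ell_2$ ball rescales the whole vector.

```python
import numpy as np

def project_linf(delta, eps):
    # Hypercube: each coordinate is constrained independently to [-eps, eps].
    return np.clip(delta, -eps, eps)

def project_l2(delta, eps):
    # Sphere: if the total "energy" exceeds eps, rescale the whole vector.
    norm = np.linalg.norm(delta)
    return delta if norm <= eps else delta * (eps / norm)

delta = np.array([0.3, -0.05, 0.2])
print(project_linf(delta, 0.1))                 # coordinates clipped to +/-0.1
print(np.linalg.norm(project_l2(delta, 0.1)))   # norm rescaled down to 0.1
```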
Attack Taxonomy
- White-box vs black-box: does the attacker know the model $f$?
- Targeted vs untargeted: force a specific wrong label $y_t$, or any wrong label?
Attacker: Fast Gradient Sign Method (FGSM)
FGSM crafts $x' = x + \delta$ with $\|\delta\|_\infty \le \epsilon$ via a single gradient step:

$$x' = x + \epsilon \cdot \text{sign}\big(\nabla_x \mathcal{L}(f(x), y)\big)$$

It takes the biggest linear step that stays in the $\ell_\infty$ ball:

$$\delta^* = \arg\max_{\|\delta\|_\infty \le \epsilon} \delta^\top \nabla_x \mathcal{L}(f(x), y) = \epsilon \cdot \text{sign}\big(\nabla_x \mathcal{L}(f(x), y)\big)$$
One step, one gradient computation. Fast, sometimes surprisingly effective.
Linearize the loss, then push every pixel by exactly $\epsilon$ in whichever direction raises the loss. Sign-of-gradient is the $\ell_\infty$-optimal direction because, under the cube constraint, each coordinate’s contribution is maximized independently at $\pm\epsilon$.
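The single FGSM step can be sketched on a toy logistic-regression model; the weights, input, and $\epsilon$ below are made up for illustration:

```python
import numpy as np

def bce_loss(x, w, y):
    # Binary cross-entropy for label y in {0, 1} under a logistic model.
    p = 1.0 / (1.0 + np.exp(-w @ x))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad_x(x, w, y):
    # Gradient of the loss w.r.t. the INPUT: (p - y) * w.
    p = 1.0 / (1.0 + np.exp(-w @ x))
    return (p - y) * w

def fgsm(x, w, y, eps):
    # One signed-gradient step: the l_inf-optimal linear move of size eps.
    return x + eps * np.sign(grad_x(x, w, y))

w = np.array([1.0, -2.0, 0.5])   # hand-picked "trained" weights (illustrative)
x = np.array([0.2, 0.1, -0.3])
y = 1
x_adv = fgsm(x, w, y, eps=0.1)

print(np.max(np.abs(x_adv - x)))          # stays within the eps budget
print(bce_loss(x, w, y), bce_loss(x_adv, w, y))   # loss goes up
```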
Projected Gradient Descent (PGD)
Multi-step FGSM; the de facto standard strong first-order attack.
Take small signed-gradient steps of size $\alpha$, projecting back into the $\ell_\infty$ ball at each step (clip each coordinate to $[x_i - \epsilon,\ x_i + \epsilon]$):

$$x^{(t+1)} = \Pi_{\|x' - x\|_\infty \le \epsilon}\Big( x^{(t)} + \alpha \cdot \text{sign}\big(\nabla_x \mathcal{L}(f(x^{(t)}), y)\big) \Big)$$
FGSM assumes the loss is linear; real losses curve, so one step undershoots. PGD just runs many small FGSM steps and snaps back to the ball after each, climbing the loss surface while staying inside the allowed budget.
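A minimal PGD sketch under the same kind of toy setup (hand-picked logistic weights, illustrative only): repeated small FGSM steps, each followed by a coordinate-wise clip back into the ball.

```python
import numpy as np

def bce_loss(x, w, y):
    p = 1.0 / (1.0 + np.exp(-w @ x))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad_x(x, w, y):
    p = 1.0 / (1.0 + np.exp(-w @ x))
    return (p - y) * w

def pgd(x, w, y, eps, alpha, steps):
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_x(x_adv, w, y))  # small FGSM step
        x_adv = x + np.clip(x_adv - x, -eps, eps)             # project into ball
    return x_adv

w = np.array([1.0, -2.0, 0.5])
x = np.array([0.2, 0.1, -0.3])
x_adv = pgd(x, w, y=1, eps=0.1, alpha=0.02, steps=20)
print(bce_loss(x, w, 1), bce_loss(x_adv, w, 1))   # loss climbs within the budget
```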
Untargeted vs Targeted
Untargeted attacks ascend the loss on the true label $y$; targeted attacks instead descend the loss on a chosen target label $y_t$ — same procedure with the sign flipped and the label swapped.
Defense: Adversarial Training (Madry et al.)
Instead of minimizing the expected loss, minimize the robust (worst-case) expected loss:

$$\min_\theta \ \mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ \max_{\|\delta\|_\infty \le \epsilon} \mathcal{L}\big(f_\theta(x + \delta), y\big) \right]$$
This is a saddle-point (min-max) problem. To be a good defender you have to be a good attacker.
Intuition
Instead of training on the clean point $x$, train on the worst point inside its $\epsilon$-ball. You’re flattening the loss landscape around each training example, so no small perturbation can find a loss spike. The inner max finds the worst nearby example; the outer min teaches the model to handle it.
Training loop:
- Sample a minibatch $\{(x_i, y_i)\}$
- For each $x_i$, run PGD to compute $\delta_i^* \approx \arg\max_{\|\delta\|_\infty \le \epsilon} \mathcal{L}\big(f_\theta(x_i + \delta), y_i\big)$
- Gradient step on the perturbed batch: $\theta \leftarrow \theta - \eta \, \nabla_\theta \sum_i \mathcal{L}\big(f_\theta(x_i + \delta_i^*), y_i\big)$
- Repeat
Expensive: each training step now runs a full PGD inner loop.
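The loop above can be sketched end to end on a toy problem; the data, model, and hyperparameters below are invented for illustration (real implementations run PGD inside an autodiff framework).

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_x(x, w, y):            # gradient of BCE loss w.r.t. the input
    return (sigmoid(w @ x) - y) * w

def grad_w(x, w, y):            # gradient of BCE loss w.r.t. the weights
    return (sigmoid(w @ x) - y) * x

def pgd_attack(x, w, y, eps=0.1, alpha=0.02, steps=10):
    # Inner max: find a near-worst-case point inside the eps-ball.
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_x(x_adv, w, y))
        x_adv = x + np.clip(x_adv - x, -eps, eps)
    return x_adv

# Two separable blobs: label 1 around (+1, +1), label 0 around (-1, -1).
X = np.vstack([rng.normal(+1, 0.3, (50, 2)), rng.normal(-1, 0.3, (50, 2))])
Y = np.array([1] * 50 + [0] * 50)

w = np.zeros(2)
for _ in range(50):                        # outer min over theta
    for x, y in zip(X, Y):
        x_adv = pgd_attack(x, w, y)        # inner max over delta
        w = w - 0.1 * grad_w(x_adv, w, y)  # SGD step on the worst-case point

acc = np.mean((sigmoid(X @ w) > 0.5) == (Y == 1))
print(acc)   # clean accuracy of the robustly trained model
```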
The State of Play
Attacks remain more effective than defenses. A famous result: 7 out of 9 defenses submitted to ICLR 2018 were broken within a day of being accepted. Certified defenses (randomized smoothing, interval bound propagation) give provable but usually weak guarantees.
Backdoor Attacks
A different threat model: the attacker modifies the training data to plant a trigger. At test time, inputs containing the trigger are misclassified; clean inputs behave normally. Related to data poisoning.
Slides from CS480 lec16.