Adversarial Robustness
Adversarial robustness is the study of making ML models resistant to imperceptible input perturbations that flip predictions, and of mounting such attacks.
ML models are surprisingly brittle: tiny perturbations can flip the prediction while leaving the human-perceived label unchanged.
Intuition
In high-dimensional input space, decision boundaries sit close to almost every data point. A nudge of a few pixels in a carefully chosen direction (the gradient) is enough to cross the boundary even though the image still looks the same to a human. Robustness is about forcing the boundary to back away from the data.
Why study adversaries if inputs look identical?
To be a good defender you have to be a good attacker. The same optimization procedure produces both.
See also Adversarial Machine Learning for the broader field. Reference: Adversarial Robustness, Theory and Practice by Zico Kolter and Aleksander Madry.
Why ML is Not Naturally Robust
Standard ML assumes train and test data are i.i.d. from the same distribution. That breaks under:
- Model misspecification: true process is not in the model family
- Measurement error / dirty data
- Adversarial manipulation: the focus of this note
Toy example (non-robust mean): estimating $\mu$ from samples $x_1, \dots, x_n$ with the sample mean $\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i$. A single outlier at $x_j = c$ can shift $\hat{\mu}$ by $c/n$, i.e. arbitrarily far as $c$ grows. Robust alternatives: prune outliers, or use the median (a robust statistic with bounded influence of any single sample).
One poisoned sample can drag the mean arbitrarily far, so an attacker who owns even a single data point owns the estimator. The median can’t be pulled past the middle of the sorted samples, no matter how large the outlier, which is what “bounded influence” buys you.
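A quick numpy sketch (illustrative data, not from the slides) of the bounded-influence point — one huge outlier drags the mean but not the median:

```python
import numpy as np

# One poisoned sample drags the mean arbitrarily far (shift = c/n),
# while the median stays near the bulk of the data (bounded influence).
rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=99)   # clean samples
poisoned = np.append(x, 1e6)                  # a single adversarial outlier

clean_mean, clean_median = x.mean(), np.median(x)
dirty_mean, dirty_median = poisoned.mean(), np.median(poisoned)

print(dirty_mean)    # roughly 1e6 / 100 = 10000: shifted by c/n
print(dirty_median)  # still close to 0
```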
Adversarial Example
Given a trained model $f$ and a test input $x$ with true label $y$, an adversarial example $x'$ satisfies:
- $\text{label}(x') = \text{label}(x)$: same true label according to a human
- $f(x') \neq y$: the model is fooled
Human perception is hard to encode, so we use $\ell_p$-distance as a proxy: $\|x' - x\|_p \le \epsilon$. Most common are $\ell_0$ (sparse perturbations), $\ell_2$, and $\ell_\infty$ (pixel-level bounded).
Geometric picture
The $\ell_\infty$ ball is a hypercube: every pixel is free to wiggle up to $\epsilon$ independently. The $\ell_2$ ball is a sphere: a total “energy” budget you can concentrate in a few pixels or spread across many. $\ell_0$ says “change at most $k$ pixels, but to anything you want” (sticker attacks).
Other distances: Wasserstein, translations, rotations, resizing.
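The two most common budgets can be made concrete with a small numpy sketch (not from the notes): projecting onto the $\ell_\infty$ ball clips each coordinate independently, while projecting onto the $\ell_2$ ball rescales the whole vector.

```python
import numpy as np

def project_linf(delta, eps):
    # Hypercube: each coordinate is constrained independently to [-eps, eps].
    return np.clip(delta, -eps, eps)

def project_l2(delta, eps):
    # Sphere: if the total "energy" exceeds eps, rescale the whole vector.
    norm = np.linalg.norm(delta)
    return delta if norm <= eps else delta * (eps / norm)

delta = np.array([0.3, -0.05, 0.2])
print(project_linf(delta, 0.1))                 # coordinates clipped to +/-0.1
print(np.linalg.norm(project_l2(delta, 0.1)))   # norm rescaled down to 0.1
```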
Attack Taxonomy
- White-box vs black-box: does the attacker know the model $f$?
- Targeted vs untargeted: force a specific wrong label $y_t$, or any wrong label?
Attacker: Fast Gradient Sign Method (FGSM)
FGSM crafts $x' = x + \delta$ with $\|\delta\|_\infty \le \epsilon$ via a single gradient step:

$$x' = x + \epsilon \cdot \text{sign}\big(\nabla_x \mathcal{L}(f(x), y)\big)$$

It takes the biggest linear step that stays in the $\ell_\infty$ ball:

$$\delta^* = \arg\max_{\|\delta\|_\infty \le \epsilon} \delta^\top \nabla_x \mathcal{L}(f(x), y) = \epsilon \cdot \text{sign}\big(\nabla_x \mathcal{L}(f(x), y)\big)$$
One step, one gradient computation. Fast, sometimes surprisingly effective.
Linearize the loss, then push every pixel by exactly $\epsilon$ in whichever direction raises the loss. Sign-of-gradient is the $\ell_\infty$-optimal direction because, under the cube constraint, each coordinate’s contribution is maximized independently at $\pm\epsilon$.
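The single FGSM step can be sketched on a toy logistic-regression model; the weights, input, and $\epsilon$ below are made up for illustration:

```python
import numpy as np

def bce_loss(x, w, y):
    # Binary cross-entropy for label y in {0, 1} under a logistic model.
    p = 1.0 / (1.0 + np.exp(-w @ x))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad_x(x, w, y):
    # Gradient of the loss w.r.t. the INPUT: (p - y) * w.
    p = 1.0 / (1.0 + np.exp(-w @ x))
    return (p - y) * w

def fgsm(x, w, y, eps):
    # One signed-gradient step: the l_inf-optimal linear move of size eps.
    return x + eps * np.sign(grad_x(x, w, y))

w = np.array([1.0, -2.0, 0.5])   # hand-picked "trained" weights (illustrative)
x = np.array([0.2, 0.1, -0.3])
y = 1
x_adv = fgsm(x, w, y, eps=0.1)

print(np.max(np.abs(x_adv - x)))          # stays within the eps budget
print(bce_loss(x, w, y), bce_loss(x_adv, w, y))   # loss goes up
```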
Projected Gradient Descent (PGD)
Multi-step FGSM; the de facto standard strong first-order attack.
Take small signed-gradient steps of size $\alpha$, projecting back into the $\ell_\infty$ ball at each step (clip each coordinate to $[x_i - \epsilon,\ x_i + \epsilon]$):

$$x^{(t+1)} = \Pi_{\|x' - x\|_\infty \le \epsilon}\Big( x^{(t)} + \alpha \cdot \text{sign}\big(\nabla_x \mathcal{L}(f(x^{(t)}), y)\big) \Big)$$
FGSM assumes the loss is linear; real losses curve, so one step undershoots. PGD just runs many small FGSM steps and snaps back to the ball after each, climbing the loss surface while staying inside the allowed budget.
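A minimal PGD sketch under the same kind of toy setup (hand-picked logistic weights, illustrative only): repeated small FGSM steps, each followed by a coordinate-wise clip back into the ball.

```python
import numpy as np

def bce_loss(x, w, y):
    p = 1.0 / (1.0 + np.exp(-w @ x))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad_x(x, w, y):
    p = 1.0 / (1.0 + np.exp(-w @ x))
    return (p - y) * w

def pgd(x, w, y, eps, alpha, steps):
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_x(x_adv, w, y))  # small FGSM step
        x_adv = x + np.clip(x_adv - x, -eps, eps)             # project into ball
    return x_adv

w = np.array([1.0, -2.0, 0.5])
x = np.array([0.2, 0.1, -0.3])
x_adv = pgd(x, w, y=1, eps=0.1, alpha=0.02, steps=20)
print(bce_loss(x, w, 1), bce_loss(x_adv, w, 1))   # loss climbs within the budget
```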
Untargeted vs Targeted
Untargeted attacks ascend the loss on the true label $y$; targeted attacks instead descend the loss on a chosen target label $y_t$ — same procedure with the sign flipped and the label swapped.
Defense: Adversarial Training (Madry et al.)
Instead of minimizing the expected loss, minimize the robust (worst-case) expected loss:

$$\min_\theta \ \mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ \max_{\|\delta\|_\infty \le \epsilon} \mathcal{L}\big(f_\theta(x + \delta), y\big) \right]$$
This is a saddle-point (min-max) problem. To be a good defender you have to be a good attacker.
Intuition
Instead of training on the clean point $x$, train on the worst point inside its $\epsilon$-ball. You’re flattening the loss landscape around each training example, so no small perturbation can find a loss spike. The inner max finds the worst nearby example; the outer min teaches the model to handle it.
Training loop:
- Sample a minibatch $\{(x_i, y_i)\}$
- For each $x_i$, run PGD to compute $\delta_i^* \approx \arg\max_{\|\delta\|_\infty \le \epsilon} \mathcal{L}\big(f_\theta(x_i + \delta), y_i\big)$
- Gradient step on the perturbed batch: $\theta \leftarrow \theta - \eta \, \nabla_\theta \sum_i \mathcal{L}\big(f_\theta(x_i + \delta_i^*), y_i\big)$
- Repeat
Expensive: each training step now runs a full PGD inner loop.
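The loop above can be sketched end to end on a toy problem; the data, model, and hyperparameters below are invented for illustration (real implementations run PGD inside an autodiff framework).

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_x(x, w, y):            # gradient of BCE loss w.r.t. the input
    return (sigmoid(w @ x) - y) * w

def grad_w(x, w, y):            # gradient of BCE loss w.r.t. the weights
    return (sigmoid(w @ x) - y) * x

def pgd_attack(x, w, y, eps=0.1, alpha=0.02, steps=10):
    # Inner max: find a near-worst-case point inside the eps-ball.
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_x(x_adv, w, y))
        x_adv = x + np.clip(x_adv - x, -eps, eps)
    return x_adv

# Two separable blobs: label 1 around (+1, +1), label 0 around (-1, -1).
X = np.vstack([rng.normal(+1, 0.3, (50, 2)), rng.normal(-1, 0.3, (50, 2))])
Y = np.array([1] * 50 + [0] * 50)

w = np.zeros(2)
for _ in range(50):                        # outer min over theta
    for x, y in zip(X, Y):
        x_adv = pgd_attack(x, w, y)        # inner max over delta
        w = w - 0.1 * grad_w(x_adv, w, y)  # SGD step on the worst-case point

acc = np.mean((sigmoid(X @ w) > 0.5) == (Y == 1))
print(acc)   # clean accuracy of the robustly trained model
```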
The State of Play
Attacks remain more effective than defenses. A famous result: 7 out of 9 defenses submitted to ICLR 2018 were broken within a day of being accepted. Certified defenses (randomized smoothing, interval bound propagation) give provable but usually weak guarantees.
Backdoor Attacks
A different threat model: the attacker modifies the training data to plant a trigger. At test time, inputs containing the trigger are misclassified; clean inputs behave normally. Related to data poisoning.
Slides from CS480 lec16.