Differential Privacy

Differential Privacy (DP) is a formal, worst-case guarantee that a randomized algorithm’s output distribution barely changes when any single data point is added or removed.

The de facto standard for private data analysis, adopted by the US Census, Apple, Google, and others. Informally: “The outcome is essentially the same whether or not your data was in the dataset.” Introduced by Dwork, McSherry, Nissim, Smith (2006).

Why a worst-case guarantee?

Anonymization keeps failing because auxiliary data re-identifies individuals. DP gives a mathematical guarantee that no such side channel exists.

Why We Need It: Training Data Leaks

  • Netflix Prize (2006–2009): anonymized (user, movie, rating, date) tuples were re-identified by matching against public IMDb profiles. Led to a class-action lawsuit and cancellation of the sequel
  • Large models memorize: LLMs can be coerced to reproduce training-set strings verbatim (Carlini et al. 2021). Diffusion models memorize training images (Carlini et al. 2023)
  • Implications: personal information leakage, copyright (Getty Images vs. Stability AI)

Formal Definition

A randomized algorithm $\mathcal{M}$ is $(\varepsilon, \delta)$-differentially private if, for all neighboring datasets $D, D'$ (differing on one entry), and all output sets $S$:

$$\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \Pr[\mathcal{M}(D') \in S] + \delta$$

where:

  • $\varepsilon \in [0.1, 10]$ and $\delta \ll 1/n$ are typical
  • Worst-case guarantee over datasets and outputs
  • Symmetric: swap $D$ and $D'$ for the same bound
  • $\mathcal{M}$ must be randomized; no deterministic algorithm is DP
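As a concrete instance of the definition, randomized response (a classic mechanism not specific to these slides) reports a sensitive bit truthfully with probability 3/4 and flips it otherwise; the worst-case output ratio is 3, so it satisfies $(\ln 3, 0)$-DP. A minimal sketch:

```python
import math
import random

def randomized_response(true_bit: int, rng: random.Random) -> int:
    """Report the true bit w.p. 3/4, the flipped bit w.p. 1/4."""
    return true_bit if rng.random() < 0.75 else 1 - true_bit

# Worst case over outputs: P[report 1 | bit=1] / P[report 1 | bit=0] = 0.75 / 0.25 = 3,
# so the mechanism is (ln 3, 0)-DP -- no delta slack needed.
epsilon = math.log(0.75 / 0.25)

# Empirically check the two output distributions.
rng = random.Random(0)
n = 100_000
p_report1_given1 = sum(randomized_response(1, rng) for _ in range(n)) / n
p_report1_given0 = sum(randomized_response(0, rng) for _ in range(n)) / n
```

Neighboring "datasets" here are the two values of a single person's bit, which makes the worst-case ratio easy to read off directly.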

Intuition

$\varepsilon$ caps how much one person’s data can shift the output distribution: on a log scale, the ratio of “you’re in” vs “you’re out” probabilities is bounded by $\varepsilon$. $\delta$ is the slack probability that the bound fails entirely, typically set tiny (smaller than $1/n$ so it can’t encode “leak one random person”). Smaller $\varepsilon$ means stronger privacy and noisier output; $\varepsilon = \infty$ recovers the non-private algorithm.

A deterministic $\mathcal{M}$ can’t be DP: if $\mathcal{M}(D) \neq \mathcal{M}(D')$ on any neighboring pair, one output has probability 1 and the other 0, and no finite $\varepsilon$ bounds the ratio.

What DP Does and Doesn’t Mean

  • Protects against membership inference and linkage attacks
  • Preserves aggregate statistics (“smoking causes cancer” still provable)
  • Information-theoretic, doesn’t assume bounded adversary compute
  • Not a guarantee for tasks that need to identify a specific individual
  • Doesn’t prevent group-level inferences about your demographic

Core Properties

Post-processing. If $\mathcal{M}$ is $(\varepsilon, \delta)$-DP, then $f \circ \mathcal{M}$ is also $(\varepsilon, \delta)$-DP for any function $f$. You can’t “un-privatize” through processing. Once information is squeezed through the noisy bottleneck, no downstream computation (not even an adversary’s best inference) can recover more than the released output already carries.

Group privacy. If $D$ and $D'$ differ in $k$ entries:

$$\Pr[\mathcal{M}(D) \in S] \;\le\; e^{k\varepsilon} \Pr[\mathcal{M}(D') \in S] + k e^{(k-1)\varepsilon} \delta$$
Composition. Given $\mathcal{M}_1, \dots, \mathcal{M}_k$, each $(\varepsilon, \delta)$-DP:

  • Basic: the composition $(\mathcal{M}_1, \dots, \mathcal{M}_k)$ is $(k\varepsilon, k\delta)$-DP
  • Advanced composition: $\big(\varepsilon\sqrt{2k\ln(1/\delta')} + k\varepsilon(e^{\varepsilon}-1),\; k\delta + \delta'\big)$-DP for any $\delta' > 0$, much tighter for large $k$

Every query you run leaks a little. Basic composition is the pessimistic “worst-case every time” sum. Advanced composition notices that privacy losses are random (some queries barely leak), so the expected total grows like $\sqrt{k}$, not $k$. This is what makes DP-SGD with thousands of steps viable.
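The two composition bounds above are easy to compare numerically; the parameter values below are illustrative, not from the slides:

```python
import math

def basic_epsilon(eps: float, k: int) -> float:
    """Basic composition: k releases of an eps-DP mechanism cost k*eps."""
    return k * eps

def advanced_epsilon(eps: float, k: int, delta_prime: float) -> float:
    """Advanced composition: eps*sqrt(2k ln(1/delta')) + k*eps*(e^eps - 1)."""
    return (eps * math.sqrt(2 * k * math.log(1 / delta_prime))
            + k * eps * (math.exp(eps) - 1))

eps, k, delta_prime = 0.1, 1000, 1e-5
basic = basic_epsilon(eps, k)          # 100.0
adv = advanced_epsilon(eps, k, delta_prime)  # ~25.7: sqrt(k) growth dominates
```

For small per-query $\varepsilon$ the $\sqrt{k}$ term dominates and the advanced bound is roughly a factor of four tighter here, at the cost of an extra $\delta'$ of failure probability.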

Gaussian Mechanism

The workhorse for numerical queries. Define the $\ell_2$ sensitivity of a function $f$:

$$\Delta_2 f = \max_{D \sim D'} \|f(D) - f(D')\|_2$$

(If $\|f(D)\|_2 \le C$ always, then $\Delta_2 f \le 2C$.)

Sensitivity is “how much can one person change the answer.” A bounded function has bounded sensitivity, so swapping one person’s data moves the output by at most $\Delta_2 f$. This is why DP-SGD clips per-example gradients to norm $C$: unbounded gradients would mean unbounded sensitivity, and no amount of noise suffices.
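A minimal sketch of the clipping idea (names and constants here are illustrative): scaling any vector down to $\ell_2$ norm at most $C$ guarantees the per-example contribution is bounded, no matter how extreme the data point is.

```python
import numpy as np

def clip_l2(g: np.ndarray, C: float) -> np.ndarray:
    """Scale g so its L2 norm is at most C (the per-example clip in DP-SGD)."""
    norm = np.linalg.norm(g)
    if norm == 0.0:
        return g
    return g * min(1.0, C / norm)

C = 1.0
rng = np.random.default_rng(0)
outlier_grad = rng.normal(size=10) * 100.0   # a huge "unusual example" gradient
clipped = clip_l2(outlier_grad, C)

# Because each clipped gradient has norm <= C, swapping one example changes a
# SUM of clipped gradients by at most 2C -- bounded sensitivity by construction.
```

Small gradients pass through unchanged; only the outliers (exactly the ones that would leak the most) get rescaled.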

Mechanism:

$$\mathcal{M}(D) = f(D) + \mathcal{N}\big(0,\; \sigma^2 (\Delta_2 f)^2 I\big)$$

where:

  • $f$ is the target query
  • $\mathcal{N}(0, \sigma^2 (\Delta_2 f)^2 I)$ is per-coordinate Gaussian noise scaled to sensitivity

This is $(\varepsilon, \delta)$-DP for $\sigma \ge \sqrt{2\ln(1.25/\delta)}/\varepsilon$ (for $\varepsilon < 1$).
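A minimal sketch of the Gaussian mechanism for a scalar query, using the classical calibration $\sigma = \Delta_2 f \cdot \sqrt{2\ln(1.25/\delta)}/\varepsilon$ (valid for $\varepsilon < 1$); the dataset and query below are illustrative:

```python
import math
import numpy as np

def gaussian_mechanism(true_answer: float, sensitivity: float,
                       eps: float, delta: float, rng) -> float:
    """Release true_answer + N(0, sigma^2), sigma calibrated to (eps, delta)."""
    sigma = sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / eps
    return true_answer + rng.normal(0.0, sigma)

rng = np.random.default_rng(0)
# Private mean of n values in [0, 1]: changing one person moves the mean by <= 1/n,
# so the sensitivity is 1/n and the needed noise shrinks as n grows.
data = rng.random(10_000)
noisy_mean = gaussian_mechanism(data.mean(), sensitivity=1.0 / len(data),
                                eps=0.5, delta=1e-5, rng=rng)
```

Note how the error is driven by sensitivity, not by the data’s scale: with $n = 10{,}000$ bounded values, the added noise has standard deviation around $10^{-3}$, so the private mean is still accurate.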

Intuition

The true answer could move by up to $\Delta_2 f$ if any single person’s data changes. Adding Gaussian noise of that same scale means “the signal from one person is drowned out by the noise.” Scaling by $1/\varepsilon$ tunes the tradeoff: tighter privacy demands louder noise.

Why Gaussian specifically?

Its tails match the $\varepsilon$-vs-$\delta$ tradeoff and it composes cleanly via moments accounting. Laplace noise gives pure $\varepsilon$-DP but composes worse for many queries.

DP-SGD

Standard SGD leaks information about individual training points through its gradient updates. A large gradient on one example says “this point is unusual and here’s roughly how.” DP-SGD makes training differentially private by clipping each per-example gradient (bounding one person’s influence) and then adding Gaussian noise (hiding the rest):

  1. Sample a “lot” of points by including each independently with probability $q$ (Poisson sampling)
  2. Compute the per-example gradient for each point and clip its $\ell_2$ norm to $C$
  3. Average the clipped gradients and add Gaussian noise $\mathcal{N}(0, \sigma^2 C^2 I)$
  4. Take a gradient step on the noisy average
  5. Repeat

References: Song-Chaudhuri-Sarwate ‘13, Bassily-Smith-Thakurta ‘14, Abadi et al. ‘16.

Privacy Accounting

  • Subsampling amplification: each step is roughly $(q\varepsilon, q\delta)$-DP instead of $(\varepsilon, \delta)$-DP; sampling dilutes the privacy cost
  • $T$ steps by advanced composition: roughly $\big(O(q\varepsilon\sqrt{T\ln(1/\delta)}),\; Tq\delta\big)$-DP
  • Moments accountant (Abadi et al. ‘16): tighter analysis that tracks privacy loss moments, standard in modern DP-SGD implementations
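The subsampled advanced-composition scaling can be turned into a back-of-envelope calculator (the numbers below are illustrative; a real moments accountant gives tighter constants):

```python
import math

def dp_sgd_epsilon_estimate(q: float, step_eps: float, T: int, delta: float) -> float:
    """Rough total epsilon for DP-SGD: each step costs ~q*step_eps, and T steps
    compose to ~q*step_eps*sqrt(2*T*ln(1/delta)). Scaling only, not tight."""
    return q * step_eps * math.sqrt(2 * T * math.log(1 / delta))

# e.g. sampling rate q=0.01, per-step eps=1.2, T=10,000 steps, delta=1e-5
total_eps = dp_sgd_epsilon_estimate(q=0.01, step_eps=1.2, T=10_000, delta=1e-5)
```

The point of the exercise: naive per-step costs of 1.2 over 10,000 steps would be astronomical, but subsampling (factor $q$) plus $\sqrt{T}$ composition brings the total down to single digits, which is why the accounting matters as much as the mechanism.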

Practical Performance

  • CIFAR-10: 99% non-private → 69% private (big drop). Private vision is hard from scratch
  • Public pretraining helps enormously: fine-tune a public model privately rather than training from scratch
  • LLM fine-tuning: RoBERTa-Large at single-digit $\varepsilon$ loses only ~3% on GLUE vs non-private (Yu et al. 2022)
  • Counterintuitive: bigger models often do better privately, despite the noise norm scaling with the parameter count $d$. Private fine-tuning benefits from large-scale pretraining. The public pretraining does the hard representation-learning work; the private fine-tune only has to nudge a small, well-conditioned subspace, so the noise budget goes further

Slides from CS480 lec17.