DINO

I need to actually understand how DINO works https://github.com/facebookresearch/dino

Takes inspiration from BYOL, but operates with a different similarity-matching loss and uses the exact same architecture for the student and the teacher.

Like you know how object detection feels basically solved?

Resources

https://www.youtube.com/watch?v=oGTasd3cliM

Walkthrough (CS231n 2025 Lec 12)

DINO = “self-DIstillation with NO labels”. No explicit negatives, no contrastive loss — a teacher-student setup where the student learns to match the teacher’s output distribution on different views of the same image.

Architecture

Student $g_{θ_{s}}$ and teacher $g_{θ_{t}}$ — same architecture (ViT in the DINO paper), different parameters.
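Teacher is an EMA of the student: $θ_{t} \leftarrow λ \cdot θ_{t} + (1 - λ) \cdot θ_{s}$, with momentum $λ$ close to 1 (written $λ$ here to keep it distinct from the temperatures $τ_{t}, τ_{s}$ below). No backprop into the teacher: a stop-gradient (sg) is applied on the teacher branch.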
Both produce a $K$ -dimensional output that gets softmaxed into a probability distribution over $K$ “prototypes” (no classes — the $K$ dimensions are just a learned codebook).
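
The EMA rule above can be sketched in a few lines. This is a minimal illustration on plain Python lists, not the repository's implementation: `ema_update`, the flattened parameter vectors, and the momentum value 0.9 are all assumptions chosen to keep the arithmetic easy to follow (in practice the momentum is much closer to 1).

```python
def ema_update(teacher, student, momentum):
    """Return momentum * teacher + (1 - momentum) * student, element-wise."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]

# Hypothetical flattened parameter vectors.
student = [1.0, 2.0, 3.0]
teacher = list(student)      # the teacher starts as a copy of the student

student = [2.0, 2.0, 1.0]    # pretend one optimizer step changed the student
teacher = ema_update(teacher, student, momentum=0.9)
print(teacher)               # each entry moved 10% of the way toward the student
```

The teacher never receives gradients here: it only changes through this averaging step, which is the role the stop-gradient plays in the real training loop.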

Loss

Draw two augmented views $x_{1}, x_{2}$ of the same image. Cross-entropy each way:

$L = \tfrac{1}{2} H(t_{1}, s_{2}) + \tfrac{1}{2} H(t_{2}, s_{1}), \qquad H(t, s) = -t^{T} \log s$

where $t_{i} = \mathrm{softmax}((g_{θ_{t}}(x_{i}) - C) / τ_{t})$ (teacher, centered and sharpened) and $s_{i} = \mathrm{softmax}(g_{θ_{s}}(x_{i}) / τ_{s})$ (student).
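
To make the loss concrete, here is a small worked sketch for a single image with $K = 3$. It uses only the Python standard library; the logit values, the zero center, and the helper names (`softmax`, `cross_entropy`, `dino_loss`) are illustrative assumptions, not taken from the DINO code base.

```python
import math

def softmax(logits, temperature):
    """Temperature-scaled softmax with max-subtraction for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(t, s):
    """H(t, s) = -sum_k t_k * log(s_k) for two probability vectors."""
    return -sum(tk * math.log(sk) for tk, sk in zip(t, s))

def dino_loss(teacher_out, student_out, center, temp_t, temp_s):
    """Symmetric loss over two views; each *_out holds one logit vector per view."""
    t = [softmax([z - c for z, c in zip(out, center)], temp_t) for out in teacher_out]
    s = [softmax(out, temp_s) for out in student_out]
    # Teacher on view 1 supervises student on view 2, and vice versa.
    return 0.5 * cross_entropy(t[0], s[1]) + 0.5 * cross_entropy(t[1], s[0])

# Hypothetical K = 3 logits for views x1 and x2 of one image.
teacher_out = [[2.0, 0.5, -1.0], [1.8, 0.7, -0.9]]
student_out = [[1.0, 0.2, -0.5], [0.9, 0.4, -0.6]]
center = [0.0, 0.0, 0.0]

loss = dino_loss(teacher_out, student_out, center, temp_t=0.04, temp_s=0.1)
print(f"loss = {loss:.4f}")
```

Note that the cross terms are the whole point: the student never gets to match the teacher on the same view, so it has to produce outputs that are stable across augmentations.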

Why it doesn’t collapse

Without care, the student can collapse to outputting a constant (matching a constant teacher trivially). DINO prevents this with two tricks on the teacher:

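Centering. Subtract a running mean $C$ from the teacher logits, updated as $C \leftarrow m \cdot C + (1 - m) \cdot \mathrm{mean}([g_{θ_{t}}(x_{1}), g_{θ_{t}}(x_{2})])$, where the mean is taken over the batch of raw teacher outputs (before the softmax). Prevents any single dimension from dominating.
Sharpening. Use a low teacher temperature $τ_{t} < τ_{s}$ (the paper uses $τ_{s} = 0.1$ and warms $τ_{t}$ up from 0.04 to 0.07). Makes the teacher’s output peaked, which pushes the student toward sharper, non-uniform distributions.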

Centering alone → uniform collapse. Sharpening alone → one-dim collapse. Together they balance.
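
The opposing effects of the two tricks can be seen on a toy batch. The sketch below is only a static illustration on made-up logits, not a simulation of training dynamics: the batch values are hypothetical, and the batch mean is used as a stand-in for the running center $C$.

```python
import math

def softmax(logits, temperature):
    """Temperature-scaled softmax with max-subtraction for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0.0)

# Hypothetical teacher logits: 3 images, K = 4, dimension 0 dominates every row.
batch = [[5.0, 1.0, 0.5, 0.0],
         [5.2, 0.2, 1.1, 0.1],
         [4.9, 0.4, 0.3, 1.2]]
K = len(batch[0])
# Assumption: the batch mean stands in for the running center C.
center = [sum(row[k] for row in batch) / len(batch) for k in range(K)]

def teacher_summary(use_center, temperature):
    """Return (argmax per image, mean entropy) of the teacher distributions."""
    winners, total_entropy = [], 0.0
    for row in batch:
        logits = [z - c for z, c in zip(row, center)] if use_center else row
        p = softmax(logits, temperature)
        winners.append(p.index(max(p)))
        total_entropy += entropy(p)
    return winners, total_entropy / len(batch)

print("uniform entropy :", math.log(K))
print("sharpen only    :", teacher_summary(False, 0.04))
print("center only     :", teacher_summary(True, 1.0))
print("center + sharpen:", teacher_summary(True, 0.04))
```

On this toy batch, sharpening alone assigns every image to dimension 0 with near-zero entropy, centering alone leaves the distributions close to uniform, and the combination gives confident assignments that differ from image to image.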

Pseudocode (Lec 12 slide 105)

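# gs, gt: student and teacher networks
# C: center (K);  tps, tpt: student and teacher temperatures
# l, m: network and center momentum rates
gt.params = gs.params                             # teacher starts as a copy of the student
for x in loader:                                  # x: minibatch of n images
    x1, x2 = augment(x), augment(x)               # two random views
    s1, s2 = gs(x1), gs(x2)                       # student outputs, n-by-K
    t1, t2 = gt(x1), gt(x2)                       # teacher outputs, n-by-K
    loss = H(t1, s2)/2 + H(t2, s1)/2
    loss.backward()                               # gradients reach the student only
    update(gs)                                    # SGD
    gt.params = l * gt.params + (1-l) * gs.params # EMA
    C = m * C + (1-m) * cat([t1, t2]).mean(dim=0) # center, from raw teacher outputs

def H(t, s):
    t = t.detach()                         # stop-grad
    s = softmax(s / tps, dim=1)
    t = softmax((t - C) / tpt, dim=1)      # center + sharpen
    return -(t * log(s)).sum(dim=1).mean()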

Emergent property: unsupervised segmentation

The headline result: the [CLS] token’s self-attention in the last layer of a DINO-trained ViT produces clean object segmentation masks without any segmentation supervision. With 8×8 patches, the attention heads of a DINO-trained ViT closely follow the salient object in the scene. Supervised ViTs do not show this nearly as clearly.
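
One simple way to turn such an attention map into a binary mask is to keep the highest-attention patches until they cover a fixed share of the total attention mass. The sketch below assumes the [CLS]-to-patch attention of one head has already been extracted as a flat list; the 3×3 values, the function name `attention_mask`, and the 0.6 threshold are illustrative assumptions.

```python
def attention_mask(cls_attention, grid_h, grid_w, keep_mass=0.6):
    """Mark the top-attended patches that together hold `keep_mass` of the attention."""
    total = sum(cls_attention)
    # Patch indices sorted from most to least attended.
    order = sorted(range(len(cls_attention)),
                   key=lambda i: cls_attention[i], reverse=True)
    keep, mass = set(), 0.0
    for i in order:
        if mass >= keep_mass * total:
            break
        keep.add(i)
        mass += cls_attention[i]
    # Reshape the flat selection back into the patch grid (row-major order).
    return [[1 if r * grid_w + c in keep else 0 for c in range(grid_w)]
            for r in range(grid_h)]

# Hypothetical CLS-to-patch attention for a 3x3 patch grid (one head).
attn = [0.02, 0.03, 0.02,
        0.05, 0.40, 0.30,
        0.03, 0.10, 0.05]

mask = attention_mask(attn, 3, 3)
for row in mask:
    print(row)
```

With smaller patches the grid is finer, which is one reason 8×8 patches give visibly sharper masks than 16×16.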

Performance: ViT-Base with DINO → 80.1% top-1 on ImageNet under linear evaluation. Small ViT with DINO → 78.3% top-1 via k-NN classification of frozen features (no linear head needed).
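
For reference, k-NN evaluation on frozen features needs no training at all: each test feature is labeled by a vote among its nearest neighbors in a bank of labeled training features. The sketch below is a simplified, unweighted majority-vote version with cosine similarity; the 2-D features, labels, and `k = 3` are made up for illustration and are not the paper's evaluation protocol.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn_predict(query, bank, labels, k=3):
    """Label `query` by majority vote among its k most similar bank features."""
    ranked = sorted(range(len(bank)),
                    key=lambda i: cosine(query, bank[i]), reverse=True)
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical frozen features (2-D for readability) and their labels.
bank = [[1.0, 0.1], [0.9, 0.2], [0.8, 0.0],
        [0.1, 1.0], [0.0, 0.9], [0.2, 0.8]]
labels = ["cat", "cat", "cat", "dog", "dog", "dog"]

print(knn_predict([0.95, 0.05], bank, labels))
print(knn_predict([0.05, 0.90], bank, labels))
```

A high k-NN score is a statement about the geometry of the feature space itself: same-class images already sit close together before any classifier is fit.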

DINO v2 (Oquab et al. 2023)

Scaled recipe — bigger model, bigger curated dataset. Key emergent property: patch-level features from DINOv2 cluster into semantic parts across images (PCA of patch features matches “wings”, “body”, “wheels” across different pose/style/object instances).

Source

CS231n 2025 Lec 12 slides ~103–106 (DINO architecture diagram, loss, centering+sharpening, PyTorch pseudocode, DINOv2 PCA figure).

Related

DINOv2

🛠️ Steven Gong
