Contrastive Language-Image Pre-Training (CLIP)
CLIP was trained on roughly 400 million image-caption pairs collected from the internet.
Resources:
- https://arxiv.org/pdf/2103.00020
- https://medium.com/one-minute-machine-learning/clip-paper-explained-easily-in-3-levels-of-detail-61959814ad13
CLIP is a model that scores how well a given image and a given text caption match: it embeds both into a shared space, and the cosine similarity between the two embeddings serves as the score.
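As a quick illustration, the snippet below is a minimal sketch of using a released CLIP checkpoint to rank a few candidate captions for one image. It assumes the Hugging Face `transformers` and `Pillow` packages are installed and that a local image file named `cat.jpg` exists (that filename is just a placeholder).

```python
# Minimal sketch: score image-caption fit with a pretrained CLIP checkpoint.
# Assumes `transformers` and `Pillow` are installed and `cat.jpg` is a local image.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")  # placeholder path to any local image
captions = ["a photo of a cat", "a photo of a dog", "a diagram of an engine"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds the (temperature-scaled) cosine similarity between
# the image and each caption; softmax turns them into relative probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{p:.3f}  {caption}")
```

The caption with the highest score is the one CLIP judges to fit the image best, which is also how CLIP performs zero-shot classification: the candidate captions act as class labels.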