Learning Transferable Visual Models From Natural Language Supervision (CLIP)

CLIP was trained on roughly 400 million image–caption pairs scraped from the internet.

Implementation details:

  • Image encoder: ResNet-50 or a Vision Transformer
  • Text encoder: a Transformer
  • Trained with a symmetric contrastive loss

Pseudocode

# image_encoder - ResNet or Vision Transformer
# text_encoder  - CBOW or Text Transformer
 
# I[n, h, w, c]   - minibatch of aligned images
# T[n, l]         - minibatch of aligned texts
# W_i[d_i, d_e]   - learned projection of image to embedding
# W_t[d_t, d_e]   - learned projection of text to embedding
# t               - learned temperature parameter
 
# extract feature representations of each modality
I_f = image_encoder(I)  # [n, d_i]
T_f = text_encoder(T)   # [n, d_t]
 
# joint multimodal embedding [n, d_e]
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)
 
# scaled pairwise cosine similarities [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t)
 
# symmetric loss function
labels = np.arange(n)
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss   = (loss_i + loss_t) / 2
 
# Figure 3. Numpy-like pseudocode for the core of an implementation of CLIP
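
For reference, here is a minimal runnable PyTorch sketch of the same computation (not the official implementation; I_f, T_f, W_i, W_t, t are assumed to be tensors shaped as in the pseudocode above, with t the learned log-temperature as a scalar tensor):

import torch
import torch.nn.functional as F

def clip_contrastive_loss(I_f, T_f, W_i, W_t, t):
    # project each modality into the joint embedding space and L2-normalize
    I_e = F.normalize(I_f @ W_i, dim=1)          # [n, d_e]
    T_e = F.normalize(T_f @ W_t, dim=1)          # [n, d_e]
    # scaled pairwise cosine similarities [n, n]
    logits = (I_e @ T_e.T) * t.exp()
    # the i-th image belongs with the i-th text, so the targets are the diagonal
    labels = torch.arange(logits.size(0), device=logits.device)
    loss_i = F.cross_entropy(logits, labels)     # for each image, classify the matching text (rows)
    loss_t = F.cross_entropy(logits.T, labels)   # for each text, classify the matching image (columns)
    return (loss_i + loss_t) / 2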

“At test time the learned text encoder synthesizes a zero-shot linear classifier by embedding the names or descriptions of the target dataset’s classes.”

  • In other words: embed every class name (or description) with the text encoder, compare each text embedding against the image embedding, and predict the class whose text embedding scores highest.

Resources:

CLIP is a model that scores how well a given image and a given text caption match.

  1. Blue - the initial attempt: a CNN image encoder plus a transformer decoder trained to predict the exact caption of each image
  2. Orange - predicting a bag-of-words representation of the caption instead (see Bag of Tricks for Efficient Text Classification); at eval time the model is shown all candidate classes and has to pick the correct one
  3. Green - a CNN image encoder plus a transformer text encoder trained with the contrastive loss (the final CLIP setup), which is markedly more training-efficient

Follow up

Walkthrough (CS231n 2025 Lec 16)

Symmetric InfoNCE

Per batch of n image–text pairs with image embeddings I_e and text embeddings T_e (both L2-normalized), the loss is a symmetric cross-entropy over the rows and columns of the n×n similarity matrix: the diagonal entries are the positive pairs; everything off-diagonal is an in-batch negative. Trained on 400M pairs (vs. the 1.28M labeled ImageNet images used for ResNet-50).
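
Written out (a sketch in the notation of the pseudocode above, where s_{ij} is the scaled cosine similarity between image i and text j):

\mathcal{L} = \frac{1}{2n}\sum_{i=1}^{n}\left[-\log\frac{\exp(s_{ii})}{\sum_{j}\exp(s_{ij})} \;-\; \log\frac{\exp(s_{ii})}{\sum_{j}\exp(s_{ji})}\right],
\qquad s_{ij} = \big(I_e^{(i)} \cdot T_e^{(j)}\big)\, e^{t}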

Zero-shot trick

At test time, embed every class name through the text encoder (“A photo of a {class}”), then 1-NN against the image embedding. The text encoder is the classifier — no per-task head, no fine-tuning. Prompt engineering matters:

  • “A photo of a {class}” alone: +1.3% vs raw class name
  • Mean of multiple prompts (“A photo of a big {class}”, “A photo of a small {class}”, …): +5%
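
A minimal sketch of this zero-shot procedure with prompt ensembling (here encode_text(list_of_strings) and encode_image(images) are stand-ins for CLIP's text and image encoders, assumed to return embeddings of shape [k, d_e] and [batch, d_e]; the templates are illustrative):

import torch
import torch.nn.functional as F

def build_zero_shot_classifier(class_names, encode_text,
                               templates=("a photo of a {}.",
                                          "a photo of a big {}.",
                                          "a photo of a small {}.")):
    # one weight vector per class: embed each filled-in prompt, normalize,
    # average the prompt variants, and normalize again ("prompt ensembling")
    weights = []
    for name in class_names:
        prompts = [tpl.format(name) for tpl in templates]
        emb = F.normalize(encode_text(prompts), dim=-1)      # [num_templates, d_e]
        weights.append(F.normalize(emb.mean(dim=0), dim=-1))
    return torch.stack(weights)                              # [num_classes, d_e]

def zero_shot_predict(images, encode_image, class_weights):
    img_emb = F.normalize(encode_image(images), dim=-1)      # [batch, d_e]
    sims = img_emb @ class_weights.T                         # cosine similarities [batch, num_classes]
    return sims.argmax(dim=-1)                               # nearest class-name embedding wins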

OOD robustness (the main result)

ResNet101 and CLIP-ViT both hit 76.2% on ImageNet, but diverge sharply on distribution shift:

Dataset                 ResNet101 (%)   CLIP (%)
ImageNet                76.2            76.2
ImageNet V2             64.3            70.1
ImageNet Rendition      37.7            88.9
ObjectNet               32.6            72.3
ImageNet Sketch         25.2            60.2
ImageNet Adversarial     2.7            77.1

Same in-distribution accuracy, dramatically better OOD. Linear-probe CLIP averages ~85% over 27 datasets.
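
The linear-probe recipe behind that number: freeze CLIP, extract image embeddings for the downstream dataset, and fit a logistic-regression classifier on them. A sketch with scikit-learn (train_feats / test_feats and the labels are assumed to be precomputed CLIP image embeddings and dataset labels; C is a placeholder, the paper sweeps it):

import numpy as np
from sklearn.linear_model import LogisticRegression

# train_feats, test_feats: frozen CLIP image embeddings, shape [N, d_e]
# train_labels, test_labels: integer class labels of the downstream dataset
clf = LogisticRegression(C=1.0, max_iter=1000)
clf.fit(train_feats, train_labels)
print("linear-probe accuracy:", clf.score(test_feats, test_labels))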

Scale

ResNet101: 44.5M params. The best CLIP variant, ViT-L/14 at 336px input (CLIP-ViT-L/14@336px): 307M params.

Disadvantages

  • Batch-size dependence for fine-grained classes: “animal” works at batch 4, “dog” needs batch ~100, “Welsh Corgi” needs batch ~32K to see meaningful negatives.
  • Compositionality fails (Winoground / CREPE / ARO / SugarCREPE): CLIP can’t reliably distinguish “a horse eating grass” from “grass eating a horse.” Bag-of-words behavior.
  • NegCLIP collapse: training with hard negatives makes hard positives collapse onto each other.
  • Image-level captions are too coarse to teach object-level grounding.
  • CSAM contamination found in the 5B-scale follow-up datasets (LAION-5B).

CoCa (Contrastive Captioners, 2022)

Immediate follow-up: keep the contrastive base, add a Multimodal Text Decoder with cross-attention and a captioning loss on top. Hits 86.3% ImageNet zero-shot, 91.0% finetuned.

Source

CS231n 2025 Lec 16 slides ~1–110 (CLIP setup, symmetric loss, zero-shot, prompt engineering, OOD comparison, fine-grained / compositionality / CSAM disadvantages, CoCa follow-up).