Sigmoid Loss for Language Image Pre-Training (SigLIP)

SigLIP is an improved version of CLIP which introduces sigmoid-based Contrastive Loss instead of the traditional softmax-based contrastive loss used in CLIP.

From CLIP

Implementation details:

Image Encoder is implemented with a Vision Transformer
Text Encoder is implemented with a Transformer
Sigmoid-based Contrastive Loss.

It uses separate image and text encoders to generate representations for both modalities.

Resources

https://huggingface.co/docs/transformers/en/model_doc/siglip

🛠️ Steven Gong

Sigmoid Loss for Language Image Pre-Training (SigLIP)

Graph View

Backlinks