Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks

BEiT-3 is a general-purpose multimodal foundation model for both vision and vision-language tasks.

🛠️ Steven Gong