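Visual Instruction Tuning (LLaVA)
Project page: https://llava-vl.github.io/
Like PaliGemma, LLaVA pairs a pretrained contrastive image encoder with a language model. LLaVA uses CLIP as the visual encoder, whereas PaliGemma uses SigLIP.
Follow-up paper (LLaVA-1.5, "Improved Baselines with Visual Instruction Tuning"): https://arxiv.org/abs/2310.03744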