Multimodal Large Language Model (MLLM): vision-language models (VLMs) fit under this category. The MLLM CVPR 2024 page has a huge reading list: https://mllm2024.github.io/CVPR2024/