They use a VQ-VAE, and then train a model to predict in its latent space.
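
To make "predict in the latent space" concrete, here is a rough sketch of the quantization step and of how prediction targets could be built from it. It assumes the prediction is next-token prediction over codebook indices, which the comment above does not actually confirm. The codebook, the latents, and the `quantize` helper are all made up for illustration and are not taken from their code.

```python
def quantize(latents, codebook):
    """Return, for each latent vector, the index of its nearest codebook entry."""
    def sq_dist(a, b):
        # Squared Euclidean distance between two equal-length vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
        for z in latents
    ]


# Hypothetical 2-D codebook with three entries (a real one is learned).
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)]

# Hypothetical encoder outputs for four consecutive frames or patches.
latents = [(0.1, -0.1), (0.9, 1.2), (-0.8, 0.7), (0.2, 0.1)]

tokens = quantize(latents, codebook)

# Assumed training setup: predict each token from the ones before it.
inputs, targets = tokens[:-1], tokens[1:]
print(tokens, inputs, targets)
```

The point of the discrete codes is that the predictor can be trained with an ordinary classification loss over codebook indices rather than regressing continuous vectors.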

Why didn't they use a VLM instead and do the predictions in its latent space?