From the repo materials, UnifoLM-WMA's world model is a video generation backbone (DynamiCrafter-style diffusion) fine-tuned on Open-X to predict future interaction videos from image + text inputs. An action head ("policy enhancement") then turns those predictions/latents into future robot actions/poses. The README and project page credit DynamiCrafter (video diffusion) for the world model and Diffusion Policy, ACT, and HPT for the action/pose side, which strongly indicates the head is implemented as a policy module in the style of Diffusion Policy (a 1D U-Net denoiser over action sequences) and/or ACT (Action Chunking Transformer), not a VQGAN tokenizer/decoder.
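To make the distinction concrete, below is a minimal sketch (not the repo's actual code) of what a Diffusion Policy-style action head looks like: a conditional 1D U-Net that denoises a chunk of future actions, conditioned on features from the world model (e.g., a pooled video latent). Names and dimensions such as `cond_dim`, `action_dim`, and the horizon of 16 are illustrative assumptions, not values from UnifoLM-WMA.

```python
import torch
import torch.nn as nn


class FiLMResBlock1d(nn.Module):
    """1D residual block whose features are FiLM-modulated by a conditioning vector."""

    def __init__(self, in_ch, out_ch, cond_dim):
        super().__init__()
        self.conv1 = nn.Conv1d(in_ch, out_ch, 3, padding=1)
        self.conv2 = nn.Conv1d(out_ch, out_ch, 3, padding=1)
        self.film = nn.Linear(cond_dim, out_ch * 2)  # per-channel scale and shift
        self.skip = nn.Conv1d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()
        self.act = nn.Mish()

    def forward(self, x, cond):
        h = self.act(self.conv1(x))
        scale, shift = self.film(cond).chunk(2, dim=-1)
        h = h * (1 + scale.unsqueeze(-1)) + shift.unsqueeze(-1)
        h = self.act(self.conv2(h))
        return h + self.skip(x)


class ActionDenoiser1d(nn.Module):
    """Tiny 1D U-Net over an action sequence of shape (B, action_dim, horizon)."""

    def __init__(self, action_dim=7, cond_dim=512, hidden=128):
        super().__init__()
        self.time_mlp = nn.Sequential(nn.Linear(1, hidden), nn.Mish(), nn.Linear(hidden, hidden))
        full_cond = cond_dim + hidden  # world-model feature + diffusion timestep embedding
        self.down1 = FiLMResBlock1d(action_dim, hidden, full_cond)
        self.down2 = FiLMResBlock1d(hidden, hidden * 2, full_cond)
        self.up1 = FiLMResBlock1d(hidden * 2, hidden, full_cond)
        self.out = nn.Conv1d(hidden, action_dim, 1)
        self.pool = nn.MaxPool1d(2)
        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")

    def forward(self, noisy_actions, timestep, world_model_feat):
        # noisy_actions: (B, action_dim, horizon); timestep: (B, 1); world_model_feat: (B, cond_dim)
        cond = torch.cat([world_model_feat, self.time_mlp(timestep)], dim=-1)
        h1 = self.down1(noisy_actions, cond)
        h2 = self.down2(self.pool(h1), cond)
        h = self.up1(self.upsample(h2), cond) + h1
        return self.out(h)  # predicted noise over the action chunk


# Toy usage: denoise a chunk of 16 future 7-DoF actions given a hypothetical
# pooled latent from the video world model.
model = ActionDenoiser1d(action_dim=7, cond_dim=512)
noisy = torch.randn(4, 7, 16)
t = torch.rand(4, 1)
feat = torch.randn(4, 512)
eps_pred = model(noisy, t, feat)  # (4, 7, 16)
```

The point of the sketch is the shape of the interface: the head consumes continuous world-model features and iteratively denoises a continuous action chunk, which is characteristic of Diffusion Policy/ACT-style modules rather than a VQGAN-style discrete tokenizer/decoder.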