Vision Transformer

Adaptive Layer Normalization (AdaLN)

AdaLN (adaptive layer normalization) is a conditioning technique for transformer blocks: instead of learning fixed scale and shift parameters in LayerNorm, the scale and shift are regressed from a conditioning signal (e.g., a timestep or class embedding), so the normalization adapts to the condition.

Used in the pi0 paper and DiT.

Links:

Normally, when you add a new input modality (e.g., image embeddings, action tokens, etc.) into a pretrained LLM backbone, you need to inject this new information in a way that doesn’t disrupt the pretrained distribution too much.

adaLN-Zero takes this a step further: the small network that predicts the scale and shift from the conditioning signal has its output layer initialized to zero, so at initialization the predicted modulation is:

  • Scale = 0
  • Shift = 0

This means that at the very start of training, the new conditioning path contributes nothing to the model's computation: the modulated LayerNorm reduces to a plain LayerNorm, and the model behaves like the pretrained LLM. As training proceeds, the zero-initialized layer learns nonzero outputs and the conditioning gradually takes effect.
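The identity-at-initialization property can be sketched in a few lines of NumPy. This is a minimal illustration, not DiT's actual implementation: the names `AdaLNZero` and `layer_norm` are mine, and I use the common `x * (1 + scale) + shift` modulation convention so that a zero-initialized predictor leaves the input unchanged.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Plain LayerNorm without learned affine parameters.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

class AdaLNZero:
    """Sketch of adaLN-Zero: scale/shift regressed from conditioning c.

    The linear layer mapping c -> (scale, shift) is zero-initialized,
    so at the start of training the modulation is the identity.
    """
    def __init__(self, dim, cond_dim):
        # Zero init: predicted scale = 0 and shift = 0 for any c.
        self.W = np.zeros((cond_dim, 2 * dim))
        self.b = np.zeros(2 * dim)

    def __call__(self, x, c):
        scale, shift = np.split(c @ self.W + self.b, 2, axis=-1)
        # (1 + scale) convention: zero scale means multiply by 1.
        return layer_norm(x) * (1 + scale) + shift
```

At initialization, `AdaLNZero(dim, cond_dim)(x, c)` returns exactly `layer_norm(x)` for any conditioning vector `c`, which is the "zero effect" behavior described above.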