Layer Normalization

Layer normalization normalizes across the feature dimension, instead of across the batch dimension as BatchNorm does.
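A minimal sketch of the difference (assuming PyTorch and a batch-first (batch, features) tensor; eps and the affine parameters are left out here to keep the axis comparison clear):

```python
import torch

x = torch.randn(8, 16)  # (batch, features) -- shapes are just an example

# BatchNorm-style: one mean/std per feature, computed across the batch (dim=0)
bn_mean = x.mean(dim=0)                               # shape (16,)
bn_std = x.std(dim=0, unbiased=False)                 # shape (16,)
x_bn = (x - bn_mean) / bn_std

# LayerNorm-style: one mean/std per sample, computed across the features (dim=-1)
ln_mean = x.mean(dim=-1, keepdim=True)                # shape (8, 1)
ln_std = x.std(dim=-1, unbiased=False, keepdim=True)  # shape (8, 1)
x_ln = (x - ln_mean) / ln_std

print(x_bn.mean(dim=0))   # each feature column is ~0
print(x_ln.mean(dim=-1))  # each sample row is ~0
```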

I was first introduced to this by the CS480 instructor at UWaterloo while learning about transformers.

Why layer norm?

LayerNorm was originally proposed as an alternative to Batch Normalization that doesn't depend on batch statistics and works well for recurrent networks.

Why do transformers use layer norm?


In NLP tasks, the batch size is often small and sequence lengths vary, so batch statistics are noisy and batch normalization might not be a good choice. Instead, layer norm is used.

Formula:

$$y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} \cdot \gamma + \beta$$

  • $\gamma$ and $\beta$ are learnable parameters
  • $\mathrm{Var}[x]$ is calculated via the biased estimator, equivalent to torch.var(input, unbiased=False).
  • $\epsilon$ is for numerical stability
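A quick sketch (my own check, with example shapes) that reproduces nn.LayerNorm using exactly this formula:

```python
import torch
import torch.nn as nn

d = 64
x = torch.randn(32, d)                  # (batch, features)

ln = nn.LayerNorm(d)                    # gamma = ln.weight, beta = ln.bias, eps = 1e-5 by default

mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)   # biased variance estimator
y_manual = (x - mean) / torch.sqrt(var + ln.eps) * ln.weight + ln.bias

print(torch.allclose(ln(x), y_manual, atol=1e-5))   # True
```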

This formula is IDENTICAL to BatchNorm lol

We are just doing standardization across features, as opposed to across a batch.

The number of parameters learned remains the same.

LayerNorm operates per token, meaning it normalizes across the D dimensions for each position in the sequence.
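For a transformer-style (batch, seq_len, d_model) tensor, a small sketch (shapes are my own example) showing that each token is normalized independently over its d_model features:

```python
import torch
import torch.nn as nn

B, T, D = 4, 10, 512                 # batch, sequence length, model dim
x = torch.randn(B, T, D)

ln = nn.LayerNorm(D)                 # normalizes over the last dimension only
y = ln(x)

# At initialization (gamma=1, beta=0), every (batch, position) token has its own
# statistics: mean ~0 and std ~1 across the D features
print(y.mean(dim=-1).shape)             # torch.Size([4, 10]) -- one mean per token, all ~0
print(y.std(dim=-1, unbiased=False))    # ~ones of shape (4, 10)
```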

I was confused about this:

Gamma and beta are of size d (the feature dimension), shared across all tokens and batch samples. Just like in batch norm, it's "one gamma per feature."

  • Even though you are normalizing across the embedding dimension, you still learn a $\gamma$ and $\beta$ for each feature.
  • It's not 2 parameters per axis of normalization, but rather 2 parameters per feature dimension (see the shape check after this list).
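A sketch of that shape check (module names are just PyTorch's LayerNorm and BatchNorm1d): both learn 2·d parameters, one gamma and one beta per feature.

```python
import torch.nn as nn

d = 512
ln = nn.LayerNorm(d)
bn = nn.BatchNorm1d(d)

print(ln.weight.shape, ln.bias.shape)   # torch.Size([512]) torch.Size([512])
print(bn.weight.shape, bn.bias.shape)   # torch.Size([512]) torch.Size([512])

# 2 * d learnable parameters in both cases, regardless of batch size or sequence length
print(sum(p.numel() for p in ln.parameters()))  # 1024
print(sum(p.numel() for p in bn.parameters()))  # 1024
```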

The mean and standard deviation are calculated over the last D dimensions, where D is the dimension of normalized_shape.

  • From the PyTorch docs
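For example (shapes taken as illustrative assumptions): passing a single integer normalizes over the last dimension only, while passing a list normalizes jointly over the last D dimensions.

```python
import torch
import torch.nn as nn

# NLP: normalized_shape is the embedding dim, so stats are per token
x = torch.randn(20, 5, 10)              # (batch, seq, embedding)
ln = nn.LayerNorm(10)
print(ln(x).shape)                      # torch.Size([20, 5, 10])

# Images: normalize jointly over the last three dims (C, H, W), one set of stats per image
img = torch.randn(8, 3, 32, 32)         # (batch, C, H, W)
ln_img = nn.LayerNorm([3, 32, 32])
print(ln_img.weight.shape)              # torch.Size([3, 32, 32]) -- gamma is elementwise here
print(ln_img(img).shape)                # torch.Size([8, 3, 32, 32])
```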