Layer Normalization
Layer normalization applies normalization across the feature dimension, instead of across the batch dimension as in BatchNorm.
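A minimal sketch of the difference using plain tensor ops (the tensor sizes here are just for illustration):

```python
import torch

x = torch.randn(4, 8)  # (batch, features)

# BatchNorm-style: per-feature statistics, computed across the batch (dim 0)
bn_mean = x.mean(dim=0, keepdim=True)                   # shape (1, 8)
bn_var = x.var(dim=0, unbiased=False, keepdim=True)

# LayerNorm-style: per-sample statistics, computed across the features (dim -1)
ln_mean = x.mean(dim=-1, keepdim=True)                  # shape (4, 1)
ln_var = x.var(dim=-1, unbiased=False, keepdim=True)

x_bn = (x - bn_mean) / torch.sqrt(bn_var + 1e-5)  # each column ~ zero mean, unit var
x_ln = (x - ln_mean) / torch.sqrt(ln_var + 1e-5)  # each row   ~ zero mean, unit var
```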
First encountered via the CS480 instructor at UWaterloo while learning about transformers.
Why layer norm?
LayerNorm was originally proposed as an alternative to Batch Normalization that doesn't depend on batch statistics and works well for recurrent networks.
Why do transformers use layer norm?
Resources
- Original paper: Layer Normalization
- Pytorch https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html
- https://www.pinecone.io/learn/batch-layer-normalization/
In NLP tasks the batch size is often very small, so batch statistics are noisy and batch normalization is not a good choice; layer norm is used instead.
Formula:
$$y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} \cdot \gamma + \beta$$
- $\gamma$ and $\beta$ are learnable parameters
- $\mathrm{Var}[x]$ is calculated via the biased estimator, equivalent to torch.var(input, unbiased=False)
- $\epsilon$ is for numerical stability
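A quick sanity check (a sketch, assuming the default initialization where gamma = 1 and beta = 0) that this formula matches torch.nn.LayerNorm:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 5, 16)        # (batch, tokens, features)
ln = nn.LayerNorm(x.shape[-1])   # eps defaults to 1e-5

# Manual version of the formula above, normalizing over the last dimension
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)   # biased estimator
y_manual = (x - mean) / torch.sqrt(var + ln.eps) * ln.weight + ln.bias

assert torch.allclose(ln(x), y_manual, atol=1e-5)
```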
This formula is IDENTICAL to BatchNorm lol
We are just doing standardization across features, as opposed to across a batch.
The number of learned parameters remains the same.
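One way to confirm the parameter counts match (a sketch assuming a feature dimension of 512; BatchNorm's running statistics are buffers, not parameters):

```python
import torch.nn as nn

D = 512
bn = nn.BatchNorm1d(D)
ln = nn.LayerNorm(D)

# Both modules learn one gamma and one beta per feature: 2 * D parameters
print(sum(p.numel() for p in bn.parameters()))  # 1024
print(sum(p.numel() for p in ln.parameters()))  # 1024
```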
LayerNorm operates per token, meaning it normalizes across the D dimensions for each position in the sequence.
I was confused about this
gamma and beta are of size D (the feature dimension), shared across all tokens/batch samples, just like in batch norm: it's "one gamma per feature."
- Even though you are normalizing across the embedding dimension, you will learn a gamma and a beta for each feature.
- It's not 2 parameters per axis of normalization, but rather 2 parameters per feature dimension (see the shape sketch below).
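A shape sketch of the per-token behaviour (the sizes B, T, D below are made up for illustration):

```python
import torch
import torch.nn as nn

B, T, D = 2, 10, 64              # batch, sequence length, feature dim
x = torch.randn(B, T, D)
ln = nn.LayerNorm(D)

# One gamma and one beta per feature, shared across all tokens and batch samples
print(ln.weight.shape, ln.bias.shape)   # torch.Size([64]) torch.Size([64])

# Each (batch, token) position is normalized independently over its D features;
# with the default init (gamma = 1, beta = 0) each position ends up ~N(0, 1)
y = ln(x)
print(y[0, 0].mean().item(), y[0, 0].var(unbiased=False).item())  # ~0.0, ~1.0
```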
The mean and standard deviation are calculated over the last D dimensions, where D is the dimension of normalized_shape.
- From the PyTorch docs linked above
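A small example of the "last D dimensions" behaviour (here D = 2, normalizing over the trailing H and W dims of an image-like tensor; the shapes are just illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(8, 3, 32, 32)    # (N, C, H, W)
ln = nn.LayerNorm([32, 32])      # normalized_shape spans the last two dims

print(ln(x).shape)       # torch.Size([8, 3, 32, 32]) -- shape is unchanged
print(ln.weight.shape)   # torch.Size([32, 32]) -- one gamma per normalized element
```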