Layer Normalization

Layer normalization normalizes across the feature dimension, instead of across the batch dimension as BatchNorm does.
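A minimal sketch of the difference (assuming PyTorch and a batch-first (batch, features) tensor; eps and the affine parameters are left out here to keep the axis comparison clear):

```python
import torch

x = torch.randn(8, 16)  # (batch, features) -- shapes are just an example

# BatchNorm-style: one mean/std per feature, computed across the batch (dim=0)
bn_mean = x.mean(dim=0)                               # shape (16,)
bn_std = x.std(dim=0, unbiased=False)                 # shape (16,)
x_bn = (x - bn_mean) / bn_std

# LayerNorm-style: one mean/std per sample, computed across the features (dim=-1)
ln_mean = x.mean(dim=-1, keepdim=True)                # shape (8, 1)
ln_std = x.std(dim=-1, unbiased=False, keepdim=True)  # shape (8, 1)
x_ln = (x - ln_mean) / ln_std

print(x_bn.mean(dim=0))   # each feature column is ~0
print(x_ln.mean(dim=-1))  # each sample row is ~0
```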

I was first introduced to this by the CS480 instructor at UWaterloo while learning about transformers.

Why layer norm?

LayerNorm was originally proposed as an alternative to Batch Normalization that doesn't depend on batch statistics and works well for recurrent networks.

Why do transformers use layer norm?


In NLP tasks, the batch size is often small and sequence lengths vary, so batch statistics are noisy and batch normalization might not be a good choice. Instead, layer norm is used.

Formula:

$$y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} \cdot \gamma + \beta$$

  • $\gamma$ and $\beta$ are learnable parameters
  • $\mathrm{Var}[x]$ is calculated via the biased estimator, equivalent to torch.var(input, unbiased=False).
  • $\epsilon$ is for numerical stability
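A quick sketch (my own check, with example shapes) that reproduces nn.LayerNorm using exactly this formula:

```python
import torch
import torch.nn as nn

d = 64
x = torch.randn(32, d)                  # (batch, features)

ln = nn.LayerNorm(d)                    # gamma = ln.weight, beta = ln.bias, eps = 1e-5 by default

mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)   # biased variance estimator
y_manual = (x - mean) / torch.sqrt(var + ln.eps) * ln.weight + ln.bias

print(torch.allclose(ln(x), y_manual, atol=1e-5))   # True
```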

This formula is IDENTICAL to BatchNorm lol

We are just doing standardization across features, as opposed to across a batch.

The number of parameters learned remains the same.

LayerNorm operates per token, meaning it normalizes across the D dimensions for each position in the sequence.
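For a transformer-style (batch, seq_len, d_model) tensor, a small sketch (shapes are my own example) showing that each token is normalized independently over its d_model features:

```python
import torch
import torch.nn as nn

B, T, D = 4, 10, 512                 # batch, sequence length, model dim
x = torch.randn(B, T, D)

ln = nn.LayerNorm(D)                 # normalizes over the last dimension only
y = ln(x)

# At initialization (gamma=1, beta=0), every (batch, position) token has its own
# statistics: mean ~0 and std ~1 across the D features
print(y.mean(dim=-1).shape)             # torch.Size([4, 10]) -- one mean per token, all ~0
print(y.std(dim=-1, unbiased=False))    # ~ones of shape (4, 10)
```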

I was confused about this:

Gamma and beta are of size d (the feature dimension), shared across all tokens and batch samples. Just like in batch norm, it's "one gamma per feature."

  • Even though you are normalizing across the embedding dimension, you still learn a $\gamma$ and $\beta$ for each feature.
  • It's not 2 parameters per axis of normalization, but rather 2 parameters per feature dimension (see the shape check after this list).
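A sketch of that shape check (module names are just PyTorch's LayerNorm and BatchNorm1d): both learn 2·d parameters, one gamma and one beta per feature.

```python
import torch.nn as nn

d = 512
ln = nn.LayerNorm(d)
bn = nn.BatchNorm1d(d)

print(ln.weight.shape, ln.bias.shape)   # torch.Size([512]) torch.Size([512])
print(bn.weight.shape, bn.bias.shape)   # torch.Size([512]) torch.Size([512])

# 2 * d learnable parameters in both cases, regardless of batch size or sequence length
print(sum(p.numel() for p in ln.parameters()))  # 1024
print(sum(p.numel() for p in bn.parameters()))  # 1024
```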

The mean and standard deviation are calculated over the last D dimensions, where D is the dimension of normalized_shape.

  • From the PyTorch docs
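For example (shapes taken as illustrative assumptions): passing a single integer normalizes over the last dimension only, while passing a list normalizes jointly over the last D dimensions.

```python
import torch
import torch.nn as nn

# NLP: normalized_shape is the embedding dim, so stats are per token
x = torch.randn(20, 5, 10)              # (batch, seq, embedding)
ln = nn.LayerNorm(10)
print(ln(x).shape)                      # torch.Size([20, 5, 10])

# Images: normalize jointly over the last three dims (C, H, W), one set of stats per image
img = torch.randn(8, 3, 32, 32)         # (batch, C, H, W)
ln_img = nn.LayerNorm([3, 32, 32])
print(ln_img.weight.shape)              # torch.Size([3, 32, 32]) -- gamma is elementwise here
print(ln_img(img).shape)                # torch.Size([8, 3, 32, 32])
```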