N-Gram

The goal is to generate text. We can do this with a probabilistic model of language: assign a probability to the next word given the words that came before it.

Andrej Karpathy was the first to introduce me to this idea.


Example (chain rule over the full history): P("I saw a van") = P("I") × P("saw" | "I") × P("a" | "I saw") × P("van" | "I saw a")
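A minimal sketch of that factorization; the conditional probability values below are placeholders for illustration only, not estimates from any corpus:

```python
# Chain-rule factorization of a sentence probability.
# NOTE: these conditional probabilities are made-up placeholder values,
# only to show how the factors multiply together.
conditionals = [
    0.010,  # P("I")
    0.050,  # P("saw" | "I")
    0.300,  # P("a" | "I saw")
    0.002,  # P("van" | "I saw a")
]

p_sentence = 1.0
for p in conditionals:
    p_sentence *= p

print(p_sentence)  # P("I saw a van")
```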

But how feasible is this?

  • Not really: most long histories never appear in the training data, so their conditional probabilities can't be estimated. Instead we limit the context: the n-gram.
  • Basic idea (Markov assumption): the probability of the next word depends only on the previous (N − 1) words.

  • N = 1: unigram model, where each word is predicted with no context at all
  • N = 2: bigram model, where the next word depends only on the single previous word (see the sketch below)
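A minimal bigram sketch along these lines; the toy corpus and function names are just illustrative, not from any particular reference:

```python
from collections import defaultdict, Counter
import random

# Toy corpus: in practice this would be a large text collection.
corpus = "I saw a van . I saw a cat . a cat saw a van .".split()

# Count bigrams: for each word, count which words follow it.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def bigram_prob(prev, nxt):
    """P(nxt | prev) estimated from counts (no smoothing)."""
    counts = following[prev]
    total = sum(counts.values())
    return counts[nxt] / total if total else 0.0

def sample_next(prev):
    """Sample the next word proportionally to the bigram counts."""
    counts = following[prev]
    words, weights = zip(*counts.items())
    return random.choices(words, weights=weights)[0]

print(bigram_prob("saw", "a"))  # P("a" | "saw") = 1.0 in this toy corpus

# Generation: repeatedly sample the next word given only the previous one.
word = "I"
for _ in range(5):
    word = sample_next(word)
    print(word, end=" ")
```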

State-of-the-art n-gram models rarely go above a 5-gram.

We apply Laplace (add-one) smoothing to the bigram probabilities, so that bigrams never seen in training get a small nonzero probability instead of zero.
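A sketch of add-one smoothing on the same toy setup as above (again, the corpus and names are only illustrative):

```python
from collections import defaultdict, Counter

# Same toy corpus as in the earlier sketch.
corpus = "I saw a van . I saw a cat . a cat saw a van .".split()
vocab = set(corpus)

following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def smoothed_bigram_prob(prev, nxt):
    """P(nxt | prev) with add-one (Laplace) smoothing:
    (count(prev, nxt) + 1) / (count(prev) + |V|)."""
    counts = following[prev]
    return (counts[nxt] + 1) / (sum(counts.values()) + len(vocab))

print(smoothed_bigram_prob("saw", "a"))    # seen bigram: high probability
print(smoothed_bigram_prob("van", "cat"))  # unseen bigram: small but nonzero
```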