N-Gram Models, Perplexity, and the Evolution of Language Modeling

FUNDAMENTALS

Surender Singh

Oct 1, 2025

The Quest for the Next Word

Predicting the future is a notoriously difficult business. But what if we narrow our sights to a simpler, more immediate future? What if we just try to predict the very next word someone is about to say?

Imagine you hear the sentence: "The sky over Saptrishi Dumbeldor today is so beautifully..."

Your mind instinctively fills in the blank. Words like blue, clear, or calm probably spring to mind. It's highly unlikely you thought of refrigerator.

This intuition, this innate sense of what words fit together, is the very essence of what we call a language model. At its heart, a language model is a system that assigns a probability to a sequence of words, telling us how likely that sequence is to occur in the real world.

This isn't just an academic exercise. This predictive power is the engine behind correcting a typo from "Their is a big difference between ..." to the more probable "There is a big difference between ...". It's how a speech recognition system knows you likely said "I will be back soonish" and not the nonsensical "I will be bassoon dish." And, most profoundly, this simple act of word prediction is the foundational training task for the massive large language models that are reshaping our world today.

The Dragon of Infinite Possibilities

So, how do we build such a model? The most straightforward idea is to rely on simple observation. To figure out the probability of blue following our Saptrishi Dumbeldor Sky sentence, we could just take a massive collection of text – say, the entire internet – and count. We would calculate:

(the number of times we saw "The sky over Saptrishi Dumbeldor today is so beautifully blue") ÷ (the number of times we saw "The sky over Saptrishi Dumbeldor today is so beautifully")

This makes perfect sense, but it runs headfirst into a catastrophic problem: the boundless creativity of human language. The exact phrase "The sky over Saptrishi Dumbeldor today is so beautifully" is almost certainly unique. It has likely never been written before, so our count for it would be zero, making our calculation impossible. We'll almost never find enough data to compute the probability of a word given a long, specific history. This problem is often called data sparsity.
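To see the problem in code, here is a minimal counting sketch in Python. The toy corpus and the phrases are invented purely for illustration; on any realistically long history, both counts would almost always be zero.

```python
# Naive language modeling by counting full histories in a tiny, made-up corpus.
corpus = [
    "the sky today is so beautifully blue".split(),
    "the sky today is so beautifully clear".split(),
    "the sea today is calm".split(),
]

def count_phrase(phrase):
    """How many times does `phrase` appear as a contiguous run of tokens?"""
    n = len(phrase)
    return sum(
        1
        for sent in corpus
        for i in range(len(sent) - n + 1)
        if sent[i:i + n] == phrase
    )

history = "the sky today is so beautifully".split()
print(count_phrase(history + ["blue"]) / count_phrase(history))  # 1 / 2 = 0.5

# A history the corpus has never seen makes the estimate undefined (0 / 0):
unseen = "the sky over the bay today is so beautifully".split()
print(count_phrase(unseen + ["blue"]), count_phrase(unseen))     # 0 0
```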

A Clever Trick with a Catch

To solve this, we can turn to a fundamental tool from probability theory: the chain rule. 

The chain rule allows us to break down the probability of an entire sentence into smaller, connected pieces. The probability of a whole sentence (w₁, w₂, w₃, ... wₙ) can be expressed as a product of conditional probabilities:

P(sentence) = P(w₁) × P(w₂ | w₁) × P(w₃ | w₁, w₂) × ... × P(wₙ | w₁, w₂, ..., wₙ₋₁)

In plain English, the probability of a sentence is the probability of the first word, times the probability of the second word given the first, times the probability of the third given the first two, and so on.
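As a quick sanity check that the chain rule really is just a rewrite and not an approximation, here is a tiny sketch on a made-up four-sentence corpus: counting the whole sentence directly and multiplying the conditional probabilities give the same number.

```python
# A made-up corpus of three-word "sentences", purely for illustration.
corpus = [
    ("the", "sky", "is"),
    ("the", "sky", "was"),
    ("the", "sea", "is"),
    ("a", "sky", "is"),
]

def p_sentence(sentence):
    """Empirical probability of seeing exactly this sentence."""
    return sum(1 for s in corpus if s == sentence) / len(corpus)

def p_cond(word, history):
    """P(word | history): among sentences that start with `history`,
    how often is the next word `word`?"""
    with_history = [s for s in corpus if s[:len(history)] == history]
    if not with_history:
        return 0.0
    return sum(1 for s in with_history if s[len(history)] == word) / len(with_history)

sentence = ("the", "sky", "is")

# Chain rule: P(w1, w2, w3) = P(w1) x P(w2 | w1) x P(w3 | w1, w2)
chain = p_cond("the", ()) * p_cond("sky", ("the",)) * p_cond("is", ("the", "sky"))

print(p_sentence(sentence))  # 0.25
print(chain)                 # 0.25 as well (up to floating-point rounding)
```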

This is mathematically elegant, but it doesn't actually solve our problem. We're still stuck trying to calculate the probability of a word given a long, unique history, like P(wₙ | w₁, w₂, ..., wₙ₋₁). We've just rephrased the impossible task.

The Breakthrough: A Pragmatic Compromise

This is where the core insight of the n-gram model comes in. Instead of wrestling with the entire history of a sentence, we make a radical, simplifying assumption: a word's identity depends only on the few words that came immediately before it.

This is called the Markov assumption. It's a pragmatic compromise that states the recent past is a good enough proxy for the entire past.

With this assumption, our impossibly complex problem suddenly becomes manageable.

  • A bigram model (where n=2) operates on a memory of just one word. It approximates the probability of "blue" following our long sentence with the much simpler probability of "blue" following "beautifully". We are approximating P(blue | The sky over Saptrishi Dumbeldor today is so beautifully) with just P(blue | beautifully).

  • A trigram model (where n=3) has a slightly better memory, looking two words back. It would use P(blue | so beautifully).

The general n-gram model looks at the previous n-1 words to predict the next one. This is a powerful knob we can turn. A larger n captures more context but risks running back into our data sparsity problem. A smaller n is more robust but might miss crucial long-range dependencies.

By applying this Markov assumption to the chain rule, we transform our calculation for the probability of a sentence into a product of simple, countable n-gram probabilities. For a bigram model, our daunting calculation becomes:

P(sentence) ≈ P(w₁) × P(w₂ | w₁) × P(w₃ | w₂) × ... × P(wₙ | wₙ₋₁)

Each piece of this puzzle, like P(w₃ | w₂), is something we can actually estimate from a large body of text by simply counting how often w₃ follows w₂.
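To make the counting concrete, here is a minimal bigram sketch; the three training sentences are made up for illustration, and <s> and </s> are the conventional start- and end-of-sentence markers. It estimates each conditional probability by counting and then multiplies them to score a new sentence.

```python
from collections import Counter

# Toy training corpus, invented for illustration.
sentences = [
    "<s> the sky is so beautifully blue </s>",
    "<s> the sky is clear </s>",
    "<s> the sea is so calm </s>",
]

unigrams = Counter()
bigrams = Counter()
for sent in sentences:
    tokens = sent.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p_bigram(word, prev):
    """Maximum likelihood estimate: P(word | prev) = Count(prev word) / Count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

def p_sentence(sentence):
    """Probability of a sentence under the bigram (Markov) approximation."""
    tokens = sentence.split()
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= p_bigram(word, prev)
    return prob

print(p_bigram("is", "sky"))                      # Count(sky is) / Count(sky) = 2/2 = 1.0
print(p_sentence("<s> the sky is so calm </s>"))  # never seen verbatim, yet probability > 0
```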

With this assumption, our impossibly complex problem suddenly becomes manageable.

  • A bigram model (where n=2) operates on a memory of just one word. It approximates the probability of "blue" following our long sentence with the much simpler probability of "blue" following "beautifully". We are approximating P(blue | The sky over Saptrishi Dumbeldor today is so beautifully) with just P(blue | beautifully).

  • A trigram model (where n=3) has a slightly better memory, looking two words back. It would use P(blue | so beautifully).

The general n-gram model looks at the previous n-1 words to predict the next one. This is a powerful knob we can turn. A larger n captures more context but risks running back into our data sparsity problem. A smaller n is more robust but might miss crucial long-range dependencies.

By applying this Markov assumption to the chain rule, we transform our calculation for the probability of a sentence into a product of simple, countable n-gram probabilities. For a bigram model, our daunting calculation becomes:

P(sentence) ≈ P(w₁) × P(w₂ | w₁) × P(w₃ | w₂) × ... × P(wₙ | wₙ₋₁)

Each piece of this puzzle, like P(w₃ | w₂), is something we can actually estimate from a large body of text by simply counting how often w₃ follows w₂.

Is It Any Good? The Perplexity Question

So we’ve built our n-gram model. We fed it a giant corpus of text, and it dutifully counted all the word pairs (bigrams) and triplets (trigrams). But how do we know if it’s any good?

In machine learning, we never test a model on the same data it trained on. That would be like giving a student an exam with the exact same questions they studied in their textbook. They might get a perfect score, but it wouldn't tell you if they actually learned the concepts.

Instead, we use three distinct datasets:

  1. Training Set: The bulk of our data, used to learn the n-gram probabilities. (The textbook).

  2. Development Set: A smaller, separate dataset used to tune our model and make design choices. (The practice exams).

  3. Test Set: A final, unseen dataset that the model is only evaluated on once to get its final score. (The final exam). A minimal way to carve out these three sets is sketched just after this list.
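A minimal sketch of that three-way split; the 80/10/10 proportions and the placeholder corpus are just a common convention, not something prescribed here.

```python
import random

# Pretend corpus: one sentence per entry (placeholder strings for illustration).
corpus = [f"sentence {i}" for i in range(10_000)]

random.seed(0)          # reproducible shuffle
random.shuffle(corpus)  # avoid accidental ordering effects (e.g. by topic or date)

n = len(corpus)
train_set = corpus[: int(0.8 * n)]               # learn the n-gram counts here
dev_set = corpus[int(0.8 * n): int(0.9 * n)]     # tune design choices (n, smoothing, ...)
test_set = corpus[int(0.9 * n):]                 # touched only once, for the final score

print(len(train_set), len(dev_set), len(test_set))  # 8000 1000 1000
```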

The key question is: what score are we looking for? We want to measure how well our model predicts the test set. A good model will assign a high probability to the sentences in the test set. It won't be "surprised" by them. A bad model will be very surprised, meaning it assigns a very low probability to the sentences it sees.

We use a metric called Perplexity (PPL) to measure this "surprise". While the formula looks a bit intimidating (PPL(W) = (1 / P(W))^(1/N)), the intuition is simple:

  • Perplexity is the inverse probability of the test set, normalized for its length.

  • Because it’s an inverse, a higher probability results in a lower perplexity.

  • Therefore, a lower perplexity score is better.

You can think of perplexity as the weighted average branching factor. If a language model has a perplexity of 100 on a test set, it means that at each word, it was as confused as if it were choosing between 100 equally likely words. A model with a perplexity of 20 is much more confident and has a better idea of what's coming next.

When we build n-gram models, we see a clear trend:

  • Unigram Model PPL: Very high (e.g., 962)

  • Bigram Model PPL: Much lower (e.g., 170)

  • Trigram Model PPL: Even lower (e.g., 109)

More context (a larger n) helps the model make better predictions, reducing its surprise and lowering its perplexity.
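In code, perplexity falls straight out of the per-word probabilities. Here is a minimal sketch for a bigram model, working in log space to avoid numerical underflow on long texts; the probability table at the bottom is hypothetical, just to exercise the formula.

```python
import math

def perplexity(tokens, prob_fn):
    """PPL(W) = (1 / P(W))^(1/N): the inverse probability of the text,
    normalized by the number of predicted words N."""
    log_prob = 0.0
    n = 0
    for prev, word in zip(tokens, tokens[1:]):
        p = prob_fn(word, prev)   # P(word | prev) from a bigram model
        log_prob += math.log(p)   # assumes p > 0; smoothing (next section) handles p == 0
        n += 1
    return math.exp(-log_prob / n)

# Hypothetical bigram probabilities for one tiny test sentence.
toy_probs = {("<s>", "the"): 0.5, ("the", "sky"): 0.2, ("sky", "is"): 0.5, ("is", "blue"): 0.1}
tokens = ["<s>", "the", "sky", "is", "blue"]

print(perplexity(tokens, lambda w, prev: toy_probs[(prev, w)]))
# ~3.76: at each step the model was roughly as uncertain as picking among ~4 equally likely words
```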

The Achilles' Heel: What Happens When You've Seen Nothing?

Our model seems great, but it has a catastrophic flaw. Imagine we train our model on a massive corpus of news articles. 

We then show it a sentence from a test set: "I want to eat ruby slippers."

It’s possible that in our millions of articles, the model saw the word "ruby" and the word "slippers", but it never saw the specific bigram "ruby slippers". According to our Maximum Likelihood Estimation formula:

P(slippers | ruby) = Count(ruby slippers) / Count(ruby) = 0 / Count(ruby) = 0

The probability is zero. Because we multiply all the probabilities in a sentence together, a single zero makes the entire sentence's probability zero. This sends our perplexity score to infinity and breaks the model completely. This is the zero-frequency problem, and it’s a huge deal because no matter how big your training data is, you can never observe every possible combination of words.

The solution is a technique called smoothing. The core idea is to act like a linguistic Robin Hood: steal a tiny bit of probability mass from the events we have seen and distribute it to all the events we've never seen, just in case they show up later.

The simplest (and most brutish) way to do this is Laplace Smoothing, also known as ‘Add-One Smoothing’. You just add 1 to every single n-gram count in your data before you calculate the probabilities. So, "ruby slippers" now has a count of 1 instead of 0. Problem solved!

Well, not quite. Add-one smoothing is a blunt instrument. By giving a count of 1 to trillions of unseen n-grams, it steals too much probability mass from the n-grams we actually saw, distorting our original model.
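For reference, here is what add-one looks like on top of bigram counts. V is the vocabulary size; adding V to the denominator keeps the smoothed probabilities summing to 1. The counts below are invented for illustration.

```python
from collections import Counter

def p_laplace(word, prev, bigrams, unigrams):
    """Add-one (Laplace) smoothed bigram estimate:
    P(word | prev) = (Count(prev word) + 1) / (Count(prev) + V)."""
    V = len(unigrams)                        # vocabulary size
    num = bigrams.get((prev, word), 0) + 1   # every bigram now has a count of at least 1
    den = unigrams.get(prev, 0) + V
    return num / den

# Made-up counts: "ruby" and "slippers" were each seen, but never as a pair.
unigrams = Counter({"the": 250, "ruby": 12, "slippers": 7})
bigrams = Counter({("the", "ruby"): 3})

print(p_laplace("slippers", "ruby", bigrams, unigrams))  # (0 + 1) / (12 + 3) ≈ 0.067
```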

A much more elegant solution is interpolation. Instead of just relying on the trigram probability, we can create a blended estimate by combining the trigram, bigram, and unigram predictions.

P_final(w₃ | w₁, w₂) = λ₁ × P(w₃ | w₁, w₂) + λ₂ × P(w₃ | w₂) + λ₃ × P(w₃)

If our model has never seen the trigram (eat, ruby, slippers), it can "fall back" on the bigram probability of (ruby, slippers) and even the simple unigram probability of (slippers). The lambda (λ) weights, which sum to 1, determine how much we trust each n-gram level, and they are cleverly learned from a held-out development set. This way, we get the best of all worlds: we rely on longer contexts when we have good evidence for them, but gracefully back off to shorter, more reliable contexts when we encounter something new.
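Here is a minimal sketch of that blend. The λ values and the component probabilities below are placeholders; in practice the λs are tuned on the development set.

```python
def p_interpolated(w3, w1, w2, p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    """Linear interpolation of trigram, bigram, and unigram estimates.
    The lambdas must sum to 1 so the blend is still a probability."""
    l1, l2, l3 = lambdas
    return l1 * p_tri(w3, w1, w2) + l2 * p_bi(w3, w2) + l3 * p_uni(w3)

# Hypothetical component estimates for the context ("eat", "ruby") -> "slippers":
p_tri = lambda w3, w1, w2: 0.0   # trigram (eat, ruby, slippers) never seen
p_bi = lambda w3, w2: 0.02       # bigram (ruby, slippers) is rare but attested
p_uni = lambda w3: 0.001         # "slippers" is a rare word overall

print(p_interpolated("slippers", "eat", "ruby", p_tri, p_bi, p_uni))
# 0.6*0.0 + 0.3*0.02 + 0.1*0.001 = 0.0061: non-zero despite the unseen trigram
```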

The Legacy of N-Grams

If you ask an n-gram model trained on Shakespeare to write a sentence, you might get:

This shall forbid it should be branded, if renown made it empty.

If you ask one trained on the Wall Street Journal, it might say:

They also point to ninety-nine point six billion dollars.

These examples perfectly capture both the power and limitations of n-grams. They are exceptionally good at capturing the statistical patterns, style, and vocabulary of the data they are trained on. But they have no deeper understanding. They don't know what "renown" or "billion dollars" actually means. They are just brilliant pattern-matchers.

Today, n-gram models have been largely replaced in cutting-edge applications by neural network-based models. These modern marvels, like the Transformer architecture that powers ChatGPT, learn word representations in a continuous space, allowing them to understand that "boat" is more similar to "ship" than it is to "banana" - a form of generalization that n-grams could never achieve.

However, the core lessons of n-grams remain as relevant as ever. The entire paradigm of building a language model - training on vast data, evaluating with perplexity on a held-out test set, and finding clever ways to handle the infinite creativity of language - was pioneered and perfected with these simple, powerful models. They were the first step in teaching machines to speak our language, and they remain the best first step for anyone looking to understand the probabilistic heart of modern AI.
