Language Modeling
Language models, HMMs, maximum entropy, and conditional random fields.
Language Models
What is a Language Model?
At its absolute core, a Language Model (LM) does one simple mathematical thing: it assigns probabilities to sequences of words. It determines how "likely" a specific sentence is to exist in a given language.
High Probability (Valid English)
Low Probability (Gibberish)
The Goal: Next Word Prediction
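By the chain rule of probability, the probability of a full sentence factors into a product of next-word probabilities, each conditioned on the words before it. Assigning probabilities to full sentences is therefore equivalent to the task of Next Word Prediction (the autoregressive task).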
For example, when the preceding words describe ordering a drink at a café, a good language model will assign a high probability to next words like "latte" or "cappuccino", and a near-zero probability to words like "car" or "elephant".
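The chain-rule factorization can be sketched in a few lines of Python. This is a minimal illustration, not a real model: the probability table `COND_PROB`, the `<s>` start marker, and the `floor` value for unseen pairs are all hypothetical, and each word is conditioned only on the single previous word to keep the table small.

```python
import math

# Hypothetical conditional probabilities P(next_word | previous_word).
# "<s>" marks the start of the sentence; all numbers are made up.
COND_PROB = {
    ("<s>", "i"): 0.5,
    ("i", "ordered"): 0.2,
    ("ordered", "a"): 0.6,
    ("a", "latte"): 0.05,
    ("a", "car"): 0.0001,
}

def sentence_log_prob(words, cond_prob=COND_PROB, floor=1e-8):
    """Sum log P(w_i | w_{i-1}) over the sentence (chain rule, one-word history)."""
    log_p = 0.0
    prev = "<s>"
    for word in words:
        # Unseen pairs get a tiny floor probability instead of zero.
        log_p += math.log(cond_prob.get((prev, word), floor))
        prev = word
    return log_p

likely = sentence_log_prob(["i", "ordered", "a", "latte"])
unlikely = sentence_log_prob(["i", "ordered", "a", "car"])
print(math.exp(likely), math.exp(unlikely))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied together.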
The Evolution of Language Models
| Era | Model Type | How it predicts the next word |
|---|---|---|
| 1990s | Statistical N-gram Models | Estimates the next word from how often it followed the previous (n-1) words in a training corpus. Context is limited to those (n-1) words. |
| 2010s | Recurrent Neural Nets (RNNs) | Passes a "hidden state" vector left-to-right through a neural network. Can in principle carry context across the whole sequence, but vanishing gradients make long-range dependencies hard to learn. |
| 2018 - Present | Transformer LLMs (GPT) | Uses "Self-Attention" to let each position attend directly to every earlier word in the context, in parallel. Scales to hundreds of billions of parameters or more, and large models produce text that can resemble human reasoning. |
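The count-based prediction in the N-gram row can be sketched for n = 2 (a bigram model). The three-sentence corpus and the function names `train_bigram` and `next_word_probs` are assumptions made for illustration, and no smoothing is applied.

```python
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, prev):
    """Relative-frequency estimate of P(next | prev); empty if prev was never seen."""
    total = sum(counts[prev].values())
    if total == 0:
        return {}
    return {word: c / total for word, c in counts[prev].items()}

# Toy corpus, made up for this example.
corpus = ["the dog runs", "the dog sleeps", "the cat runs"]
counts = train_bigram(corpus)
probs = next_word_probs(counts, "the")
print(probs)
```

Because the model only ever looks at the previous word, anything said earlier in the sentence cannot influence the prediction, which is the limited memory noted in the table.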
Hidden Markov Models (HMM)
Before Deep Learning, Hidden Markov Models (HMMs) were a standard approach for NLP sequence tasks like Part-of-Speech tagging and Speech Recognition. HMMs are probabilistic graphical models that pair a sequence of visible observations with a sequence of hidden states.
The Concept of "Hidden" States
Imagine you are trying to predict the weather (Sunny or Rainy) based only on how your friend dresses (T-shirt or Coat) when they come inside.
- Observations (Visible): The words in a sentence. (e.g., "The", "dog", "runs"). Or your friend's clothes.
- Hidden States: The underlying labels we want to guess. (e.g., Determiner, Noun, Verb). Or the actual Weather.
The Two Probabilities of HMM
1. Transition Probabilities
The probability of moving from one Hidden State to another Hidden State.
P( Verb | Noun ) = 0.61
Meaning: If the previous word was tagged as a Noun, there's a 61% chance the very next word will be tagged as a Verb.
2. Emission Probabilities
The probability of a Hidden State generating/emitting a specific Observation (Word).
P( "runs" | Verb ) = 0.02
Meaning: If the true underlying state is a Verb, there's a 2% chance the specific word written down is "runs".
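To see how the two kinds of probability combine, here is a small sketch that scores one tagged sentence. The two values quoted above (0.61 and 0.02) are reused; every other number, the `START` distribution, and the tag names `DET`, `NOUN`, `VERB` are made up for illustration.

```python
# Hypothetical HMM parameters; only 0.61 and 0.02 come from the text.
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {  # TRANS[prev][cur] = P(cur tag | prev tag)
    "DET": {"DET": 0.05, "NOUN": 0.80, "VERB": 0.15},
    "NOUN": {"DET": 0.09, "NOUN": 0.30, "VERB": 0.61},
    "VERB": {"DET": 0.50, "NOUN": 0.35, "VERB": 0.15},
}
EMIT = {  # EMIT[tag][word] = P(word | tag); missing words count as 0
    "DET": {"the": 0.5},
    "NOUN": {"dog": 0.05, "runs": 0.001},
    "VERB": {"runs": 0.02, "dog": 0.0005},
}

def joint_prob(words, tags):
    """P(tags, words) = start * emission, then transition * emission per step."""
    p = START[tags[0]] * EMIT[tags[0]].get(words[0], 0.0)
    for i in range(1, len(words)):
        p *= TRANS[tags[i - 1]][tags[i]]
        p *= EMIT[tags[i]].get(words[i], 0.0)
    return p

words = ["the", "dog", "runs"]
good = joint_prob(words, ["DET", "NOUN", "VERB"])
bad = joint_prob(words, ["DET", "VERB", "NOUN"])
print(good, bad)
```

With these numbers the sensible tagging scores far higher than the scrambled one, which is the signal a decoder searches for.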
The Viterbi Algorithm
If we give the HMM a sentence ("The dog runs"), how does it find the correct sequence of POS tags?
It uses Dynamic Programming via the Viterbi Algorithm. At each word, Viterbi multiplies the Transition and Emission probabilities and keeps, for every hidden state, only the highest-scoring path that ends in that state. Tracing the stored back-pointers from the best final state recovers the most probable sequence of hidden states.
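A compact sketch of Viterbi decoding for the sentence "The dog runs" follows. The parameter tables are hypothetical (only 0.61 and 0.02 appear in the text), and the function works with raw probabilities for readability; practical implementations usually add log-probabilities instead to avoid underflow.

```python
STATES = ["DET", "NOUN", "VERB"]
# Hypothetical parameters, made up for this example.
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {
    "DET": {"DET": 0.05, "NOUN": 0.80, "VERB": 0.15},
    "NOUN": {"DET": 0.09, "NOUN": 0.30, "VERB": 0.61},
    "VERB": {"DET": 0.50, "NOUN": 0.35, "VERB": 0.15},
}
EMIT = {
    "DET": {"the": 0.5},
    "NOUN": {"dog": 0.05, "runs": 0.001},
    "VERB": {"runs": 0.02, "dog": 0.0005},
}

def viterbi(words, states=STATES, start=START, trans=TRANS, emit=EMIT):
    """Return (best tag path, its probability) for the observed words."""
    # best[t][s]: probability of the best path that ends in state s at step t.
    best = [{s: start[s] * emit[s].get(words[0], 0.0) for s in states}]
    back = [{}]  # back[t][s]: previous state on that best path
    for t in range(1, len(words)):
        best.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: best[t - 1][p] * trans[p][s])
            best[t][s] = best[t - 1][prev] * trans[prev][s] * emit[s].get(words[t], 0.0)
            back[t][s] = prev
    # Pick the best final state, then follow back-pointers to the start.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1], best[-1][last]

path, prob = viterbi(["the", "dog", "runs"])
print(path, prob)
```

The table `best` has one entry per (position, state) pair, so the cost grows linearly with sentence length rather than exponentially with the number of possible tag sequences.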
Maximum Entropy Models
Maximum Entropy (MaxEnt) is a powerful probabilistic classifier widely used in NLP for text classification and Named Entity Recognition. In Machine Learning contexts, a MaxEnt classifier is mathematically identical to Multinomial Logistic Regression.
"Model all that is known and assume nothing about that which is unknown."
The Information Theory Approach
In Information Theory, "Entropy" is a measure of uncertainty or randomness. To satisfy the core principle, a MaxEnt model chooses the probability distribution that has the highest entropy (most uniform/flat distribution) subject to matching the empirical constraints seen in the training data.
A Simple NLP Example
Suppose we want to classify a document's topic into 3 labels: [Politics, Sports, Technology]
- Zero Knowledge: If we have no features extracted from the text, MaxEnt assigns equal probability (highest entropy) to all:
P(Politics)=33%, P(Sports)=33%, P(Tech)=33%
- Applying Constraints (Features): If the document contains the word "ball", our training data indicates the topic is never Politics. However, we have no data favoring Sports vs Tech. MaxEnt distributes the remaining probability uniformly:
Feature="ball" → P(Politics)=0%, P(Sports)=50%, P(Tech)=50%
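The entropy values behind this example can be checked directly. The sketch below uses the Shannon entropy in bits; the skewed distribution `[0.0, 0.9, 0.1]` is an extra made-up case, included only to show that an uneven split satisfying the same constraint has lower entropy.

```python
import math

def entropy(dist):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

# Order of entries: [Politics, Sports, Tech]
uniform = entropy([1 / 3, 1 / 3, 1 / 3])      # no constraints at all
constrained_flat = entropy([0.0, 0.5, 0.5])   # Politics ruled out, rest split evenly
constrained_skewed = entropy([0.0, 0.9, 0.1]) # Politics ruled out, uneven split

print(uniform, constrained_flat, constrained_skewed)
```

Among distributions that put zero mass on Politics, the even 50/50 split has the highest entropy, which is why MaxEnt prefers it when nothing else is known.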
Why use MaxEnt in NLP?
Compared to models like Naive Bayes, MaxEnt is advantageous for NLP because it does not assume that features are conditionally independent given the label.
In NLP, words are highly correlated. The presence of the word "Hong" strongly predicts the word "Kong". Naive Bayes treats the two words as independent pieces of evidence and so over-counts them. MaxEnt models handle such overlapping contextual features through weights that are learned jointly.
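In its logistic-regression form, a MaxEnt classifier scores each label by summing the weights of the active features and then normalizes with a softmax. The sketch below assumes hypothetical feature names and hand-picked weights in `WEIGHTS`; a real model would learn them from training data.

```python
import math

LABELS = ["Politics", "Sports", "Technology"]

# Hypothetical weights, one per (feature, label) pair; unlisted pairs weigh 0.
WEIGHTS = {
    ("contains_ball", "Politics"): -5.0,
    ("contains_election", "Politics"): 2.0,
}

def maxent_probs(features, weights=WEIGHTS, labels=LABELS):
    """P(label | features) = exp(sum of active weights) / normalizer."""
    scores = {
        y: sum(weights.get((f, y), 0.0) for f in features)
        for y in labels
    }
    z = sum(math.exp(s) for s in scores.values())
    return {y: math.exp(s) / z for y, s in scores.items()}

no_features = maxent_probs([])
with_ball = maxent_probs(["contains_ball"])
print(no_features)
print(with_ball)
```

With no active features every label gets the same probability, and a strongly negative weight pushes Politics close to zero while leaving Sports and Technology tied. A softmax never outputs exactly 0%, so the 0% in the worked example corresponds to the limit of an increasingly negative weight.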
Conditional Random Fields (CRF)
Conditional Random Fields (CRFs) are the ultimate evolution of statistical sequence modeling prior to the deep learning era. They combine the ability of Hidden Markov Models (HMMs) to predict sequences with the ability of Maximum Entropy models to use vast numbers of overlapping, custom features.
Generative vs Discriminative
- HMMs are Generative: They model the joint probability P(Labels, Words). They try to learn how the data was generated.
- CRFs are Discriminative: They model the conditional probability P(Labels | Words) directly. They don't care about predicting the data; they only care about drawing the boundary between the correct labels!
The Feature Function Advantage
The superpower of CRFs in Named Entity Recognition (NER) is that you can hand-craft thousands of highly specific "Feature Functions" that look at the entire sentence at once, not just the previous state.
Example CRF Custom Features for NER
If we are predicting whether the current word is a "Person" entity, a CRF can ingest all these features simultaneously:
- F1: Is the current word Capitalized? (Yes/No)
- F2: Does the previous word == "Mr."? (Yes/No)
- F3: Is the word entirely digits? (Yes/No)
- F4: Is the word in our predefined list of cities? (Yes/No)
- F5: Does the word end in the suffix "-tion"? (Yes/No)
An HMM cannot easily use these overlapping features, because its emission model assumes each word depends only on the current hidden state. A CRF assigns a learned weight to each of these feature functions and sums the weighted features to score each candidate label sequence.
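The five features above can be written as ordinary Python checks. This is a sketch only: the city list `CITIES`, the weights in `PERSON_WEIGHTS`, and the example sentence are hypothetical, and the function scores a single token for the "Person" label, whereas a full CRF also includes tag-transition features and normalizes over whole label sequences.

```python
CITIES = {"paris", "london", "tokyo"}  # hypothetical predefined city list

def token_features(tokens, i):
    """Binary features F1-F5 for the token at position i."""
    word = tokens[i]
    prev = tokens[i - 1] if i > 0 else "<s>"
    return {
        "is_capitalized": word[:1].isupper(),            # F1
        "prev_is_mr": prev == "Mr.",                     # F2
        "is_all_digits": word.isdigit(),                 # F3
        "in_city_list": word.lower() in CITIES,          # F4
        "ends_with_tion": word.lower().endswith("tion"), # F5
    }

# Hypothetical weights for the "Person" label; a trained CRF would learn these.
PERSON_WEIGHTS = {
    "is_capitalized": 1.0,
    "prev_is_mr": 2.5,
    "is_all_digits": -3.0,
    "in_city_list": -1.5,
    "ends_with_tion": -1.0,
}

def person_score(features, weights=PERSON_WEIGHTS):
    """Sum the weights of every feature that fires."""
    return sum(w for name, w in weights.items() if features[name])

tokens = ["Mr.", "Smith", "visited", "Paris", "in", "2019"]
smith = person_score(token_features(tokens, 1))
paris = person_score(token_features(tokens, 3))
print(smith, paris)
```

Several features can fire on the same token without any independence assumption being violated; the weights simply add up.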
Modern Usage: BiLSTM-CRF
CRFs didn't die with the advent of Deep Learning! In fact, the state-of-the-art for NER before Transformers was the BiLSTM-CRF architecture.
The BiLSTM reads the text and extracts neural features, outputting raw scores for tags. A CRF layer is added at the very end; its learned transition scores penalize invalid tag sequences (e.g., ensuring an 'Inside-Person' tag never directly follows a 'Beginning-Location' tag).
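The effect of the CRF layer at decoding time can be sketched without any neural network. Below, `EMISSIONS` stands in for the per-token tag scores a BiLSTM would output (the numbers are made up), and `transition_score` is a hand-written stand-in for the learned transition matrix, using a large negative penalty for one invalid transition.

```python
TAGS = ["O", "B-PER", "I-PER", "B-LOC"]
PENALTY = -1e4  # stand-in for a strongly negative learned transition score

def transition_score(prev, cur):
    """Allow I-PER only after B-PER or I-PER; all other transitions score 0."""
    if cur == "I-PER" and prev not in ("B-PER", "I-PER"):
        return PENALTY
    return 0.0

def decode(emissions):
    """Viterbi over additive scores: emission score plus transition score."""
    best = [dict(emissions[0])]
    back = []
    for scores in emissions[1:]:
        cur_best, cur_back = {}, {}
        for tag in TAGS:
            prev = max(TAGS, key=lambda p: best[-1][p] + transition_score(p, tag))
            cur_best[tag] = best[-1][prev] + transition_score(prev, tag) + scores[tag]
            cur_back[tag] = prev
        best.append(cur_best)
        back.append(cur_back)
    tag = max(TAGS, key=lambda t: best[-1][t])
    path = [tag]
    for pointers in reversed(back):
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]

# Hypothetical per-token scores for a two-token input.
EMISSIONS = [
    {"O": 0.1, "B-PER": 1.8, "I-PER": 0.0, "B-LOC": 2.0},
    {"O": 0.5, "B-PER": 0.2, "I-PER": 2.5, "B-LOC": 0.1},
]

greedy = [max(TAGS, key=scores.get) for scores in EMISSIONS]
crf_path = decode(EMISSIONS)
print(greedy, crf_path)
```

Picking the best tag per token independently yields an invalid pair here, while decoding with transition scores settles on a consistent sequence even though its first tag is not the top-scoring one in isolation.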