Language Modeling | Nikhil Learn Hub

Language Models

What is a Language Model?

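At its core, a Language Model (LM) does one simple mathematical thing: it assigns probabilities to sequences of words. It determines how "likely" a specific sentence is to occur in a given language. The figures below are illustrative; real sentence probabilities are far smaller, and what matters is their relative order.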

High Probability (Valid English)

P("The cat sat on the mat") = 0.95

Low Probability (Gibberish)

P("Mat the on sat cat the") = 0.0001

The Goal: Next Word Prediction

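By the chain rule of probability, the probability of a full sentence factors into a product of next-word probabilities: P(w1, ..., wn) = P(w1) × P(w2 | w1) × ... × P(wn | w1, ..., wn-1). Assigning probabilities to full sentences is therefore mathematically equivalent to Next Word Prediction (the autoregressive task).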

"I went to the coffee shop and ordered a ________"

A good language model will assign a high probability to words like "latte" or "cappuccino", and a near-zero probability to words like "car" or "elephant".
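
The link between sentence probability and next-word prediction can be sketched with a toy bigram model. This is a minimal illustration, not a realistic LM: the corpus, the `<s>` and `</s>` boundary markers, and the function names are all assumptions made for the example, and the model conditions on only one previous word instead of the full history.

```python
from collections import Counter, defaultdict

# Tiny hypothetical corpus; "<s>" and "</s>" mark sentence start and end.
corpus = [
    "i ordered a latte",
    "i ordered a cappuccino",
    "i ordered a latte",
    "she ordered a tea",
]

# Count how often each word follows each previous word.
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[prev][word] += 1

def next_word_probs(prev):
    """Estimate P(word | prev) from raw counts (no smoothing)."""
    counts = bigram_counts[prev]
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

def sentence_prob(sentence):
    """Chain rule with a bigram approximation: multiply P(w_i | w_{i-1})."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= next_word_probs(prev).get(word, 0.0)
    return prob

print(next_word_probs("a"))                 # distribution over the next word
print(sentence_prob("i ordered a latte"))   # fluent word order
print(sentence_prob("latte a ordered i"))   # scrambled word order
```

With this corpus, "latte" is the most likely word after "a", and the scrambled sentence gets probability zero because its bigrams never occur. A practical model would add smoothing so unseen word pairs do not zero out the whole product.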

The Evolution of Language Models

Era	Model Type	How it predicts the next word
1990s	Statistical N-gram Models	Estimates the next word from counts of how often it followed the previous n-1 words in a training corpus. Context is limited to those n-1 words.
2010s	Recurrent Neural Nets (RNNs)	Passes a "hidden state" vector left-to-right through a neural network. Can in principle carry long-range context, but suffers from vanishing gradients in practice.
2018 - Present	Transformer LLMs (GPT)	Uses "Self-Attention" to attend to every preceding word in the context at once. Scales to hundreds of billions of parameters or more and performs strongly on reasoning-style tasks.

Hidden Markov Models (HMM)

Before deep learning, Hidden Markov Models (HMMs) were a standard approach for NLP sequence tasks such as Part-of-Speech tagging and Speech Recognition. An HMM is a probabilistic graphical model in which each observation is generated by an unobserved (hidden) state.

The Concept of "Hidden" States

Imagine you are trying to infer the weather (Sunny or Rainy) based only on how your friend is dressed (T-shirt or Coat) when they come inside.

Observations (Visible): What we can see directly. In tagging, these are the words in a sentence (e.g., "The", "dog", "runs"); in the analogy, your friend's clothes.
Hidden States: The underlying labels we want to infer. In tagging, these are the POS tags (e.g., Determiner, Noun, Verb); in the analogy, the actual weather.

The Two Probabilities of HMM

1. Transition Probabilities

The probability of moving from one Hidden State to another Hidden State.

P( Noun | Determiner ) = 0.8
P( Verb | Noun ) = 0.61

Meaning: If the current word is tagged as a Determiner (e.g., "The"), there's an 80% chance the very next word is tagged as a Noun.

2. Emission Probabilities

The probability of a Hidden State generating/emitting a specific Observation (Word).

P( "dog" | Noun ) = 0.05
P( "runs" | Verb ) = 0.02

Meaning: If the true underlying state is a Noun, there's a 5% chance the specific word written down is "dog".
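
To see how the two kinds of probability combine, here is a small sketch that scores one candidate tag sequence for "the dog runs". The four values quoted above are used as given; the start transition and the emission for "the" are assumptions added only so the example is complete, and the names `transition`, `emission`, and `joint_prob` are hypothetical.

```python
# Transition probabilities P(tag | previous tag).
transition = {
    ("START", "Det"): 0.5,   # assumption, not from the text
    ("Det", "Noun"): 0.8,
    ("Noun", "Verb"): 0.61,
}

# Emission probabilities P(word | tag).
emission = {
    ("Det", "the"): 0.6,     # assumption, not from the text
    ("Noun", "dog"): 0.05,
    ("Verb", "runs"): 0.02,
}

def joint_prob(words, tags):
    """P(tags, words): one transition and one emission factor per position."""
    prob = 1.0
    prev = "START"
    for word, tag in zip(words, tags):
        prob *= transition.get((prev, tag), 0.0) * emission.get((tag, word), 0.0)
        prev = tag
    return prob

p = joint_prob(["the", "dog", "runs"], ["Det", "Noun", "Verb"])
print(p)  # approximately 0.000146
```

The result is small because six probabilities are multiplied together, which is why practical implementations usually add log-probabilities instead.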

The Viterbi Algorithm

If we give the HMM a sentence ("The dog runs"), how does it find the correct sequence of POS tags?

It uses dynamic programming via the Viterbi Algorithm. At each word, Viterbi keeps, for every possible tag, only the highest-probability path that ends in that tag, and extends those paths by multiplying in the Transition and Emission probabilities. Backtracking from the best final tag then recovers the most probable sequence of hidden states.
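
A compact sketch of Viterbi decoding for the "the dog runs" example follows. The probability tables are toy values: a few match the numbers quoted earlier, and the rest are assumptions chosen so each row is a plausible distribution. The function signature is also just one reasonable choice, not a reference implementation.

```python
def viterbi(words, states, start_p, trans_p, emit_p):
    """Return the most probable state path and its joint probability."""
    # best[t][s]: probability of the best path that ends in state s at position t
    best = [{s: start_p.get(s, 0.0) * emit_p[s].get(words[0], 0.0) for s in states}]
    back = [{}]  # back[t][s]: previous state on that best path
    for t in range(1, len(words)):
        best.append({})
        back.append({})
        for s in states:
            prev_state, score = max(
                ((p, best[t - 1][p] * trans_p[p].get(s, 0.0)) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score * emit_p[s].get(words[t], 0.0)
            back[t][s] = prev_state
    # Backtrack from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]

states = ["Det", "Noun", "Verb"]
start_p = {"Det": 0.5, "Noun": 0.3, "Verb": 0.2}           # assumed
trans_p = {
    "Det":  {"Det": 0.1, "Noun": 0.8, "Verb": 0.1},
    "Noun": {"Det": 0.1, "Noun": 0.29, "Verb": 0.61},
    "Verb": {"Det": 0.5, "Noun": 0.4, "Verb": 0.1},        # assumed
}
emit_p = {
    "Det":  {"the": 0.6},                                  # assumed
    "Noun": {"dog": 0.05, "runs": 0.001},
    "Verb": {"runs": 0.02, "dog": 0.001},
}

path, prob = viterbi(["the", "dog", "runs"], states, start_p, trans_p, emit_p)
print(path)  # ['Det', 'Noun', 'Verb']
```

The cost grows linearly with sentence length and quadratically with the number of tags, instead of exponentially as it would if every possible tag sequence were scored separately.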

Maximum Entropy Models

Maximum Entropy (MaxEnt) is a powerful probabilistic classifier widely used in NLP for text classification and Named Entity Recognition. In Machine Learning contexts, a MaxEnt classifier is mathematically identical to Multinomial Logistic Regression.

The Core Principle of Maximum Entropy:
"Model all that is known and assume nothing about that which is unknown."

The Information Theory Approach

In Information Theory, "Entropy" is a measure of uncertainty or randomness. To satisfy the core principle, a MaxEnt model chooses the probability distribution that has the highest entropy (most uniform/flat distribution) subject to matching the empirical constraints seen in the training data.

A Simple NLP Example

Suppose we want to classify a document's topic into 3 labels: [Politics, Sports, Technology]

Zero Knowledge: If we have no features extracted from the text, MaxEnt assigns equal probability (highest entropy) to all:
P(Politics)=33%, P(Sports)=33%, P(Tech)=33%
Applying Constraints (Features): If the document contains the word "ball", our training data indicates the topic is never Politics. However, we have no data favoring Sports vs Tech. MaxEnt distributes the remaining probability uniformly:
Feature="ball" → P(Politics)=0%, P(Sports)=50%, P(Tech)=50%
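
The claim that the uniform split is the highest-entropy choice can be checked numerically. The sketch below computes Shannon entropy for the two MaxEnt distributions above and for two alternative distributions that are made up for comparison.

```python
import math

def entropy(dist):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# No features: uniform versus an arbitrary skewed guess (hypothetical).
uniform = {"Politics": 1 / 3, "Sports": 1 / 3, "Tech": 1 / 3}
skewed = {"Politics": 0.8, "Sports": 0.1, "Tech": 0.1}

# Feature "ball": Politics is ruled out; the rest is split evenly or unevenly.
ball_maxent = {"Politics": 0.0, "Sports": 0.5, "Tech": 0.5}
ball_skewed = {"Politics": 0.0, "Sports": 0.9, "Tech": 0.1}

print(entropy(uniform), entropy(skewed))
print(entropy(ball_maxent), entropy(ball_skewed))
```

In both cases the evenly split distribution has the larger entropy: log2(3) bits with no constraints, and exactly 1 bit once Politics is excluded.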

Why use MaxEnt in NLP?

Compared to models like Naive Bayes, MaxEnt is highly advantageous for NLP because it does not assume that features are conditionally independent given the class.

In NLP, words are highly correlated. The word "Hong" is almost always followed by the word "Kong". Naive Bayes treats the two words as independent pieces of evidence and therefore double-counts them. MaxEnt models handle overlapping contextual features through learned weights, which can share the credit between correlated features.
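
As a rough sketch of what "learned weights" means here, a MaxEnt classifier scores each label by summing the weights of the active features and then normalizes the scores with a softmax. The feature names and weight values below are hypothetical, chosen by hand for illustration; in practice they would be fit to training data.

```python
import math

# Hypothetical weights[label][feature]; real weights are learned from data.
weights = {
    "Politics":   {"contains_ball": -2.0, "contains_vote": 2.0},
    "Sports":     {"contains_ball": 1.5, "contains_vote": -0.5},
    "Technology": {"contains_ball": 0.0, "contains_vote": -0.5},
}

def maxent_probs(active_features):
    """Softmax over the summed feature weights for each label."""
    scores = {
        label: sum(w.get(f, 0.0) for f in active_features)
        for label, w in weights.items()
    }
    z = sum(math.exp(s) for s in scores.values())
    return {label: math.exp(s) / z for label, s in scores.items()}

print(maxent_probs([]))                 # no evidence: uniform over the labels
print(maxent_probs(["contains_ball"]))  # probability shifts toward Sports
```

With no active features every score is zero, so the output is the uniform distribution, matching the zero-knowledge case described earlier.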

Conditional Random Fields (CRF)

Conditional Random Fields (CRFs) are the ultimate evolution of statistical sequence modeling prior to the deep learning era. They combine the ability of Hidden Markov Models (HMMs) to predict sequences with the ability of Maximum Entropy models to use vast numbers of overlapping, custom features.

Generative vs Discriminative

HMMs are Generative: They model the joint probability P(Labels, Words). They try to learn how the data was generated.
CRFs are Discriminative: They model the conditional probability P(Labels | Words) directly. They spend no capacity on modeling the words themselves; they only learn to separate correct label sequences from incorrect ones.

The Feature Function Advantage

The superpower of CRFs in Named Entity Recognition (NER) is that you can hand-craft thousands of highly specific "Feature Functions" that look at the entire sentence at once, not just the previous state.

Example CRF Custom Features for NER

If we are predicting whether the current word is a "Person" entity, a CRF can ingest all these features simultaneously:

F1: Is the current word Capitalized? (Yes/No)
F2: Does the previous word == "Mr."? (Yes/No)
F3: Is the word entirely digits? (Yes/No)
F4: Is the word in our predefined list of cities? (Yes/No)
F5: Does the word end in the suffix "-tion"? (Yes/No)

An HMM cannot easily use these overlapping features, because its emission model assumes each word depends only on its own hidden state. A CRF instead learns a weight for each feature function and sums the weighted features to score each candidate label sequence.
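
The five features above can be written as simple Python checks. The sketch below computes only a local "Person" score for one position as a weighted sum of active features; a full CRF would also include tag-to-tag transition features and normalize over all label sequences. The city list and the weights are hypothetical placeholders, not learned values.

```python
CITIES = {"Paris", "London"}  # hypothetical predefined city list

# Hypothetical weights for the "Person" label; a CRF would learn these.
PERSON_WEIGHTS = {
    "is_capitalized": 1.0,
    "prev_is_mr": 2.5,
    "is_all_digits": -3.0,
    "in_city_list": -2.0,
    "ends_in_tion": -1.5,
}

def extract_features(words, i):
    """Evaluate features F1 to F5 for the word at position i."""
    word = words[i]
    prev = words[i - 1] if i > 0 else "<START>"
    return {
        "is_capitalized": word[:1].isupper(),
        "prev_is_mr": prev == "Mr.",
        "is_all_digits": word.isdigit(),
        "in_city_list": word in CITIES,
        "ends_in_tion": word.lower().endswith("tion"),
    }

def person_score(words, i):
    """Sum the weights of every feature that fires at position i."""
    feats = extract_features(words, i)
    return sum(PERSON_WEIGHTS[name] for name, active in feats.items() if active)

words = ["Mr.", "Smith", "visited", "Paris"]
print(person_score(words, 1))  # 3.5: capitalized and preceded by "Mr."
print(person_score(words, 3))  # -1.0: capitalized but in the city list
```

Both "Smith" and "Paris" are capitalized, yet the overlapping features pull their scores in opposite directions, which is the behavior the feature-function design is meant to allow.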

Modern Usage: BiLSTM-CRF

CRFs didn't die with the advent of Deep Learning! In fact, the state-of-the-art for NER before Transformers was the BiLSTM-CRF architecture.

The BiLSTM reads the text and extracts neural features, outputting raw scores for each tag at each position. A CRF layer on top then scores tag-to-tag transitions, so decoding avoids invalid sequences (e.g., an 'Inside-Person' tag directly following a 'Beginning-Location' tag).
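
The kind of sequence rule mentioned above can be expressed as a small check over BIO-style tags. This is a hand-written illustration of the constraint only: a trained CRF layer typically learns transition scores that make such sequences very unlikely, and the tag spelling (`B-PER`, `I-PER`, `O`) is an assumed convention.

```python
def is_valid_transition(prev_tag, tag):
    """An I-X tag may only follow B-X or I-X; B-* and O are always allowed."""
    if not tag.startswith("I-"):
        return True
    entity = tag[2:]
    return prev_tag in (f"B-{entity}", f"I-{entity}")

def is_valid_sequence(tags):
    """Check every adjacent pair, treating sentence start like an O tag."""
    prev = "O"
    for tag in tags:
        if not is_valid_transition(prev, tag):
            return False
        prev = tag
    return True

print(is_valid_sequence(["B-PER", "I-PER", "O", "B-LOC"]))  # True
print(is_valid_sequence(["B-LOC", "I-PER"]))                # False
```

One way to combine this with a neural tagger is to assign a very low transition score to every pair this check rejects, so Viterbi decoding never selects them.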