Huffman Coding
Given character frequencies, build a prefix-free binary code that minimizes total encoded length. Huffman’s greedy algorithm repeatedly merges the two smallest frequencies into a binary tree. This lesson covers the algorithm, tree construction trace, code assignment, and implementations in C, Java, and Python.
Overview
Variable-length codes save space: frequent symbols get shorter bit strings, rare symbols get longer ones. Huffman coding guarantees an optimal prefix code — no codeword is a prefix of another, so decoding is unambiguous.
Frequencies: f:5 e:9 c:12 b:13 d:16 a:45 Greedy merges (two smallest each step): f+e → 14, c+b → 25, 14+d → 30, 25+30 → 55, a+55 → 100 (root) Sample codes (left=0, right=1): a:0 c:100 b:101 f:1100 e:1101 d:111 Weighted bit cost = 224 (optimal for these frequencies)
Problem Statement
Input: A set of characters (or symbols) with non-negative frequencies freq[c].
Output: A binary prefix code for each symbol minimizingΣ freq[c] × code_length[c] (total weighted path length in the Huffman tree).
Prefix property: For any two symbols, neither codeword is a prefix of the other.
Complexity target: O(n log n) with a min-heap (n = distinct symbols); O(n) space for the tree and code table.
Greedy Algorithm
- Create a leaf node for each symbol with its frequency.
- Insert all nodes into a min-priority queue keyed by frequency.
- While more than one node remains:
- Extract the two nodes with smallest frequency.
- Create a new internal node with those two as children; its frequency = sum of child frequencies.
- Insert the new node back into the queue.
- The last remaining node is the root of the Huffman tree.
- Assign
0to left edges and1to right edges; leaf paths form codewords.
Why Greedy Works
Exchange argument (sketch): In an optimal tree, the two symbols with lowest frequency can be assumed to be sibling leaves at maximum depth (swap them with deepest leaves without increasing cost). Removing these two leaves and replacing them with one parent reduces the problem size by one with adjusted frequencies — the same greedy step. By induction, Huffman’s merge rule is optimal.
Code (C, Java, Python)
Python
import heapq
class Node:
__slots__ = ("freq", "char", "left", "right")
def __init__(self, freq, char=None, left=None, right=None):
self.freq = freq
self.char = char
self.left = left
self.right = right
def __lt__(self, other):
return self.freq < other.freq
def build_huffman_tree(freq_map):
"""Greedy: repeatedly merge two smallest frequencies. O(n log n)."""
heap = [Node(f, c) for c, f in freq_map.items()]
heapq.heapify(heap)
while len(heap) > 1:
left = heapq.heappop(heap)
right = heapq.heappop(heap)
parent = Node(left.freq + right.freq, None, left, right)
heapq.heappush(heap, parent)
return heap[0] # root
def build_codes(root, prefix="", codes=None):
if codes is None:
codes = {}
if root.char is not None:
codes[root.char] = prefix or "0"
return codes
build_codes(root.left, prefix + "0", codes)
build_codes(root.right, prefix + "1", codes)
return codes
if __name__ == "__main__":
freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
root = build_huffman_tree(freq)
print(build_codes(root))
Java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;
public class HuffmanCoding {
static class Node implements Comparable<Node> {
int freq;
Character ch;
Node left, right;
Node(int f, Character c) { freq = f; ch = c; }
Node(int f, Node l, Node r) { freq = f; left = l; right = r; }
public int compareTo(Node o) { return Integer.compare(freq, o.freq); }
}
/** Build Huffman tree by merging two smallest nodes each step. */
public static Node buildTree(Map<Character, Integer> freq) {
PriorityQueue<Node> pq = new PriorityQueue<>();
freq.forEach((c, f) -> pq.offer(new Node(f, c)));
while (pq.size() > 1) {
Node left = pq.poll();
Node right = pq.poll();
pq.offer(new Node(left.freq + right.freq, left, right));
}
return pq.peek();
}
public static void buildCodes(Node node, String path, Map<Character, String> codes) {
if (node.ch != null) {
codes.put(node.ch, path.isEmpty() ? "0" : path);
return;
}
buildCodes(node.left, path + "0", codes);
buildCodes(node.right, path + "1", codes);
}
public static void main(String[] args) {
Map<Character, Integer> freq = Map.of(
'a', 45, 'b', 13, 'c', 12, 'd', 16, 'e', 9, 'f', 5);
Map<Character, String> codes = new HashMap<>();
buildCodes(buildTree(freq), "", codes);
System.out.println(codes);
}
}
C
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
typedef struct Node {
char ch;
int freq;
struct Node *left, *right;
} Node;
Node *new_node(char ch, int freq, Node *l, Node *r) {
Node *n = (Node *)malloc(sizeof(Node));
n->ch = ch; n->freq = freq; n->left = l; n->right = r;
return n;
}
/* Find two smallest nodes in array list; merge and compact. */
Node *build_huffman(Node *nodes[], int *n) {
while (*n > 1) {
int i, min1 = 0, min2 = 1;
if (nodes[min2]->freq < nodes[min1]->freq) { int t = min1; min1 = min2; min2 = t; }
for (i = 2; i < *n; i++) {
if (nodes[i]->freq < nodes[min1]->freq) { min2 = min1; min1 = i; }
else if (nodes[i]->freq < nodes[min2]->freq) min2 = i;
}
Node *parent = new_node('\0', nodes[min1]->freq + nodes[min2]->freq,
nodes[min1], nodes[min2]);
nodes[min1] = parent;
nodes[min2] = nodes[*n - 1];
(*n)--;
}
return nodes[0];
}
void print_codes(Node *root, char *path, int depth) {
if (!root) return;
if (!root->left && !root->right) {
path[depth] = '\0';
printf("%c:%s ", root->ch, depth ? path : "0");
return;
}
path[depth] = '0'; print_codes(root->left, path, depth + 1);
path[depth] = '1'; print_codes(root->right, path, depth + 1);
}
int main(void) {
Node *nodes[] = {
new_node('a', 45, NULL, NULL), new_node('b', 13, NULL, NULL),
new_node('c', 12, NULL, NULL), new_node('d', 16, NULL, NULL),
new_node('e', 9, NULL, NULL), new_node('f', 5, NULL, NULL)
};
int n = 6;
Node *root = build_huffman(nodes, &n);
char path[64];
print_codes(root, path, 0);
printf("\n");
return 0;
}
Output & Trace
Input frequencies
| Symbol | Frequency | Fixed 3-bit code | Huffman code (sample) | Code length |
|---|---|---|---|---|
| a | 45 | 000 | 0 | 1 |
| b | 13 | 001 | 101 | 3 |
| c | 12 | 010 | 100 | 3 |
| d | 16 | 011 | 111 | 3 |
| e | 9 | 100 | 1101 | 4 |
| f | 5 | 101 | 1100 | 4 |
Fixed 3-bit encoding cost: 100 × 3 = 300 bits. Huffman weighted cost: 45×1 + 13×3 + 12×3 + 16×3 + 9×4 + 5×4 = 224 bits.
Greedy merge trace
| Step | Merge (two smallest) | New node freq | Active multiset |
|---|---|---|---|
| 0 | — | — | f5, e9, c12, b13, d16, a45 |
| 1 | f + e | 14 | c12, b13, d16, 14, a45 |
| 2 | c + b | 25 | d16, 14, 25, a45 |
| 3 | 14 + d | 30 | 25, 30, a45 |
| 4 | 25 + 30 | 55 | a45, 55 |
| 5 | a + 55 | 100 | 100 (root) ← done |
Tree sketch (left = 0, right = 1)
(100)
/ \
a:45 (55)
/ \
(25) (30)
/ \ / \
c:12 b:13 (14) d:16
/ \
f:5 e:9
Encoding & Decoding
- Encode: Replace each symbol with its Huffman codeword; concatenate bits.
- Decode: Walk from root — bit
0goes left,1goes right; on reaching a leaf, emit symbol and restart at root. - Store tree or code table with the compressed data (header overhead matters for small files).
Variants & Related Topics
| Topic | Description | Notes |
|---|---|---|
| Canonical Huffman | Standardized code ordering for fast decode | Used in DEFLATE/gzip |
| Adaptive Huffman | Update tree as symbols stream in | No upfront frequency table |
| Shannon-Fano | Split symbols by cumulative probability | Near-optimal, not always minimal |
| Arithmetic coding | Encode whole message as one fraction | Often beats Huffman on small alphabets |
| Run-length + Huffman | Two-stage compression pipeline | JPEG, PNG use similar ideas |
Approach Comparison
| Approach | Time | Space | Notes |
|---|---|---|---|
| Huffman greedy (min-heap) | O(n log n) | O(n) | Optimal prefix code — standard |
| Fixed-length codes | O(n) | O(1) | Simple but usually longer output |
| Brute force codeword assignment | Exponential | — | Impractical |
| Shannon-Fano | O(n log n) | O(n) | Good but not always optimal |
Real-Life Applications
| Domain | Scenario | Mapping |
|---|---|---|
| gzip / DEFLATE | Compress text and files for storage | Huffman stage after LZ77 |
| JPEG | Compress DCT coefficients | Huffman tables in image headers |
| MP3 / AAC | Encode audio frequency data | Entropy coding layer |
| Fax machines | Encode run lengths of black/white pixels | Modified Huffman codes |
| Network protocols | Compress HTTP headers (HPACK) | Huffman-coded static/dynamic tables |
| DNA / bioinformatics | Compress symbol streams (A,C,G,T) | Frequency-based prefix codes |
Interview Tips
- State the greedy rule clearly: always merge the two smallest frequencies.
- Use a min-heap — mention O(n log n) time and why.
- Explain the prefix property and why it enables greedy decoding.
- Draw a small merge trace (4–6 symbols) if asked — interviewers love the step-by-step table.
- Distinguish from 0-1 knapsack and fractional knapsack — Huffman is a tree-building greedy, not ratio sorting.
- Mention real usage: gzip, JPEG — shows practical awareness.
Key Takeaway
Huffman coding = repeatedly merge two least frequent nodes into a binary tree, then assign 0/1 to edges. Produces an optimal prefix code minimizing total weighted bit length. Use a min-heap for efficiency.
Next up: Job sequencing · Fractional knapsack · Greedy introduction