Conditional Random Fields – short Q&A
20 questions and answers on conditional random fields for sequence labeling, including features, inference and training for tasks like POS tagging and NER.
What is a Conditional Random Field (CRF)?
Answer: A CRF is a discriminative probabilistic model that defines a conditional distribution over label sequences given an input sequence, using feature functions and globally normalized potentials.
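This definition can be made concrete with a tiny sketch. The labels, feature functions, and weights below are hypothetical, chosen only to illustrate p(y|x) ∝ exp(Σ w·f) with normalization over all label sequences; a real CRF would learn the weights from data.

```python
import math
from itertools import product

# Toy linear-chain CRF. LABELS, features, and WEIGHTS are illustrative
# assumptions, not learned values.
LABELS = ["O", "N"]  # e.g. non-name vs name token

def features(x, y_prev, y_curr, t):
    """Indicator features inspecting the input and a label pair."""
    feats = []
    if x[t][0].isupper() and y_curr == "N":
        feats.append(("capitalized->N", 1.0))
    if y_prev == "N" and y_curr == "N":
        feats.append(("N-N transition", 1.0))
    return feats

WEIGHTS = {"capitalized->N": 2.0, "N-N transition": 1.0}

def score(x, y):
    """Unnormalized log-score of a full label sequence."""
    total = 0.0
    for t in range(len(x)):
        y_prev = y[t - 1] if t > 0 else "<s>"
        for name, value in features(x, y_prev, y[t], t):
            total += WEIGHTS.get(name, 0.0) * value
    return total

def prob(x, y):
    """p(y|x): globally normalized over all |Y|^T label sequences."""
    z = sum(math.exp(score(x, yy)) for yy in product(LABELS, repeat=len(x)))
    return math.exp(score(x, y)) / z

x = ["Alice", "runs"]
```

Brute-force normalization is exponential in sequence length; it is used here only because the example has four candidate sequences.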
How does a linear-chain CRF relate to HMMs?
Answer: A linear-chain CRF can be seen as a generalization of an HMM that relaxes generative assumptions and allows arbitrary, overlapping input features while still modeling label dependencies along a chain.
What are feature functions in a CRF?
Answer: Feature functions are indicator or real-valued functions that inspect the input sequence and current/neighboring labels, contributing weighted scores to the overall potential of a label sequence.
Why are CRFs considered discriminative models?
Answer: CRFs directly model P(y|x), the conditional probability of labels given inputs, rather than modeling a joint distribution over inputs and outputs, focusing on decision boundaries for the task.
What is the role of the partition function in CRFs?
Answer: The partition function Z(x) sums the exponentiated scores of all possible label sequences, turning unnormalized sequence scores into a valid probability distribution; in linear-chain CRFs it is computed efficiently with the forward algorithm (dynamic programming).
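A minimal sketch of that dynamic program, using made-up log-potentials (`emit`, `trans` below are arbitrary illustrative numbers, not learned weights), compared against brute-force enumeration:

```python
import math
from itertools import product

T, Y = 4, 3  # sequence length, number of labels (example sizes)
emit = [[0.5 * ((t + y) % 3) for y in range(Y)] for t in range(T)]   # log emission potentials
trans = [[0.3 * ((a - b) % 3) for b in range(Y)] for a in range(Y)]  # log transition potentials

def logsumexp(vals):
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def forward_log_z(emit, trans):
    """Forward algorithm: O(T * |Y|^2) instead of enumerating |Y|^T sequences."""
    alpha = list(emit[0])
    for t in range(1, len(emit)):
        alpha = [logsumexp([alpha[a] + trans[a][b] for a in range(len(alpha))]) + emit[t][b]
                 for b in range(len(emit[t]))]
    return logsumexp(alpha)

def brute_log_z(emit, trans):
    """Exponential-time check: sum over every label sequence explicitly."""
    scores = []
    for ys in product(range(Y), repeat=T):
        s = sum(emit[t][ys[t]] for t in range(T))
        s += sum(trans[ys[t - 1]][ys[t]] for t in range(1, T))
        scores.append(s)
    return logsumexp(scores)
```

Working in log space with `logsumexp` avoids numerical underflow, which matters for long sequences.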
How do we perform decoding in a linear-chain CRF?
Answer: Decoding typically uses the Viterbi algorithm adapted for CRFs to find the most probable label sequence given the input, using the learned feature weights and transition potentials.
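A hedged sketch of Viterbi decoding over example log-potentials (the `sin`/`cos` scores below are stand-ins for learned emission and transition weights), with a brute-force check that the max-product recursion finds the true argmax:

```python
import math
from itertools import product

T, Y = 4, 3
emit = [[math.sin(3 * t + y) for y in range(Y)] for t in range(T)]   # illustrative scores
trans = [[math.cos(2 * a + b) for b in range(Y)] for a in range(Y)]

def viterbi(emit, trans):
    """Highest-scoring label sequence via max-product dynamic programming."""
    T, Y = len(emit), len(emit[0])
    delta = list(emit[0])
    back = []
    for t in range(1, T):
        new_delta, ptr = [], []
        for b in range(Y):
            best_a = max(range(Y), key=lambda a: delta[a] + trans[a][b])
            new_delta.append(delta[best_a] + trans[best_a][b] + emit[t][b])
            ptr.append(best_a)
        delta, back = new_delta, back + [ptr]
    # follow back-pointers from the best final state
    y = max(range(Y), key=lambda b: delta[b])
    path = [y]
    for ptr in reversed(back):
        y = ptr[y]
        path.append(y)
    return list(reversed(path))

def brute_best(emit, trans):
    """Exponential-time check: score every sequence and take the argmax."""
    def score(ys):
        return (sum(emit[t][ys[t]] for t in range(len(emit))) +
                sum(trans[ys[t - 1]][ys[t]] for t in range(1, len(emit))))
    return max(product(range(Y), repeat=T), key=score)
```

The recursion mirrors the forward algorithm but replaces the sum over previous states with a max, so it shares the same O(T × |Y|²) cost.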
How is training done in CRFs?
Answer: Training maximizes the conditional log-likelihood of labeled sequences with respect to the feature weights, usually via gradient-based optimization that uses forward–backward to compute expected feature counts.
Why are CRFs suitable for sequence labeling tasks like NER?
Answer: CRFs model dependencies between adjacent labels (e.g. BIO tags) and can incorporate rich, overlapping features from the input, helping enforce consistent label sequences and improving accuracy on structured outputs.
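The kind of label consistency a CRF's transition potentials can capture is easy to state explicitly. A small sketch of the BIO constraint (in practice the CRF either learns to penalize invalid transitions or has them hard-masked to -inf):

```python
# An I-X tag is only consistent after B-X or I-X of the same entity type.
def valid_bio_transition(prev, curr):
    if curr.startswith("I-"):
        etype = curr[2:]
        return prev in (f"B-{etype}", f"I-{etype}")
    return True  # O and B-* may follow any tag

def sequence_is_consistent(tags):
    # Treat the sequence start as O, then check every adjacent pair.
    return all(valid_bio_transition(p, c) for p, c in zip(["O"] + tags, tags))
```

Per-token classifiers without transition modeling can emit sequences that fail this check; the CRF scores whole sequences, so inconsistent ones are disfavored jointly.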
What is label bias, and how do CRFs address it?
Answer: Label bias occurs when locally normalized models overemphasize states with few outgoing transitions; globally normalized CRFs avoid this by normalizing over complete label sequences instead of local steps.
How do CRFs incorporate arbitrary input features?
Answer: Feature functions in CRFs can examine any aspect of the input sequence—lexical, orthographic, contextual or external resources—since the model does not require generative independence assumptions over x.
What is the complexity of inference in linear-chain CRFs?
Answer: Inference (forward–backward or Viterbi) in linear-chain CRFs is O(T × |Y|²), where T is sequence length and |Y| is the number of labels, due to dynamic programming over states and positions.
How do we regularize CRF models?
Answer: Regularization terms such as L2 (Gaussian prior) or L1 are added to the objective to prevent overfitting by penalizing large feature weights, encouraging simpler, more generalizable models.
What are higher-order CRFs?
Answer: Higher-order CRFs include dependencies between more than two neighboring labels (e.g. bigram and trigram label features), offering richer label structure at the cost of increased computational complexity.
How are neural networks combined with CRFs in modern NLP?
Answer: Neural sequence encoders (e.g. BiLSTMs or transformers) produce contextual token representations, and a CRF layer on top models label dependencies, yielding BiLSTM-CRF or BERT-CRF architectures.
What is the difference between CRFs and MEMMs?
Answer: Maximum Entropy Markov Models (MEMMs) locally normalize transition probabilities and suffer label bias, whereas CRFs use global normalization across sequences, mitigating that issue while allowing similar feature sets.
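A tiny numeric illustration of the label-bias effect of local normalization (the scores below are invented): a state with a single outgoing transition keeps probability 1.0 after per-step normalization no matter how poor its score, whereas a globally normalized model keeps the raw score's influence on the whole-sequence distribution.

```python
import math

# Hypothetical outgoing arc scores for two states.
scores_from_A = {"x": -5.0}           # one allowed successor, very low score
scores_from_B = {"x": 2.0, "y": 1.0}  # two successors, competing scores

def local_probs(scores):
    """MEMM-style per-state softmax over outgoing transitions."""
    z = sum(math.exp(s) for s in scores.values())
    return {k: math.exp(s) / z for k, s in scores.items()}

pA = local_probs(scores_from_A)  # the -5.0 score is erased by normalization
pB = local_probs(scores_from_B)
```

State A's weak evidence vanishes entirely under local normalization, which is exactly the pathology global sequence-level normalization in CRFs avoids.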
What is the gradient of the CRF log-likelihood based on?
Answer: The gradient is the difference between empirical feature counts from the labeled data and the model’s expected feature counts under the current parameters, computed using forward–backward.
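This "empirical minus expected" form can be checked on a toy example. The sketch below computes the expectation by brute-force enumeration over a short sequence (forward-backward computes the same quantity efficiently); the single transition feature and the gold sequence are illustrative assumptions.

```python
import math
from itertools import product

LABELS = [0, 1]
T = 3

def feat_count(ys, feature):
    """Count feature(y_prev, y_curr) over the chain; start label is -1."""
    return sum(feature(ys[t - 1] if t else -1, ys[t]) for t in range(T))

def grad_for_weight(w, feature, gold):
    """d(log-likelihood)/dw = empirical count - expected count under the model."""
    def score(ys):
        return w * feat_count(ys, feature)
    seqs = list(product(LABELS, repeat=T))
    z = sum(math.exp(score(ys)) for ys in seqs)
    expected = sum(math.exp(score(ys)) / z * feat_count(ys, feature) for ys in seqs)
    return feat_count(gold, feature) - expected

# Example: a 1->1 transition feature, weight initialized at 0 (uniform model).
f = lambda yp, yc: 1 if (yp == 1 and yc == 1) else 0
g = grad_for_weight(0.0, f, gold=(1, 1, 1))
```

At w = 0 every sequence is equally likely, so the expected 1→1 count is 0.5 while the gold sequence contains 2, giving a positive gradient of 1.5 that pushes the weight toward the observed feature.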
Can CRFs handle overlapping and non-independent features?
Answer: Yes, CRFs are designed to handle many overlapping, correlated features without requiring independence assumptions between them, which is a key advantage over some generative models.
Where are CRFs commonly used in NLP beyond POS and NER?
Answer: CRFs are used in tasks such as chunking, shallow parsing, information extraction, dialogue act tagging and any problem that can be formulated as sequence labeling or segmentation.
What libraries provide CRF implementations for NLP?
Answer: Libraries like CRFsuite, Wapiti, sklearn-crfsuite and some deep learning frameworks (via CRF layers) provide tools to train and apply CRF models in practical NLP pipelines.
How do CRFs compare to modern neural sequence models?
Answer: Pure CRFs are often outperformed by large neural models but remain valuable as interpretable, structured components and as the top layer in neural–CRF hybrids for precise sequence labeling.
🔍 CRF concepts covered
This page covers conditional random fields: linear-chain structure, feature design, partition functions, decoding, training and their integration with modern neural encoders.