BERT – short Q&A
20 questions and answers on BERT, including its bidirectional encoder architecture, masked language modeling objective, next sentence prediction and common fine-tuning patterns for NLP tasks.
What does BERT stand for?
Answer: BERT stands for Bidirectional Encoder Representations from Transformers, reflecting its use of transformer encoder layers to learn bidirectional contextual embeddings from large text corpora.
What is the main pretraining objective used by BERT?
Answer: The main objective is masked language modeling: about 15% of input tokens are selected for prediction, most of them replaced by a [MASK] token (the rest by a random token or left unchanged), and BERT learns to recover the original tokens using both left and right context.
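The 80/10/10 corruption rule above can be sketched in pure Python; the function name and toy vocabulary below are illustrative, not from any library:

```python
import random

def mask_tokens(tokens, mask_rate=0.15, vocab=None, seed=0):
    """BERT-style masking: of the ~15% of positions selected,
    80% become [MASK], 10% a random token, 10% stay unchanged.
    Returns (corrupted tokens, labels), where labels[i] holds the
    original token at selected positions and None elsewhere."""
    rng = random.Random(seed)
    vocab = vocab or ["the", "cat", "sat", "dog", "ran"]  # toy vocab
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # special tokens are never masked; skip unselected positions
        if tok in ("[CLS]", "[SEP]") or rng.random() >= mask_rate:
            continue
        labels[i] = tok
        r = rng.random()
        if r < 0.8:
            corrupted[i] = "[MASK]"
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)
        # else: keep the original token (the model must still predict it)
    return corrupted, labels
```

The 10% kept-unchanged case matters: it forces the model to produce good representations even for tokens that look intact.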
What is Next Sentence Prediction (NSP) in BERT?
Answer: NSP is an auxiliary task where BERT predicts whether the second of two input sentences actually follows the first in the corpus (half the time) or was randomly sampled (the other half), encouraging the model to learn inter-sentence relationships useful for downstream tasks.
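A rough sketch of how such 50/50 pairs might be built from an ordered list of sentences (the helper name is made up; real pipelines also respect document boundaries):

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """Build (sentence_a, sentence_b, is_next) NSP examples:
    half the time b is the true next sentence, half the time
    it is a random sentence from elsewhere in the corpus."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], True))
        else:
            j = rng.randrange(len(sentences))
            while j == i + 1:  # a negative pair must not be the true next
                j = rng.randrange(len(sentences))
            pairs.append((sentences[i], sentences[j], False))
    return pairs
```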
How is input formatted for BERT?
Answer: Inputs are tokenized with WordPiece, wrapped with [CLS] at the start and [SEP] between or after sentences, combined with segment (token type) embeddings and positional embeddings before entering the encoder stack.
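A toy illustration of this layout (helper name is made up), showing the token, segment-id and position-id sequences side by side:

```python
def format_input(tokens_a, tokens_b=None):
    """Assemble a BERT input: [CLS] A [SEP] (then optionally B [SEP]),
    with segment id 0 for the first sentence and 1 for the second.
    Position ids are simply 0..len-1."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segments = [0] * len(tokens)
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segments += [1] * (len(tokens_b) + 1)
    positions = list(range(len(tokens)))
    return tokens, segments, positions
```

In the real model, each of the three sequences indexes its own embedding table, and the three embeddings are summed per position before the encoder stack.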
How is BERT typically fine-tuned for classification tasks?
Answer: A small task-specific classifier is added on top of the [CLS] embedding from the final layer, and the entire model is fine-tuned end-to-end on labeled data for tasks like sentiment analysis or NLI.
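As a dependency-free toy sketch of such a head (real implementations use a learned linear layer in a framework like PyTorch, over a 768-dim [CLS] vector for BERT-Base; the tiny dimensions here are purely illustrative):

```python
import math

def classify_cls(cls_vec, weights, biases):
    """Toy classification head on the [CLS] vector:
    one linear layer followed by a softmax over class logits."""
    logits = [sum(w * x for w, x in zip(row, cls_vec)) + b
              for row, b in zip(weights, biases)]
    m = max(logits)                        # subtract max for stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]
```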
How is BERT used for token-level tasks like NER?
Answer: A classifier (sometimes with a CRF) is applied to each token’s final-layer embedding, predicting labels such as BIO tags for named entities or other sequence labeling tasks.
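One common post-processing step is decoding the per-token BIO tags back into entity spans; a rough sketch (this convention ignores stray I- tags with no preceding B-, which real decoders may handle differently):

```python
def bio_to_spans(tags):
    """Convert per-token BIO tags into (start, end, type) entity
    spans, with end exclusive. Example: ["O", "B-PER", "I-PER"]
    yields [(1, 3, "PER")]."""
    spans, start, ent = [], None, None
    for i, tag in enumerate(tags + ["O"]):   # sentinel flushes the last span
        if tag.startswith("B-") or tag == "O" or (
                tag.startswith("I-") and tag[2:] != ent):
            if start is not None:            # close the currently open span
                spans.append((start, i, ent))
                start, ent = None, None
            if tag.startswith("B-"):         # open a new span
                start, ent = i, tag[2:]
        # a matching I- tag just extends the open span
    return spans
```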
Why is BERT considered bidirectional?
Answer: BERT’s self-attention layers see tokens on both sides simultaneously during training, unlike left-to-right language models, giving each token representation access to its full left and right context.
What are common BERT model sizes?
Answer: The original paper introduced BERT-Base (12 layers, 768 hidden size, 12 attention heads, ~110M parameters) and BERT-Large (24 layers, 1024 hidden size, 16 heads, ~340M parameters), with many later variants scaling up or down from these baselines.
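A back-of-the-envelope parameter count for these configurations, assuming the standard architecture (embedding tables, per-layer attention and feed-forward weights, LayerNorms and the pooler); the formula is an approximation, not an official accounting:

```python
def bert_param_count(layers, hidden, vocab=30522, max_pos=512,
                     types=2, ffn_mult=4):
    """Rough parameter count for a BERT encoder. vocab=30522
    matches the original English uncased WordPiece vocabulary."""
    emb = (vocab + max_pos + types) * hidden + 2 * hidden  # tables + emb LayerNorm
    attn = 4 * (hidden * hidden + hidden)                  # Q, K, V, output proj
    ffn = 2 * (hidden * ffn_mult * hidden) + ffn_mult * hidden + hidden
    norms = 2 * 2 * hidden                                 # two LayerNorms per layer
    pooler = hidden * hidden + hidden
    return emb + layers * (attn + ffn + norms) + pooler
```

Plugging in the Base and Large configurations lands close to the commonly cited ~110M and ~340M figures.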
What is WordPiece tokenization in BERT?
Answer: WordPiece splits words into subword units based on frequency statistics, allowing BERT to handle rare or unknown words by composing them from smaller, shared subword pieces in its vocabulary.
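The core greedy longest-match-first procedure can be sketched in a few lines of pure Python (simplified: real WordPiece also handles punctuation splitting, casing and a maximum word length):

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split in the style of
    BERT's WordPiece; continuation pieces carry a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:                 # shrink until a vocab entry matches
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand         # non-initial pieces are prefixed
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:                  # no subword matches at all
            return [unk]
        pieces.append(piece)
        start = end
    return pieces
```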
What is the role of the [CLS] token in BERT?
Answer: The [CLS] token is added at the beginning of every input and its final hidden state is used as a summary representation for classification or sentence-level prediction tasks during fine-tuning.
How does BERT handle multiple sentences as input?
Answer: BERT uses token type embeddings to distinguish segments (e.g. sentence A vs sentence B) and a [SEP] token between them, enabling tasks like next sentence prediction or sentence-pair classification (e.g. NLI, QA).
What are some limitations of the original BERT model?
Answer: Limitations include the quadratic cost of self-attention (inputs are capped at 512 tokens), a fixed pretraining corpus, the NSP objective's questionable value (RoBERTa later dropped it) and the need for separate fine-tuning per task, all of which later models sought to improve.
What are some popular BERT variants?
Answer: Variants include RoBERTa (improved training), ALBERT (parameter sharing), DistilBERT (distilled smaller model), BioBERT and SciBERT (domain-specific) and multilingual BERT for many languages.
How is BERT used for extractive QA?
Answer: BERT encodes concatenated question and context, then two classifier heads predict start and end positions in the context sequence to select the answer span, fine-tuned on QA datasets like SQuAD.
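Decoding the answer span from the start/end scores can be sketched as an exhaustive search (real implementations usually restrict to top-k start/end candidates for speed; `max_len` is an illustrative length cap):

```python
def best_span(start_scores, end_scores, max_len=30):
    """Pick the highest-scoring (start, end) answer span from
    per-position start/end logits, requiring end >= start and a
    bounded span length, as in SQuAD-style decoding."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]   # logits are summed
            if score > best_score:
                best, best_score = (s, e), score
    return best
```

The end >= start constraint is what makes the pair of independent heads produce a coherent span.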
Why did BERT significantly improve many NLP benchmarks?
Answer: BERT combined deep bidirectional context with large-scale pretraining, providing strong general-purpose representations that could be adapted with minimal task-specific changes, boosting performance across GLUE, SQuAD and more.
What is the typical fine-tuning procedure for BERT?
Answer: Practitioners add a small output layer for the target task, initialize from pretrained BERT weights, pick a small learning rate (the paper suggests 2e-5 to 5e-5) with warmup and decay, and train on task data for roughly 2–4 epochs while monitoring validation performance.
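The learning-rate schedule commonly used for BERT fine-tuning is linear warmup followed by linear decay; a small sketch (the defaults are typical values, not mandated by the paper):

```python
def bert_lr(step, total_steps, peak_lr=2e-5, warmup_frac=0.1):
    """Linear warmup to peak_lr over the first warmup_frac of
    training, then linear decay to zero."""
    warmup = max(1, int(total_steps * warmup_frac))
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup))
```

Warmup avoids large destructive updates to the pretrained weights early on, which is one reason fine-tuning BERT is stable at all.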
How does BERT differ from classic word embeddings like word2vec?
Answer: Word2vec produces static embeddings per word type, whereas BERT generates contextual embeddings where the same word form has different vectors depending on its surrounding context and position in the sentence.
What are some deployment considerations for BERT models?
Answer: Considerations include model size and latency, use of distilled or quantized variants, batching strategies, hardware acceleration and potential privacy or bias concerns when applying BERT in production systems.
How did BERT influence later transformer-based models?
Answer: BERT popularized large-scale pretraining and fine-tuning, inspiring improved encoder models, encoder–decoder variants and hybrid approaches that refined objectives, architectures and training strategies.
Why is BERT still relevant despite newer models?
Answer: BERT remains a strong baseline and is widely supported in libraries, with many domain-tuned checkpoints; understanding BERT helps interpret, adapt and compare newer transformer-based architectures in NLP.
🔍 BERT concepts covered
This page covers BERT: encoder stack and input formatting, masked language modeling and NSP, fine-tuning for classification, QA and token labeling, key variants and deployment considerations in modern NLP pipelines.