Natural Language Processing 20 Essential Q/A

Master Natural Language Processing: text preprocessing, word embeddings (Word2Vec, GloVe), transformers (BERT, GPT), NER, sentiment analysis, seq2seq, attention, and LLMs. Concise, interview-ready answers.

Topics: Tokenization · Word2Vec · BERT · GPT · NER · Seq2Seq · Sentiment
1 What are the typical stages in an NLP pipeline? ⚡ Easy
Answer:
  1. Data acquisition: collect text corpus.
  2. Text preprocessing: cleaning, normalization, tokenization.
  3. Feature extraction: TF-IDF, embeddings, bag-of-words.
  4. Modeling: traditional ML or deep learning (RNN, Transformer).
  5. Evaluation: accuracy, F1, BLEU, perplexity.
  6. Deployment & monitoring.
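The stages above can be sketched end-to-end in a few lines; a minimal pure-Python illustration covering preprocessing and bag-of-words feature extraction (the two-sentence corpus is a toy assumption for the example):

```python
import re
from collections import Counter

# Stages 1-2: acquisition + preprocessing (lowercase, strip punctuation, tokenize)
def preprocess(text):
    return re.findall(r"[a-z]+", text.lower())

corpus = ["The cat sat on the mat.", "The dog sat on the log."]
tokenized = [preprocess(doc) for doc in corpus]

# Stage 3: feature extraction -- bag-of-words counts over a shared vocabulary
vocab = sorted({tok for doc in tokenized for tok in doc})

def bow_vector(tokens):
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

features = [bow_vector(doc) for doc in tokenized]
print(vocab)
print(features)
```

Stages 4-6 would then feed `features` into a classifier, evaluate it, and deploy it.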
2 Differentiate: tokenization, stemming, lemmatization. ⚡ Easy
Answer:
  • Tokenization: splitting text into tokens (words, subwords, characters).
  • Stemming: crude heuristic that chops suffixes to reach a root form (e.g., "running" → "run"); the Porter stemmer is the classic algorithm.
  • Lemmatization: dictionary-based, returns the actual lemma (e.g., "better" → "good"). Slower but more accurate.
from nltk.stem import PorterStemmer, WordNetLemmatizer  # WordNetLemmatizer needs the WordNet corpus: nltk.download('wordnet')
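To make the contrast concrete without any NLTK data downloads, here is a toy suffix-stripping stemmer (the rules below are illustrative and far cruder than Porter's) showing why stemming is a heuristic while lemmatization needs a dictionary:

```python
def toy_stem(word):
    # Crude heuristic: strip a common suffix, no dictionary lookup at all
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(toy_stem("running"))  # "runn" -- over-stemming that a real stemmer must repair
print(toy_stem("better"))   # "better" -- stemming cannot map this to "good"; lemmatization can
```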
3 Compare Bag-of-Words (BoW) and TF-IDF. 📊 Medium
Answer: BoW counts word occurrences; it ignores semantics, produces sparse vectors, and lets stopwords dominate. TF-IDF weighs term frequency by inverse document frequency, downweighting common words and highlighting discriminative terms. Both lose word order and context.
TF-IDF(t, d) = TF(t, d) × log(N / DF(t)), where N = number of documents and DF(t) = number of documents containing t
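Plugging the formula in by hand (toy two-document corpus assumed; TF here is normalized as count / document length, one common variant):

```python
import math
from collections import Counter

docs = [["the", "cat", "sat"], ["the", "dog", "barked"]]
N = len(docs)
# Document frequency: in how many documents does each term appear?
df = Counter(t for doc in docs for t in set(doc))

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)
    return tf * math.log(N / df[term])

print(tf_idf("the", docs[0]))  # 0.0: "the" appears in every document
print(tf_idf("cat", docs[0]))  # positive: "cat" is discriminative
```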
4 Explain Word2Vec. Compare CBOW and Skip-gram. 🔥 Hard
Answer: Word2Vec learns dense vector representations from context.
  • CBOW: predict target word from surrounding context. Faster, better for frequent words.
  • Skip-gram: predict context words from target. Works better for small data, rare words.
Both use negative sampling or hierarchical softmax.
  • Strength: supports semantic vector arithmetic (king − man + woman ≈ queen).
  • Limitation: fixed vocabulary, so out-of-vocabulary (OOV) words get no vector.
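The famous analogy can be reproduced with hand-made toy vectors (the 2-D "embeddings" below encode only royalty and gender, purely for illustration; real Word2Vec vectors are learned, high-dimensional, and approximate):

```python
# Toy 2-D embeddings: dimensions are (royalty, gender)
vectors = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
}

def nearest(target, exclude):
    # Nearest neighbor by squared Euclidean distance, skipping the query words
    def dist(v):
        return sum((a - b) ** 2 for a, b in zip(v, target))
    return min((w for w in vectors if w not in exclude), key=lambda w: dist(vectors[w]))

# king - man + woman, component-wise
target = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]
print(nearest(target, exclude={"king", "man", "woman"}))  # queen
```

Excluding the query words follows the standard analogy-evaluation protocol.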
5 How does GloVe differ from Word2Vec? 📊 Medium
Answer: GloVe (Global Vectors) is count-based: it factorizes a global word co-occurrence matrix, while Word2Vec is prediction-based (local context windows). GloVe directly captures global co-occurrence ratios and trains efficiently on smaller corpora. Both produce embeddings of similar quality.
6 Why subword tokenization? Explain BPE and WordPiece. 🔥 Hard
Answer: Subword tokenization solves the OOV problem and handles morphologically rich languages.
  • BPE (Byte-Pair Encoding): iteratively merges the most frequent symbol pairs. Used in GPT.
  • WordPiece: merges the pair that maximizes training-data likelihood. Used in BERT.
  • Unigram LM: treats segmentation probabilistically, pruning a large seed vocabulary (used in SentencePiece).
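A minimal sketch of BPE's merge loop (the tiny vocabulary and `</w>` end-of-word marker echo the original BPE paper's toy example; tie-breaking between equally frequent pairs is arbitrary here):

```python
from collections import Counter

def pair_counts(vocab):
    # Count adjacent symbol pairs, weighted by word frequency
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge(pair, vocab):
    # Replace every occurrence of the pair with its concatenation
    old, new = " ".join(pair), "".join(pair)
    return {word.replace(old, new): freq for word, freq in vocab.items()}

vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for _ in range(3):
    pairs = pair_counts(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge(best, vocab)
print(list(vocab))  # "est" has been merged into a single subword
```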
7 Describe BERT's architecture and pre-training objectives. 🔥 Hard
Answer: BERT = Bidirectional Encoder Representations from Transformers. Encoder-only Transformer, deep bidirectional. Pre-trained on:
  • MLM (Masked Language Model): 15% of tokens are masked; the model predicts the originals.
  • NSP (Next Sentence Prediction): predict whether sentence B follows sentence A.
Fine-tuned for downstream tasks.
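MLM's corruption step can be sketched as follows (the 80/10/10 mask/random/keep split is from the BERT paper; the token list, toy vocabulary, and higher-than-usual mask rate are assumptions made so the tiny example does something visible):

```python
import random

random.seed(0)
vocab = ["the", "cat", "sat", "on", "mat", "dog"]
tokens = ["the", "cat", "sat", "on", "the", "mat"]

def mask_for_mlm(tokens, mask_rate=0.3):
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            targets[i] = tok  # the model must predict the original token
            r = random.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"              # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = random.choice(vocab)  # 10%: random token
            # else 10%: keep the token unchanged
    return corrupted, targets

corrupted, targets = mask_for_mlm(tokens)
print(corrupted, targets)
```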
8 Compare GPT and BERT architectures. 🔥 Hard
Answer:
  • BERT: encoder-only, bidirectional (attends to all tokens), pre-trained with MLM + NSP. Better for understanding tasks (classification, NER).
  • GPT: decoder-only, causally masked, unidirectional, autoregressive LM. Better for generation (stories, code, chat).
9 What is Named Entity Recognition? How is it typically modeled? 📊 Medium
Answer: NER locates and classifies entities (person, org, location, date). Classic: CRF with handcrafted features. Modern: BiLSTM-CRF or Transformer (BERT) + linear layer + CRF. IOB tagging (Inside, Outside, Beginning).
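IOB tagging in practice, with a decoder that collapses tags back into entity spans (the sentence and tags are chosen for illustration):

```python
# Each token gets a B- (begin), I- (inside), or O (outside) label per entity type
tokens = ["Barack", "Obama", "visited", "New", "York", "yesterday"]
tags   = ["B-PER",  "I-PER", "O",       "B-LOC", "I-LOC", "O"]

def decode_entities(tokens, tags):
    # Collapse IOB tags into (entity_text, entity_type) spans
    entities, current, etype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [], None
    if current:
        entities.append((" ".join(current), etype))
    return entities

print(decode_entities(tokens, tags))  # [('Barack Obama', 'PER'), ('New York', 'LOC')]
```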
10 List approaches for sentiment analysis. 📊 Medium
Answer:
  1. Lexicon-based: SentiWordNet, VADER – use sentiment dictionaries.
  2. Machine learning: Naive Bayes, SVM with TF-IDF.
  3. Deep learning: LSTM, CNN, Transformer fine-tuning.
  4. Aspect-based: fine-grained (product aspects).
11 How does seq2seq with attention work for NLP tasks? 🔥 Hard
Answer: The encoder RNN reads the source sentence and produces a sequence of hidden states. At each decoding step, attention computes a context vector as a weighted sum of those encoder states. The decoder RNN generates target tokens conditioned on the context vector and previous outputs. This was a breakthrough for machine translation and summarization.
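The context-vector computation is just a softmax over alignment scores followed by a weighted sum. A sketch with dot-product scoring and toy 2-D encoder states (both assumptions; real models use learned scoring functions and high-dimensional states):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_context(decoder_state, encoder_states):
    # Score each encoder state against the decoder state (dot product),
    # normalize scores to weights, then take the weighted sum of encoder states.
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    weights = softmax(scores)
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(len(encoder_states[0]))]
    return context, weights

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context, weights = attention_context([1.0, 0.0], encoder_states)
print(weights)  # encoder states aligned with the decoder state get the most weight
```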
12 What is perplexity in language modeling? 📊 Medium
Answer: Perplexity (PPL) = exp(cross-entropy loss). It measures how "surprised" the model is by test data; lower is better. Intuition: PPL is the average number of equally likely choices the model faces per token (2^entropy when entropy is in bits). PPL cannot be compared across different tokenizations.
PPL = exp( - (1/N) Σ log P(w_i | w_<i) )
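Computing PPL directly from the formula (the per-token probabilities P(w_i | w_<i) below are made up for illustration):

```python
import math

# Hypothetical model probabilities for a 4-token test sequence
probs = [0.25, 0.5, 0.1, 0.25]

cross_entropy = -sum(math.log(p) for p in probs) / len(probs)
ppl = math.exp(cross_entropy)
print(round(ppl, 3))  # ≈ 4.23: on average, as uncertain as choosing among ~4 options per token
```

Equivalently, PPL is the inverse geometric mean of the token probabilities.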
13 Differentiate BLEU and ROUGE metrics. 🔥 Hard
Answer:
  • BLEU: precision-based n-gram overlap with a brevity penalty. Standard in machine translation.
  • ROUGE: recall-based; measures n-gram overlap (ROUGE-N) and longest common subsequence (ROUGE-L). Standard in summarization.
Both correlate moderately with human judgment.
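A deliberately simplified BLEU-1 sketch (clipped unigram precision × brevity penalty; real BLEU takes a geometric mean of clipped 1- to 4-gram precisions over a whole corpus):

```python
import math
from collections import Counter

def bleu1(candidate, reference):
    # Clipped unigram precision: a candidate word can match at most as many
    # times as it appears in the reference
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum(min(c, ref[w]) for w, c in cand.items())
    precision = overlap / len(candidate)
    # Brevity penalty punishes candidates shorter than the reference
    bp = 1.0 if len(candidate) >= len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * precision

ref = "the cat sat on the mat".split()
print(bleu1("the cat sat on the mat".split(), ref))   # 1.0 for an exact match
print(bleu1("the the the the the the".split(), ref))  # clipping stops pure repetition from scoring high
```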
14 Why do Transformers dominate LSTMs in NLP? 📊 Medium
Answer:
  • Parallelization (no sequential dependency).
  • Long-range dependencies via direct pairwise attention.
  • Better gradient flow (no vanishing).
  • Scales with compute/data (large LMs).
15 Fine-tuning vs feature-based transfer learning in NLP. 📊 Medium
Answer:
  • Feature-based: use pretrained embeddings (Word2Vec, GloVe) as static features.
  • Fine-tuning: load pretrained model (BERT) and update all weights on downstream task. Generally better performance.
16 What is prompting and in-context learning in LLMs? 🔥 Hard
Answer:
  • Prompting: providing task description in natural language to LLM.
  • In-context learning: include few examples in prompt; model infers pattern without gradient updates.
  • Zero-shot: no examples; few-shot: 2-5 examples.
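Few-shot prompting is just string construction; a sketch with an illustrative sentiment template (the examples, labels, and format are assumptions, not a prescribed API):

```python
examples = [("I loved this movie!", "positive"),
            ("Terrible, a waste of time.", "negative")]
query = "The plot was gripping from start to finish."

prompt = "Classify the sentiment of each review.\n\n"
for text, label in examples:                       # few-shot: in-context examples
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"           # the model completes the pattern
print(prompt)
```

With zero examples in the loop this degenerates to a zero-shot prompt.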
17 Extractive vs abstractive summarization. 📊 Medium
Answer:
  • Extractive: select sentences/phrases directly from source. Simpler, fact-preserving.
  • Abstractive: generate novel sentences, paraphrase. Harder, requires generation (seq2seq, BART, T5).
18 What is coreference resolution? Give example. 📊 Medium
Answer: The task of clustering mentions that refer to the same entity. Example: "John said he would come" → John = he. End-to-end neural models (span-based, BERT) are SOTA.
19 How is POS tagging typically done today? ⚡ Easy
Answer: BiLSTM-CRF or Transformer encoder (BERT) with token-level classifier. Universal Dependencies tags (NOUN, VERB, ADJ). Pre-trained models fine-tuned on annotated corpora (e.g., OntoNotes).
20 Key challenges in deploying large language models? 🔥 Hard
Answer:
  • Hallucination: generating false/unsupported facts.
  • Bias: social stereotypes from training data.
  • Safety: toxic outputs, jailbreaking.
  • Cost: inference expensive, memory footprint.
  • Evaluation: open-ended generation hard to evaluate.

NLP – Interview Cheat Sheet

Preprocessing
  • ✂️ Tokenization (word/subword)
  • 🌿 Lemmatization > Stemming
  • 📊 TF-IDF, BoW
Embeddings
  • Word2Vec: CBOW / Skip-gram
  • GloVe: count-based
  • Contextual: BERT, ELMo
Architectures
  • RNN/LSTM: sequential
  • Transformer: parallel, SOTA
  • BERT: bidirectional
Metrics
  • BLEU: precision (MT)
  • ROUGE: recall (summarization)
  • Perplexity: LM uncertainty

Verdict: "NLP evolution: rules → statistics → embeddings → transformers → LLMs. Understand trade-offs at each stage."