Core NLP concepts – short Q&A
20 foundational NLP questions with concise answers on text, tokens, corpora and classic processing pipelines to strengthen your basics.
What is Natural Language Processing (NLP)?
Answer: NLP is a field at the intersection of computer science, linguistics and AI that focuses on enabling machines to understand, interpret and generate human language in a useful way.
What is the difference between speech and language in NLP?
Answer: Speech concerns the acoustic signal (audio waveform) and is handled by ASR/TTS, while language usually refers to the symbolic text level (tokens, sentences, syntax and semantics) processed by NLP models.
What are tokens, types and vocabulary?
Answer: Tokens are individual word or subword occurrences in a corpus, types are the unique token forms, and the vocabulary is the set of all distinct types observed or allowed by the model.
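The token/type distinction can be seen in a few lines of Python (a toy sketch assuming simple whitespace tokenization; real tokenizers are more involved):

```python
# Tokens are occurrences; types are distinct forms; the vocabulary is the set of types.
text = "the cat sat on the mat"
tokens = text.split()   # whitespace tokenization: every occurrence counts
types = set(tokens)     # unique forms only

print(len(tokens))  # 6 tokens
print(len(types))   # 5 types ("the" appears twice)
```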
What is a corpus in NLP?
Answer: A corpus is a large, structured collection of text (or speech transcripts) used for training, evaluating and analyzing NLP models, often curated and annotated for specific tasks.
Why do we need preprocessing before modeling text?
Answer: Preprocessing normalizes raw text (e.g. lowercasing, cleaning, tokenization) and converts it into a consistent, machine-readable form that simplifies feature extraction and improves model robustness.
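A minimal normalization sketch (the exact steps are assumptions; real pipelines vary by task and language):

```python
import re

def preprocess(text):
    """Lowercase, strip punctuation, then tokenize on whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # replace punctuation/symbols with spaces
    return text.split()

print(preprocess("Hello, NLP World!"))  # ['hello', 'nlp', 'world']
```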
What is the difference between rule-based and statistical NLP?
Answer: Rule-based NLP relies on handcrafted linguistic rules and lexicons, whereas statistical or neural NLP learns patterns from data using probabilistic models or deep learning.
What does an NLP pipeline typically look like?
Answer: A classic pipeline includes steps like text normalization, tokenization, POS tagging, parsing or chunking, feature extraction and finally a task-specific model such as a classifier or tagger.
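The staged structure can be sketched as plain functions composed in sequence (the tiny tag lexicon below is invented for illustration, not a real tagger):

```python
# Hypothetical pipeline: normalize -> tokenize -> POS tag -> featurize.
LEXICON = {"the": "DET", "dog": "NOUN", "barks": "VERB"}  # toy lookup tagger

def normalize(text):
    return text.lower().strip()

def tokenize(text):
    return text.split()

def pos_tag(tokens):
    return [(t, LEXICON.get(t, "X")) for t in tokens]

def featurize(tagged):
    # Crude example feature for a downstream classifier: noun count.
    return {"num_nouns": sum(1 for _, tag in tagged if tag == "NOUN")}

features = featurize(pos_tag(tokenize(normalize("The dog barks"))))
print(features)  # {'num_nouns': 1}
```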
What are downstream NLP tasks?
Answer: Downstream tasks are end-user applications built on top of language representations, such as sentiment analysis, machine translation, question answering, summarization and information extraction.
What is the bag-of-words assumption?
Answer: The bag-of-words assumption treats a document as an unordered collection of tokens, ignoring word order and syntax while focusing on token counts or weights as features.
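The order-blindness of bag-of-words is easy to demonstrate with counts (a minimal sketch using Python's `Counter`):

```python
from collections import Counter

def bag_of_words(tokens):
    """Only token counts survive; word order is discarded."""
    return Counter(tokens)

a = bag_of_words("the dog chased the cat".split())
b = bag_of_words("the cat chased the dog".split())
print(a == b)  # True: two sentences with opposite meanings get the same vector
```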
Why is sparsity an issue in traditional NLP features?
Answer: High-dimensional count-based features like one-hot vectors or bag-of-words lead to sparse representations, which can cause inefficiency and poor generalization when data is limited.
What is distributional semantics?
Answer: Distributional semantics is the idea that word meaning can be inferred from the contexts in which words appear, leading to vector-space models where similar words have similar context distributions.
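A toy demonstration of the distributional idea, assuming co-occurrence counts in a ±1-word window and cosine similarity (the three-sentence corpus is illustrative only):

```python
import math
from collections import defaultdict

corpus = [
    "i drink cold water",
    "i drink cold juice",
    "i read a book",
]

# Build each word's context vector from its immediate neighbours.
vectors = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    toks = sentence.split()
    for i, w in enumerate(toks):
        for j in (i - 1, i + 1):
            if 0 <= j < len(toks):
                vectors[w][toks[j]] += 1

def cosine(u, v):
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)

# "water" and "juice" share the context "cold", so they come out
# more similar to each other than either is to "book".
print(cosine(vectors["water"], vectors["juice"]) > cosine(vectors["water"], vectors["book"]))
```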
How do neural embeddings improve on classic features?
Answer: Neural embeddings map words or subwords to dense, low-dimensional vectors that capture semantic similarity and reduce sparsity compared to one-hot or count-based representations.
What is the role of labeled data in supervised NLP?
Answer: Labeled examples (text plus gold labels) are required to train supervised models for tasks like classification or tagging, guiding the model to learn mappings from inputs to desired outputs.
Why are train/validation/test splits important?
Answer: Splitting data prevents overfitting to a single dataset portion, enabling you to tune hyperparameters on validation data and report unbiased performance on a held-out test set.
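A simple shuffled-split sketch (the 80/10/10 fractions and fixed seed are assumptions; stratified or time-based splits are often preferable in practice):

```python
import random

def split(data, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle, then carve off test and validation portions; the rest is train."""
    data = list(data)
    random.Random(seed).shuffle(data)  # fixed seed keeps the split reproducible
    n_test = int(len(data) * test_frac)
    n_val = int(len(data) * val_frac)
    test = data[:n_test]
    val = data[n_test:n_test + n_val]
    train = data[n_test + n_val:]
    return train, val, test

train, val, test = split(range(100))
print(len(train), len(val), len(test))  # 80 10 10
```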
What is the difference between accuracy and F1 in NLP evaluation?
Answer: Accuracy measures the fraction of correctly predicted instances, while F1 balances precision and recall and is preferred when classes are imbalanced or positive labels are rare.
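The imbalance point is worth a worked example: a classifier that predicts only the majority class looks good on accuracy but scores zero on F1 (a minimal sketch for binary labels):

```python
def prf1(gold, pred, positive=1):
    """Precision, recall and F1 for one positive class."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [1, 0, 0, 0, 0]          # rare positive class
pred = [0, 0, 0, 0, 0]          # degenerate all-negative classifier
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(accuracy)          # 0.8 -- looks fine
print(prf1(gold, pred))  # (0.0, 0.0, 0.0) -- F1 exposes the failure
```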
What are stopwords and why might we remove them?
Answer: Stopwords are very frequent function words (like “the”, “is”) that often carry little task-specific information; removing them can reduce noise and dimensionality for some models.
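Stopword filtering is a one-liner once you have a list (the tiny set below is illustrative; libraries ship much larger, language-specific lists):

```python
# Toy stopword filter: keep only content-bearing tokens.
STOPWORDS = {"the", "is", "a", "an", "of"}
tokens = "the plot of the movie is a delight".split()
content = [t for t in tokens if t not in STOPWORDS]
print(content)  # ['plot', 'movie', 'delight']
```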
What is lemmatization in contrast to stemming?
Answer: Stemming crudely chops word endings to obtain a root form, while lemmatization uses vocabulary and morphology to return a valid base form (lemma) like “better” → “good”.
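The contrast can be made concrete with a toy suffix-stripper versus a dictionary lookup (both the rules and the lemma table are made up for illustration, not a real stemmer or lemmatizer):

```python
def crude_stem(word):
    """Blindly strip the first matching suffix, valid word or not."""
    for suffix in ("ies", "ing", "ed", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

# A lemmatizer consults vocabulary/morphology to return a valid base form.
LEMMA_LOOKUP = {"better": "good", "studies": "study", "ran": "run"}

def lemmatize(word):
    return LEMMA_LOOKUP.get(word, word)

print(crude_stem("studies"))  # 'stud' -- not a real word
print(lemmatize("studies"))   # 'study'
print(lemmatize("better"))    # 'good' -- suffix-stripping could never produce this
```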
What is language ambiguity in NLP?
Answer: Ambiguity arises when the same surface form has multiple possible interpretations, such as lexical ambiguity (word senses), syntactic ambiguity (parse structures) or referential ambiguity (pronoun links).
Why is domain and genre important for NLP systems?
Answer: Language statistics change across domains (news vs. social media vs. medical text), so models trained in one domain may not generalize well without adaptation to another.
How do modern transformer models relate to basic NLP concepts?
Answer: Transformers still rely on tokens, vocabularies and corpora, but they learn contextual embeddings and attention-based representations that unify many classic NLP tasks in a single architecture.
🔍 NLP basics covered
This page covers NLP fundamentals: tokens and types, corpora, pipelines, classic features and evaluation metrics that underpin more advanced NLP models.