Multilingual & Low-Resource NLP
Cross-lingual models, few-shot learning, and NLP for low-resource languages.
Low-Resource NLP
Leveraging Small Datasets
NLP traditionally relies on massive amounts of data (billions of tokens). However, most of the world's 7,000+ languages are "low-resource": little digitized text exists for them, and labeled data is rarer still. This section covers how to build NLP systems when data is scarce.
Level 1 — Transfer Learning
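Instead of training from scratch, we start from Pre-trained Models. A model pre-trained on 100GB of English text has already learned general linguistic patterns, so it can often be adapted to Swahili after fine-tuning on only a few thousand examples. Transfer is usually stronger when the pre-training data is multilingual or includes related languages.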
Few-Shot Learning: Providing just 2-5 examples of a task directly in the prompt is often enough for modern LLMs to perform reasonably well.
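As a rough illustration of few-shot prompting, the sketch below assembles a prompt from an instruction, a handful of labeled examples, and a new query. The function name `build_few_shot_prompt`, the "Review:/Sentiment:" layout, and the Swahili sentences are illustrative assumptions, not a fixed format; the resulting string would be sent to whichever LLM you use.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, a few labeled examples, and the unlabeled query."""
    lines = [instruction, ""]
    for text, label in examples:
        lines += [f"Review: {text}", f"Sentiment: {label}", ""]
    # The final entry is left unlabeled so the model completes it.
    lines += [f"Review: {query}", "Sentiment:"]
    return "\n".join(lines)


# Two illustrative Swahili examples (hypothetical data for this sketch).
examples = [
    ("Chakula kilikuwa kitamu sana.", "positive"),
    ("Huduma ilikuwa mbaya.", "negative"),
]

prompt = build_few_shot_prompt(
    "Classify the sentiment of each Swahili review as positive or negative.",
    examples,
    "Filamu ilikuwa nzuri sana.",
)
print(prompt)
```

Keeping the examples in a list makes it easy to try different numbers of shots and compare how the model responds.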
Level 2 — Data Augmentation
When you don't have data, you make it. Common techniques include:
- Back-translation: Translate English to French and back to English to create a slightly different (augmented) sentence.
- Thesaurus Substitution: Replace words with synonyms.
- Self-Training: Use a model to label unlabeled data, keep only the predictions it is most confident about (pseudo-labels), then retrain the model on the enlarged dataset.
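Self-training from the list above can be sketched as a single round of pseudo-labeling. The word-count "model", the confidence measure, and the 0.9 threshold are toy assumptions chosen to keep the example self-contained; a real system would use a proper classifier and its predicted probabilities.

```python
from collections import Counter, defaultdict


def train(labeled):
    """Toy model: count how often each word appears under each label."""
    counts = defaultdict(Counter)
    for text, label in labeled:
        for word in text.lower().split():
            counts[word][label] += 1
    return counts


def predict(model, text):
    """Return (label, confidence); confidence is the winning label's share of votes."""
    votes = Counter()
    for word in text.lower().split():
        votes.update(model.get(word, {}))
    if not votes:
        return None, 0.0
    label, top = votes.most_common(1)[0]
    return label, top / sum(votes.values())


def self_training_round(labeled, unlabeled, threshold=0.9):
    """Label the unlabeled pool and keep only confident predictions."""
    model = train(labeled)
    newly_labeled, still_unlabeled = [], []
    for text in unlabeled:
        label, confidence = predict(model, text)
        if confidence >= threshold:
            newly_labeled.append((text, label))
        else:
            still_unlabeled.append(text)
    return labeled + newly_labeled, still_unlabeled


labeled = [("good harvest", "pos"), ("poor harvest", "neg")]
unlabeled = ["good rain", "harvest season", "poor soil"]

bigger_labeled, remaining = self_training_round(labeled, unlabeled)
print(bigger_labeled)
print(remaining)
```

The threshold is the main design choice: set it too low and the model reinforces its own mistakes, set it too high and almost nothing gets added.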
Level 3 — Cross-Lingual Projection
Researchers use Parallel Corpora (translations of the same text) to "project" knowledge from a high-resource language like English onto a low-resource one. One common approach aligns the two languages' vector spaces so that translation pairs such as "Apple" and "Poma" (Catalan) end up close together. The overlap is approximate, never perfect.
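To make the idea of aligning vector spaces concrete, here is a minimal sketch that learns a rotation from a small seed dictionary and then uses it to translate a word it has not seen. The 2-D embeddings and the seed pairs are invented for illustration, and the closed-form 2-D rotation stands in for the higher-dimensional orthogonal mapping (typically solved with SVD) that real systems would need.

```python
import math


def fit_rotation(pairs):
    """Closed-form 2-D Procrustes: the angle that best rotates source onto target."""
    dot = sum(sx * tx + sy * ty for (sx, sy), (tx, ty) in pairs)
    cross = sum(sx * ty - sy * tx for (sx, sy), (tx, ty) in pairs)
    return math.atan2(cross, dot)


def rotate(vector, angle):
    """Rotate a 2-D vector counter-clockwise by the given angle."""
    x, y = vector
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)


def cosine(a, b):
    """Cosine similarity between two 2-D vectors."""
    return (a[0] * b[0] + a[1] * b[1]) / (math.hypot(*a) * math.hypot(*b))


def nearest(vector, space):
    """Return the word in `space` whose embedding is most similar to `vector`."""
    return max(space, key=lambda word: cosine(vector, space[word]))


# Toy embeddings (assumed values): the Catalan space is the English one rotated.
english = {"apple": (1.0, 0.0), "water": (0.0, 1.0), "house": (1.0, 1.0)}
catalan = {"poma": (0.0, 1.0), "aigua": (-1.0, 0.0), "casa": (-1.0, 1.0)}

# Seed dictionary, e.g. word pairs extracted from a parallel corpus.
seed_dictionary = [("apple", "poma"), ("water", "aigua")]

angle = fit_rotation([(english[en], catalan[ca]) for en, ca in seed_dictionary])
projected_house = rotate(english["house"], angle)
match = nearest(projected_house, catalan)
print(match)
```

Note that "house" was not in the seed dictionary: the mapping learned from two word pairs carries over to the rest of the vocabulary, which is what makes projection useful when bilingual data is scarce.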
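# Example workflow for augmenting a low-resource dataset with back-translation
from transformers import pipeline

# One possible choice of translation models; any EN<->FR translation system would do
en_to_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
fr_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
original = "The crop yield was very low this year."
french = en_to_fr(original)[0]["translation_text"]  # e.g. "La récolte a été très faible cette année."
paraphrase = fr_to_en(french)[0]["translation_text"]  # e.g. "The harvest was very poor this year."
# If the paraphrase differs from the original, you now have two training sentences.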
Multilingual NLP
One Model, 100+ Languages
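Modern models like mBERT and XLM-RoBERTa are pre-trained on text from roughly 100 languages at once. Because every language shares the same weights, the model learns a shared representation in which similar sentences from different languages tend to map to nearby vectors.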
Level 1 — Multilingual vs Monolingual
A monolingual model (e.g., a French-only BERT) is usually more accurate in its own language, but a multilingual model enables Zero-Shot Cross-Lingual Transfer: fine-tune on labeled English data, then apply the model directly to Hindi without any Hindi training labels, usually at some cost in accuracy.
Level 2 — Shared Vocabulary
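Multilingual models use subword tokenization (for example WordPiece, or BPE and unigram models as implemented in the SentencePiece library) learned over the combined multilingual corpus. The resulting vocabulary shares common subwords across languages (like "bio-"), which keeps the vocabulary and embedding table much smaller than a separate vocabulary per language would require.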
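The following toy sketch shows how a shared subword can emerge when BPE merges are learned over words from several languages at once. The five-word corpus, the two-merge budget, and the tie-breaking rule are assumptions for illustration; production tokenizers train on far larger corpora with tens of thousands of merges.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with a single merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words, num_merges):
    """Learn BPE merges by repeatedly merging the most frequent adjacent pair."""
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Ties are broken by the pair itself so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split a word into subwords by applying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Words from English, French, Italian/Portuguese-style spellings (toy corpus).
corpus = ["biology", "biologie", "biologia", "biosphere", "biografia"]
merges = learn_bpe(corpus, num_merges=2)

for word in corpus + ["biomasa"]:
    print(word, segment(word, merges))
```

With two merges, every word in the toy corpus, and the unseen word "biomasa", starts with the single shared token "bio", so one embedding row serves all of them.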
Level 3 — Language Adaptation
In production settings, a common technique is Adapters: small sets of trainable parameters inserted into the layers of a large multilingual model. You train only the "French Adapter" while keeping the weights of the main 100-language model frozen, so supporting a new language costs a small fraction of the full model's parameters.
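A bottleneck adapter can be sketched in plain Python as a down-projection, a non-linearity, an up-projection, and a residual connection. The class name `BottleneckAdapter`, the layer sizes, the omission of bias terms, and the zero-initialized up-projection are assumptions made for this sketch; real implementations operate on tensors inside each transformer layer.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class BottleneckAdapter:
    """Toy adapter: hidden -> bottleneck -> hidden, added back to the input."""

    def __init__(self, hidden_size, bottleneck_size, seed=0):
        rng = random.Random(seed)
        # Down-projection: small random weights.
        self.down = [
            [rng.gauss(0.0, 0.02) for _ in range(hidden_size)]
            for _ in range(bottleneck_size)
        ]
        # Up-projection starts at zero, so the adapter is initially an identity
        # function and does not disturb the frozen model's behavior.
        self.up = [[0.0] * bottleneck_size for _ in range(hidden_size)]

    def forward(self, hidden):
        squeezed = [max(0.0, z) for z in matvec(self.down, hidden)]  # ReLU
        expanded = matvec(self.up, squeezed)
        return [h + e for h, e in zip(hidden, expanded)]  # residual connection

    def num_parameters(self):
        return sum(len(row) for row in self.down) + sum(len(row) for row in self.up)


adapter = BottleneckAdapter(hidden_size=8, bottleneck_size=2)
hidden_state = [0.5] * 8

print(adapter.forward(hidden_state))
print(adapter.num_parameters())
```

Only the adapter's weights would be updated during training. Here that is 32 parameters for an 8-dimensional hidden state, and the same ratio is what keeps per-language adapters cheap relative to the frozen base model.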
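from transformers import AutoTokenizer, AutoModelForSequenceClassification

# XLM-RoBERTa was pre-trained on text in roughly 100 languages
model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# The classification head is randomly initialized; fine-tune before trusting predictions
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# The same tokenizer and model handle both English and Spanish input
inputs = tokenizer("Hello, I love NLP!", return_tensors="pt")
inputs_es = tokenizer("¡Hola, me encanta el PLN!", return_tensors="pt")
logits_en = model(**inputs).logits  # shape: (1, 2)
logits_es = model(**inputs_es).logits  # shape: (1, 2)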