NLP Tutorial

Multilingual & Low-Resource NLP

Cross-lingual models, few-shot learning, and NLP for low-resource languages.

Low-Resource NLP

Leveraging Small Datasets

NLP traditionally relies on massive amounts of data (billions of tokens). However, most of the world's 7,000+ languages are "low-resource": they have little digitized text and even less labeled data. This section covers how we build AI when data is scarce.

Level 1 - Transfer Learning

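Instead of training from scratch, we start from pre-trained models. A model pre-trained on a large corpus (for example, 100 GB of English, or better still a multilingual corpus) has already learned a great deal about how language is structured, so it can often be fine-tuned to a useful level on Swahili with only a few thousand examples.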

Few-Shot Learning: Providing just 2-5 examples of a task directly in the prompt is often enough for modern LLMs to perform reasonably well.
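A minimal sketch of how a few-shot prompt might be assembled. The helper name `build_few_shot_prompt`, the "Text:/Label:" layout, and the Swahili example sentences are illustrative assumptions, not a fixed standard; the resulting string would be sent to whichever LLM you use.

```python
def build_few_shot_prompt(examples, query, instruction):
    """Format labeled examples and a new input as one prompt string."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Text: {text}")
        lines.append(f"Label: {label}")
        lines.append("")
    # The final block leaves the label blank for the model to complete.
    lines.append(f"Text: {query}")
    lines.append("Label:")
    return "\n".join(lines)


# Hypothetical Swahili sentiment examples, for illustration only.
examples = [
    ("Chakula kilikuwa kitamu sana.", "positive"),
    ("Huduma ilikuwa mbaya.", "negative"),
    ("Ninapenda kitabu hiki.", "positive"),
]
prompt = build_few_shot_prompt(
    examples,
    query="Filamu hii haikunifurahisha.",
    instruction="Classify the sentiment of each Swahili sentence.",
)
print(prompt)
```

Ending the prompt on an empty "Label:" line nudges the model to answer in the same format as the examples.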

Level 2 - Data Augmentation

When you don't have data, you make it. Common techniques include:

Back-translation: Translate English to French and back to English to create a slightly different (augmented) sentence.
Thesaurus Substitution: Replace words with synonyms, taking care that the swap does not change the sentence's label.
Self-Training: Use a model to label unlabeled data, then retrain the model on its own most confident guesses.
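Two of these techniques can be sketched in a few lines of plain Python. The synonym table, the function names, and the confidence threshold of 0.9 below are all assumptions made for illustration; the "predictions" are made-up stand-ins for real model output.

```python
import random

# Hypothetical synonym table; a real project might draw on WordNet or a
# hand-built lexicon for the target language.
SYNONYMS = {
    "low": ["poor", "small"],
    "crop": ["harvest"],
    "very": ["extremely"],
}


def substitute_synonyms(sentence, synonyms, rng, prob=0.5):
    """Swap each known word for a random synonym with probability `prob`."""
    out = []
    for word in sentence.split():
        options = synonyms.get(word.lower())
        if options and rng.random() < prob:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return " ".join(out)


def select_pseudo_labels(predictions, threshold=0.9):
    """Keep only (text, label) pairs the model predicted with high confidence."""
    return [(text, label) for text, label, conf in predictions if conf >= threshold]


rng = random.Random(0)  # fixed seed so the augmentation is reproducible
augmented = substitute_synonyms("The crop yield was very low", SYNONYMS, rng, prob=1.0)
print(augmented)

# Hypothetical model outputs: (text, predicted label, confidence).
predictions = [
    ("rain is expected tomorrow", "weather", 0.97),
    ("the match ended in a draw", "weather", 0.55),
    ("prices rose at the market", "economy", 0.93),
]
pseudo_labeled = select_pseudo_labels(predictions)
print(pseudo_labeled)
```

Filtering by confidence matters in self-training: the low-confidence second prediction is likely wrong, and retraining on it would reinforce the error.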

Level 3 - Cross-Lingual Projection

Researchers use Parallel Corpora (translations of the same text) to "project" knowledge from a high-resource language like English onto a low-resource one. This involves aligning the two languages' vector spaces so that translation pairs such as "apple" and "poma" (Catalan) end up close together. In practice the match is approximate, not perfect.
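The alignment idea can be sketched in two dimensions, where the best rotation between two sets of paired vectors has a closed-form angle. This is a toy, rotation-only version of an orthogonal Procrustes fit; the embedding values are made up, and real systems work with embeddings of hundreds of dimensions and typically solve the same problem with a matrix decomposition.

```python
import math


def fit_rotation(source, target):
    """Angle (radians) of the 2-D rotation that best maps source onto target.

    Minimizes the summed squared distance between rotated source vectors and
    their paired target vectors (a rotation-only orthogonal Procrustes fit).
    """
    dot = sum(sx * tx + sy * ty for (sx, sy), (tx, ty) in zip(source, target))
    cross = sum(sx * ty - sy * tx for (sx, sy), (tx, ty) in zip(source, target))
    return math.atan2(cross, dot)


def rotate(vec, angle):
    """Rotate a 2-D vector counter-clockwise by `angle` radians."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)


# Toy 2-D "embeddings" (made-up numbers) for word pairs from a seed dictionary.
catalan = {"poma": (0.9, 0.1), "gos": (0.2, 0.8), "casa": (-0.7, 0.4)}
# Pretend the English space is the Catalan space rotated by 40 degrees.
true_angle = math.radians(40)
english = {
    "apple": rotate(catalan["poma"], true_angle),
    "dog": rotate(catalan["gos"], true_angle),
    "house": rotate(catalan["casa"], true_angle),
}
pairs = [("poma", "apple"), ("gos", "dog"), ("casa", "house")]

angle = fit_rotation([catalan[c] for c, _ in pairs], [english[e] for _, e in pairs])
projected_poma = rotate(catalan["poma"], angle)
print(math.degrees(angle))
print(projected_poma, english["apple"])
```

Because the toy English space is an exact rotation of the Catalan one, the fit recovers the rotation and "poma" lands on "apple". With real embeddings the spaces are only roughly similar in shape, so aligned pairs end up near each other instead of coinciding.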

Pseudo-Code: Back-Translation

# Example workflow for augmenting a low-resource dataset.
# `model` is a placeholder for any translation model or API that exposes translate().
original = "The crop yield was very low this year."
en_to_fr = model.translate(original, target="fr")  # "La récolte a été très faible cette année."
fr_to_en = model.translate(en_to_fr, target="en")  # "The harvest was very poor this year."

# Now you have TWO distinct sentences for training: the original and its paraphrase.
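For a version that actually runs, the round trip can be wrapped in a helper that takes the translator as an argument. The sketch below assumes a caller-supplied `translate(text, target)` function; the lookup-table translator is a stand-in used only to exercise the loop, not a real translation API.

```python
def back_translate(sentences, translate, pivot="fr", source="en"):
    """Return paraphrases produced by a round trip through a pivot language.

    `translate(text, target)` is a caller-supplied function (hypothetical
    here); paraphrases identical to the input are dropped because they add
    no new training signal.
    """
    augmented = []
    for sentence in sentences:
        pivot_text = translate(sentence, pivot)
        round_trip = translate(pivot_text, source)
        if round_trip.strip().lower() != sentence.strip().lower():
            augmented.append(round_trip)
    return augmented


# Stand-in translator backed by a lookup table, used only to exercise the loop.
# A real pipeline would call a machine translation model or service instead.
FAKE_TABLE = {
    ("The crop yield was very low this year.", "fr"): "La récolte a été très faible cette année.",
    ("La récolte a été très faible cette année.", "en"): "The harvest was very poor this year.",
    ("Thank you.", "fr"): "Merci.",
    ("Merci.", "en"): "Thank you.",
}


def fake_translate(text, target):
    return FAKE_TABLE[(text, target)]


originals = ["The crop yield was very low this year.", "Thank you."]
new_sentences = back_translate(originals, fake_translate)
print(new_sentences)
```

Short, formulaic sentences often come back unchanged from the round trip, which is why the helper filters out exact repeats before adding anything to the training set.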

Multilingual NLP

One Model, 100+ Languages

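Modern models like mBERT and XLM-RoBERTa are trained on roughly 100 languages simultaneously (104 for mBERT, 100 for XLM-RoBERTa). This lets them learn a shared representation in which sentences with similar meaning tend to land near each other, whatever language they are written in.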

Level 1 - Multilingual vs Monolingual

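A monolingual model is usually more accurate in its specific language (e.g., a pure French BERT), but a multilingual model allows for Zero-Shot Cross-Lingual Transfer: fine-tune it on English data, and it can often perform the same task in Hindi without seeing a single Hindi training example, usually with some loss of accuracy.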

Level 2 - Shared Vocabulary

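Multilingual models use sub-word tokenization (such as BPE, WordPiece, or SentencePiece) learned over the combined multilingual dataset. This creates a vocabulary where common roots across languages (like "bio-") are shared, which keeps the vocabulary, and therefore the embedding table, smaller than separate per-language vocabularies would be.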
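To make the sharing concrete, here is a toy greedy longest-match tokenizer run against a tiny hand-written vocabulary. Both the vocabulary and the segmentation strategy are simplifying assumptions: real tokenizers learn their vocabularies from data, and BPE and SentencePiece segment text differently from this sketch.

```python
def tokenize(word, vocab):
    """Greedy longest-match segmentation of `word` into pieces from `vocab`.

    Characters not covered by any vocabulary piece are emitted one at a time.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the window until the candidate piece is in the vocabulary.
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        pieces.append(word[start:end])
        start = end
    return pieces


# Hypothetical shared vocabulary; real vocabularies are learned from data.
shared_vocab = {"bio", "log", "y", "ie", "ía"}

words = {"en": "biology", "fr": "biologie", "es": "biología"}
for lang, word in words.items():
    print(lang, tokenize(word, shared_vocab))

unique_pieces = {p for w in words.values() for p in tokenize(w, shared_vocab)}
print(sorted(unique_pieces))
```

All three words start with the same two pieces, "bio" and "log", so the model stores those embeddings once and reuses them across English, French, and Spanish.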

Level 3 - Language Adaptation

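In production settings, a common approach is to use Adapters. These are small sets of trainable parameters inserted between the layers of a massive multilingual model. You train only the "French Adapter" while keeping the main 100-language model weights frozen, so each new language costs a small fraction of the storage and compute of a full fine-tune.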
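A back-of-the-envelope count shows why adapters are cheap. The sketch assumes a bottleneck adapter (a down-projection followed by an up-projection, each with a bias) and one adapter per layer; the bottleneck width of 64 and the base-model size of about 270 million parameters are illustrative assumptions, and actual adapter designs vary.

```python
def adapter_params(hidden_size, bottleneck):
    """Parameters in one bottleneck adapter: down-projection plus up-projection.

    Each projection is a weight matrix plus a bias vector.
    """
    down = hidden_size * bottleneck + bottleneck
    up = bottleneck * hidden_size + hidden_size
    return down + up


# Assumed configuration: hidden size 768 and 12 layers match xlm-roberta-base;
# the bottleneck width and one-adapter-per-layer placement are illustrative.
HIDDEN_SIZE = 768
NUM_LAYERS = 12
BOTTLENECK = 64
BASE_MODEL_PARAMS = 270_000_000  # rough figure for the frozen base model

per_adapter = adapter_params(HIDDEN_SIZE, BOTTLENECK)
trainable = per_adapter * NUM_LAYERS
print(per_adapter)
print(trainable)
print(f"{100 * trainable / BASE_MODEL_PARAMS:.2f}% of the base model")
```

Under these assumptions a language adapter comes to roughly 1.2 million trainable parameters, well under 1% of the frozen model, so many language-specific adapters can be stored alongside a single copy of the base weights.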

Loading XLM-RoBERTa (HuggingFace)

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# The pre-trained encoder covers 100 languages out of the box
model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# The classification head is newly initialized; fine-tune before trusting its predictions
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# The same tokenizer handles both English and Spanish input
inputs = tokenizer("Hello, I love NLP!", return_tensors="pt")
inputs_es = tokenizer("¡Hola, me encanta el PLN!", return_tensors="pt")