Multilingual NLP Tutorial

Multilingual NLP

Building models that understand and transfer across many human languages.

One Model, 100+ Languages

Modern models like mBERT and XLM-RoBERTa are trained on text from 100+ languages simultaneously. Training on all languages at once pushes them toward a shared representation space, in which semantically similar sentences land close together regardless of language.

Level 1 — Multilingual vs Monolingual

A monolingual model is usually more accurate in its specific language (e.g., a pure French BERT such as CamemBERT), but a multilingual model enables zero-shot cross-lingual transfer: fine-tune on English task data, and the model can handle the same task in Hindi with no Hindi training examples.
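A toy sketch of why this works, assuming the multilingual encoder maps translation pairs to nearby points in a shared space. The phrases and vectors below are invented for illustration; a real setup would fine-tune the full model on labeled English data.

```python
import numpy as np

# Pretend these are sentence embeddings from a multilingual encoder:
# English/Hindi translation pairs land close together in the shared space.
train_en = {
    "great movie": (np.array([0.9, 0.1]), 1),
    "terrible movie": (np.array([0.1, 0.9]), 0),
}
test_hi = {
    "shaandaar film": (np.array([0.88, 0.12]), 1),  # "great movie"
    "bekaar film": (np.array([0.12, 0.88]), 0),     # "terrible movie"
}

# "Fine-tune" a trivial classifier on English only: nearest class centroid.
centroids = {label: vec for vec, label in train_en.values()}

def predict(vec):
    return min(centroids, key=lambda c: np.linalg.norm(vec - centroids[c]))

# Evaluate on Hindi without having seen a single Hindi example.
for text, (vec, label) in test_hi.items():
    assert predict(vec) == label
```

Because the classifier only ever sees vectors in the shared space, it is indifferent to which language produced them; that is the essence of zero-shot transfer.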

Level 2 — Shared Vocabulary

Multilingual models use sub-word tokenization (like BPE or SentencePiece) over a combined dataset. This creates a vocabulary where common roots across languages (like "bio-") are shared, reducing model size.
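To see how sharing works, here is a minimal greedy longest-match tokenizer, a stand-in for BPE/SentencePiece, with one invented vocabulary covering several languages. The root "bio" is stored once but serves English, French, and Spanish alike.

```python
def tokenize(word, vocab):
    """Greedy longest-match sub-word tokenization (WordPiece-style sketch)."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest possible piece starting at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # fall back to a single character
            i += 1
    return pieces

# A tiny illustrative shared vocabulary (not a real model's vocab).
vocab = {"bio", "logy", "logie", "logía", "graphy", "graphie"}

print(tokenize("biology", vocab))   # ['bio', 'logy']   (English)
print(tokenize("biologie", vocab))  # ['bio', 'logie']  (French/German)
print(tokenize("biología", vocab))  # ['bio', 'logía']  (Spanish)
```

One shared "bio" entry replaces per-language copies, which is how a single 250k-piece vocabulary can cover 100 languages without the embedding table exploding.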

Level 3 — Language Adaptation

In advanced production setups, we use adapters: small sets of trainable parameters inserted into a massive multilingual model. You train only the "French adapter" while keeping the main 100-language model weights frozen, so each new language costs a tiny fraction of a full fine-tune.
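A sketch of the standard bottleneck adapter, written in NumPy for clarity. The dimensions are illustrative; in a real model the frozen transformer layer output h passes through a small trainable down-/up-projection with a residual connection.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_bottleneck = 768, 64  # illustrative sizes, not from any checkpoint

class Adapter:
    """h -> h + W_up @ relu(W_down @ h); only these weights are trained."""

    def __init__(self):
        self.w_down = rng.normal(0.0, 0.02, (d_bottleneck, d_model))
        # Zero-initialised up-projection: the adapter starts as the identity,
        # so training begins from the unmodified multilingual model.
        self.w_up = np.zeros((d_model, d_bottleneck))

    def __call__(self, h):
        return h + self.w_up @ np.maximum(self.w_down @ h, 0.0)

adapter = Adapter()
h = rng.normal(size=d_model)  # stand-in for a frozen layer's output
out = adapter(h)
assert out.shape == (768,)
```

With roughly 2 × 768 × 64 ≈ 100k parameters per layer, a full adapter stack is a fraction of a percent of the base model's size, which is what makes per-language adaptation cheap.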

Loading XLM-RoBERTa (HuggingFace)
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# This checkpoint was pretrained on 100 languages
model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Note: the classification head is randomly initialised until you fine-tune it
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# The same tokenizer handles input in any of the 100 languages
inputs = tokenizer("Hello, I love NLP!", return_tensors="pt")
inputs_es = tokenizer("¡Hola, me encanta el PLN!", return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, num_labels)