Multilingual NLP
Cross-lingual understanding across human languages.
One Model, 100+ Languages
Modern models like mBERT and XLM-RoBERTa are pretrained on text from 100+ languages simultaneously. This lets them map sentences from different languages into a shared representation space.
Level 1 — Multilingual vs Monolingual
A monolingual model is usually more accurate in its specific language (e.g., a French-only BERT), but a multilingual model enables zero-shot cross-lingual transfer: fine-tune on an English task, and the model can perform the same task in Hindi without any Hindi training data.
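The idea can be shown with a toy sketch. Assume (as the multilingual models above are designed to do) that translations land on nearby points in a shared vector space; the vectors below are invented for the demo, and the tiny perceptron stands in for a task head fine-tuned on English only.

```python
# Toy zero-shot transfer: train a classifier on "English" vectors,
# apply it unchanged to "Hindi" vectors in the same (assumed) shared space.
# All vectors here are made up for illustration.

english_train = [
    ([0.9, 0.1], 1),  # positive sentiment
    ([0.8, 0.2], 1),
    ([0.1, 0.9], 0),  # negative sentiment
    ([0.2, 0.8], 0),
]

# Hindi sentences with the same meanings -> nearby vectors (by assumption)
hindi_test = [([0.85, 0.15], 1), ([0.15, 0.85], 0)]

def train_perceptron(data, epochs=20, lr=0.1):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            pred = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
            err = y - pred
            w[0] += lr * err * x[0]
            w[1] += lr * err * x[1]
            b += lr * err
    return w, b

def predict(w, b, x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

w, b = train_perceptron(english_train)          # "fine-tune in English"
preds = [predict(w, b, x) for x, _ in hindi_test]  # "works in Hindi"
```

The classifier never sees a Hindi example, yet it labels the Hindi vectors correctly, because the shared space did the cross-lingual work.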
Level 2 — Shared Vocabulary
Multilingual models use sub-word tokenization (like BPE or SentencePiece) trained over a combined multilingual dataset. This produces a vocabulary where common roots across languages (like "bio-" in English "biology" and French "biologie") are shared, keeping the vocabulary, and hence the embedding matrix, compact.
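A minimal BPE sketch makes the sharing visible. The toy corpus below mixes English and French words; real tokenizers run the same merge loop over billions of tokens, but even here the most frequent merges produce a sub-word shared by both languages.

```python
from collections import Counter

# Toy bilingual corpus: English and French words around the root "biolog-"
corpus = ["biology", "biologie", "biologist", "biologique"]

def get_pairs(words):
    """Count adjacent token pairs across all words."""
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    return pairs

def apply_merge(words, pair):
    """Replace every occurrence of the pair with its merged token."""
    a, b = pair
    out = []
    for w in words:
        new, i = [], 0
        while i < len(w):
            if i < len(w) - 1 and w[i] == a and w[i + 1] == b:
                new.append(a + b)
                i += 2
            else:
                new.append(w[i])
                i += 1
        out.append(new)
    return out

words = [list(w) for w in corpus]  # start from individual characters
for _ in range(5):                 # learn 5 merges
    best = max(get_pairs(words), key=get_pairs(words).get)
    words = apply_merge(words, best)
```

After five merges, every word in the corpus, English or French, starts with the single shared token "biolog": one embedding serves both languages.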
Level 3 — Language Adaptation
In advanced production setups, we use Adapters: small sets of trainable parameters inserted between the layers of a massive multilingual model. To specialize for French, you train only the "French Adapter" while keeping the main 100-language model's weights frozen.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# XLM-RoBERTa is pretrained on 100 languages out of the box
model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Note: the classification head added here is randomly initialized;
# it must be fine-tuned (e.g., on English data) before the model
# can classify anything, in English or any other language.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# One tokenizer handles every language the model was trained on
inputs_en = tokenizer("Hello, I love NLP!", return_tensors="pt")
inputs_es = tokenizer("¡Hola, me encanta el PLN!", return_tensors="pt")