RoBERTa
In 2019, Facebook AI researchers showed that BERT was significantly under-trained and released RoBERTa (Robustly Optimized BERT Pretraining Approach). Using the same architecture as BERT, it achieved much better results purely by training better.
Level 1 — More Data, More Training
RoBERTa was trained on 160GB of text (ten times BERT's 16GB) and for much longer. It's the "Bodybuilder" version of BERT: same body plan, far more training.
Level 2 — The Optimization Secret
RoBERTa made three major training changes:
- Dynamic Masking: Words are masked differently every time the model sees a sentence, unlike BERT's static masking, where each sentence was masked once during preprocessing and reused.
- Removed NSP: Researchers found that "Next Sentence Prediction" didn't actually help.
- Larger Batches: Training with much larger batches (up to 8,000 sequences) improved both stability and final accuracy.
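The idea behind dynamic masking can be sketched in plain Python: instead of fixing the masked positions once at preprocessing time, the positions are re-sampled every time the sentence is served to the model. This is a minimal illustration, not the actual Hugging Face or RoBERTa implementation, and the function name `dynamic_mask` is invented for this sketch.

```python
import random

def dynamic_mask(tokens, mask_token="<mask>", prob=0.15, rng=None):
    # Re-sample masked positions on every call (RoBERTa-style dynamic masking).
    # Static masking would compute this list once and reuse it every epoch.
    rng = rng or random.Random()
    return [mask_token if rng.random() < prob else tok for tok in tokens]

tokens = "the quick brown fox jumps over the lazy dog".split()

# Each epoch sees a different masking pattern of the same sentence,
# so the model gets more varied prediction targets from the same data.
for epoch in range(3):
    print(dynamic_mask(tokens))
```

In the real library, `DataCollatorForLanguageModeling` achieves the same effect by applying masking inside the batching step rather than during dataset preprocessing.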
Level 3 — When to use RoBERTa?
If you need an encoder for classification, NER, or similarity, and you have enough GPU memory, RoBERTa-Large is almost always a better choice than BERT-Base.
RoBERTa Sentiment Analysis
from transformers import pipeline

# RoBERTa fine-tuned on tweet sentiment
# (labels: LABEL_0 = negative, LABEL_1 = neutral, LABEL_2 = positive)
classifier = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment",
)

result = classifier("I love this tutorial!")
print(result)  # e.g. [{'label': 'LABEL_2', 'score': 0.98...}]