XLNet
Combining bidirectional context with autoregressive generation.
XLNet was designed to outperform BERT by combining BERT's strength (bidirectional context) with GPT's strength (native autoregressive generation) through a technique called Permutation Language Modeling.
Level 1 — Autoregressive + Bidirectional
BERT trains with artificial [MASK] tokens that never appear at inference time, creating a mismatch between pretraining and real use. XLNet avoids [MASK] entirely by predicting words in a randomly sampled order (a permutation), so every prediction can draw on surrounding words without corrupting the sentence.
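The contrast can be shown with a toy sketch (illustrative only, not real training code): BERT corrupts the input, while XLNet leaves it intact and shuffles only the prediction order.

```python
import random

tokens = ["XLNet", "avoids", "artificial", "mask", "tokens"]

# BERT-style pretraining corrupts the input with a [MASK] placeholder:
bert_input = tokens.copy()
bert_input[3] = "[MASK]"
print(bert_input)  # ['XLNet', 'avoids', 'artificial', '[MASK]', 'tokens']

# XLNet-style pretraining keeps the sentence intact and randomizes only
# the ORDER in which positions are predicted:
order = list(range(len(tokens)))
random.shuffle(order)
for step, pos in enumerate(order):
    visible = sorted(order[:step])  # positions already predicted
    print(f"predict {tokens[pos]!r} given positions {visible}")
```

Notice that the second loop never alters `tokens`: depending on the sampled order, a word's visible context can include positions on both its left and its right.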
Level 2 — Permutation Math
Instead of always factorizing left-to-right as 1-2-3-4, XLNet might train on the order 1-4-3-2. By the time it predicts word 3, it has already seen words 1 and 4, capturing context from both directions without needing the [MASK] placeholder.
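In practice the permutation is realized not by shuffling the input but by an attention mask: each position may attend only to positions that come before it in the sampled order. A minimal sketch, assuming the order 1-4-3-2 (0-indexed positions 0, 3, 2, 1):

```python
import numpy as np

# Visibility mask for the sampled prediction order 1, 4, 3, 2.
# mask[i, j] == 1 means: when predicting position i, the model
# may attend to position j.
order = [0, 3, 2, 1]
n = len(order)
mask = np.zeros((n, n), dtype=int)
for step, pos in enumerate(order):
    for earlier in order[:step]:  # positions predicted before `pos`
        mask[pos, earlier] = 1

print(mask)
# Row 2 (word 3) may attend to columns 0 and 3 (words 1 and 4):
# context from both sides of position 2, with no [MASK] token involved.
```

The input sequence itself stays in natural order; only this mask changes between training examples.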
Level 3 — Long Dependency Modeling
XLNet incorporates Transformer-XL's segment-level recurrence and relative positional encodings, letting it carry context across segments of extremely long documents, whereas BERT's context is hard-capped at 512 tokens.
from transformers import XLNetTokenizer, XLNetModel

# Load the pretrained base model (the tokenizer requires the sentencepiece package)
tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
model = XLNetModel.from_pretrained('xlnet-base-cased')

# Encode a sentence and run a forward pass
inputs = tokenizer("XLNet is powerful for long text.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size)
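The segment-level recurrence XLNet borrows from Transformer-XL can be sketched in a few lines. This is a toy illustration under stated assumptions, not the real implementation: `process_segment` and the memory length of 4 are invented for the example, and the attention step itself is elided.

```python
import numpy as np

def process_segment(segment, memory=None, mem_len=4):
    """Prepend cached hidden states from the previous segment as extra context."""
    if memory is None:
        context = segment
    else:
        context = np.concatenate([memory, segment], axis=0)
    # ... self-attention over `context` would happen here; we only track
    # the cache: the last `mem_len` hidden states become the new memory.
    return context[-mem_len:]

doc = np.random.randn(10, 8)   # a "document": 10 token states, hidden size 8
memory = None
for start in range(0, 10, 5):  # process the document in two segments of 5
    memory = process_segment(doc[start:start + 5], memory)

print(memory.shape)  # (4, 8): context carried across segment boundaries
```

Because each segment attends over the cached states of the one before it, information can propagate far beyond any single segment's length, which is how very long documents stay connected.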