XLNet
Combining bidirectional context with autoregressive generation.
XLNet was designed to outperform BERT by combining BERT's strength (bidirectional context) with GPT's strength (native autoregressive generation) through a technique called Permutation Language Modeling.
Level 1 — Autoregressive + Bidirectional
BERT trains with artificial [MASK] tokens that never appear at inference time, creating a mismatch between pretraining and real use. XLNet avoids [MASK] entirely by predicting words in a randomly sampled order (a permutation), so every prediction can draw on surrounding words without corrupting the sentence.
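The contrast can be shown with a toy sketch (illustrative only, not real training code): BERT corrupts the input, while XLNet leaves it intact and shuffles only the prediction order.

```python
import random

tokens = ["XLNet", "avoids", "artificial", "mask", "tokens"]

# BERT-style pretraining corrupts the input with a [MASK] placeholder:
bert_input = tokens.copy()
bert_input[3] = "[MASK]"
print(bert_input)  # ['XLNet', 'avoids', 'artificial', '[MASK]', 'tokens']

# XLNet-style pretraining keeps the sentence intact and randomizes only
# the ORDER in which positions are predicted:
order = list(range(len(tokens)))
random.shuffle(order)
for step, pos in enumerate(order):
    visible = sorted(order[:step])  # positions already predicted
    print(f"predict {tokens[pos]!r} given positions {visible}")
```

Notice that the second loop never alters `tokens`: depending on the sampled order, a word's visible context can include positions on both its left and its right.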
Level 2 — Permutation Math
Instead of always factorizing left-to-right as 1-2-3-4, XLNet might train on the order 1-4-3-2. By the time it predicts word 3, it has already seen words 1 and 4, capturing context from both directions without needing the [MASK] placeholder.
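In practice the permutation is realized not by shuffling the input but by an attention mask: each position may attend only to positions that come before it in the sampled order. A minimal sketch, assuming the order 1-4-3-2 (0-indexed positions 0, 3, 2, 1):

```python
import numpy as np

# Visibility mask for the sampled prediction order 1, 4, 3, 2.
# mask[i, j] == 1 means: when predicting position i, the model
# may attend to position j.
order = [0, 3, 2, 1]
n = len(order)
mask = np.zeros((n, n), dtype=int)
for step, pos in enumerate(order):
    for earlier in order[:step]:  # positions predicted before `pos`
        mask[pos, earlier] = 1

print(mask)
# Row 2 (word 3) may attend to columns 0 and 3 (words 1 and 4):
# context from both sides of position 2, with no [MASK] token involved.
```

The input sequence itself stays in natural order; only this mask changes between training examples.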
Level 3 — Long Dependency Modeling
XLNet incorporates Transformer-XL's segment-level recurrence and relative positional encodings, letting it carry context across segments of extremely long documents, whereas BERT's context is hard-capped at 512 tokens.
from transformers import XLNetTokenizer, XLNetModel

# Load the pretrained base model (the tokenizer requires the sentencepiece package)
tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
model = XLNetModel.from_pretrained('xlnet-base-cased')

# Encode a sentence and run a forward pass
inputs = tokenizer("XLNet is powerful for long text.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size)
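The segment-level recurrence XLNet borrows from Transformer-XL can be sketched in a few lines. This is a toy illustration under stated assumptions, not the real implementation: `process_segment` and the memory length of 4 are invented for the example, and the attention step itself is elided.

```python
import numpy as np

def process_segment(segment, memory=None, mem_len=4):
    """Prepend cached hidden states from the previous segment as extra context."""
    if memory is None:
        context = segment
    else:
        context = np.concatenate([memory, segment], axis=0)
    # ... self-attention over `context` would happen here; we only track
    # the cache: the last `mem_len` hidden states become the new memory.
    return context[-mem_len:]

doc = np.random.randn(10, 8)   # a "document": 10 token states, hidden size 8
memory = None
for start in range(0, 10, 5):  # process the document in two segments of 5
    memory = process_segment(doc[start:start + 5], memory)

print(memory.shape)  # (4, 8): context carried across segment boundaries
```

Because each segment attends over the cached states of the one before it, information can propagate far beyond any single segment's length, which is how very long documents stay connected.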