T5 (Text-to-Text Transfer Transformer)
A unified framework where every NLP problem is cast as Text-to-Text.
Developed by Google, T5 reframed how NLP tasks are approached. Instead of maintaining separate models with task-specific heads for classification, translation, and summarization, T5 casts every single NLP task into a text-to-text format.
The model, introduced in the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" (Raffel et al., 2019), treats every problem as feeding the model text and training it to output text. This includes translation, question answering, regression (by formatting numbers as text), and even classification.
A Unified Input/Output Strategy
With T5, you prepend a task prefix to the input, and the model outputs plain text.
Translation
Input: "translate English to German: That is good."
Output: "Das ist gut."
Classification (CoLA)
Input: "cola sentence: The course is jumping well."
Output: "not acceptable"
Summarization
Input: "summarize: [Long Article]"
Output: "[Short Summary]"
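The pattern is uniform enough to capture in a few lines. The helper and prefix table below are our own illustration, but the prefix strings match the ones T5 was trained with:

```python
TASK_PREFIXES = {
    "translate_en_de": "translate English to German",
    "cola": "cola sentence",
    "summarize": "summarize",
}

def make_t5_input(task: str, text: str) -> str:
    # Every task is serialized as "<prefix>: <raw input>" -- one string in,
    # one string out, regardless of what the task actually is.
    return f"{TASK_PREFIXES[task]}: {text}"

print(make_t5_input("cola", "The course is jumping well."))
# cola sentence: The course is jumping well.
```

No architectural change is needed to add a task: you just pick a new prefix and supply (input, target) text pairs.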
Because it uses both the encoder and the decoder, T5 is highly versatile and remains a popular base model for custom fine-tuning (Flan-T5, T5 v1.1, etc.).
Architecture & Variants
T5 follows the original Transformer encoder-decoder architecture, but with a few key modifications: a simplified layer normalization (activations are only rescaled, with no additive bias) is applied outside the residual path, and relative position biases replace absolute position embeddings. This helps the model generalize to longer sequences.
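Schematically, the modified block looks like this. This is a NumPy sketch of the structure, not the actual implementation; the point is that the norm rescales without centering and sits before the sublayer, so the skip connection carries the input through untouched:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # T5-style layer norm: rescale only, no mean subtraction and no bias.
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def pre_norm_block(x, sublayer):
    # Normalization is applied before the sublayer, outside the residual
    # path, so the residual stream passes through unchanged.
    return x + sublayer(rms_norm(x))
```

With a zero sublayer the block reduces to the identity, which is part of what makes deep stacks of such blocks stable to train.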
Model sizes (Small through 11B):
- T5-Small: 60 million parameters
- T5-Base: 220 million
- T5-Large: 770 million
- T5-3B: 3 billion
- T5-11B: 11 billion
Later variants include Flan-T5 (fine-tuned on a massive collection of tasks described via instructions), which improves zero-shot and few-shot capabilities, and mT5 (multilingual T5) trained on mC4 covering 101 languages.
Pre-training: Colossal Clean Crawled Corpus (C4)
T5 was pre-trained on C4, a massive dataset of cleaned English web text. The cleaning process removed duplicates, boilerplate, and low-quality content, resulting in about 750 GB of text.
The pre-training objective is span corruption: random contiguous spans of the input are replaced with sentinel tokens (<extra_id_0>, <extra_id_1>, ...), and the model reconstructs only the dropped-out spans.
Original: Thank you for inviting me to your party last week .
Input (corrupted): Thank you <extra_id_0> me to your party <extra_id_1> week .
Target: <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
This forces the model to understand long-range dependencies and generate fluent text.
The model is trained with a standard maximum likelihood objective (teacher forcing) to predict the target spans.
Fine-tuning and Adapters
Fine-tuning T5 on a downstream task is straightforward: use the same text-to-text format. For tasks like GLUE (cola, sst2, mrpc), the output is a single word ("positive", "negative", "entailment", etc.). For regression (STS-B), the model is trained to output a number string like "3.8".
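For STS-B this works because scores are rounded to the nearest increment of 0.2 before being rendered as text, so only 26 distinct number strings ever appear as targets. A minimal sketch (the helper name is ours):

```python
def stsb_target(score: float) -> str:
    # Round to the nearest 0.2 and format as text, as done in the T5 paper,
    # turning regression into ordinary text generation.
    return f"{round(score * 5) / 5:.1f}"

print(stsb_target(3.79))  # "3.8"
```

At evaluation time the generated string is parsed back into a float and scored with the usual correlation metrics.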
Parameter‑efficient fine‑tuning (PEFT)
With large T5 variants (e.g., 11B), full fine‑tuning is expensive. Methods like LoRA (Low‑Rank Adaptation) and adapters can be applied to T5. The popular Flan‑T5 models are often used with PEFT for instruction‑tuning on custom datasets.
Hands‑on with T5 (Hugging Face)
Below are realistic examples showing how to use T5 for translation, summarization, and classification. We use the transformers library.
from transformers import T5Tokenizer, T5ForConditionalGeneration
model_name = "t5-base"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)
input_text = "translate English to French: The house is wonderful."
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)  # e.g. "La maison est merveilleuse."
long_text = """
The T5 model is a transformer-based architecture that converts all NLP tasks into a text-to-text format.
It was introduced by Google in 2019 and has since become a foundation for many modern language models.
T5 is pretrained on the colossal C4 dataset using a span corruption objective.
It can be fine-tuned for tasks like translation, summarization, question answering, and even classification.
""".strip()
input_text = "summarize: " + long_text
inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)
summary_ids = model.generate(**inputs, max_length=50, min_length=10, num_beams=4, length_penalty=2.0)  # length_penalty only takes effect with beam search
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)  # a short abstractive summary of the passage
# Fine-tuned T5 on CoLA can be used like this:
input_text = "cola sentence: The course is jumping well."
inputs = tokenizer(input_text, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(out[0], skip_special_tokens=True))  # "not acceptable"
input_text = "cola sentence: The dog barked."
inputs = tokenizer(input_text, return_tensors="pt")  # re-tokenize the new sentence
out = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(out[0], skip_special_tokens=True))  # "acceptable"
Note: For demonstration we use the base model without fine‑tuning on CoLA, so the actual output may differ. For production, use a checkpoint that has actually been fine‑tuned on CoLA.
Flan-T5: Instruction Tuning
Flan-T5 is an improved version of T5 fine‑tuned on a massive collection of tasks phrased as instructions (FLAN: Finetuned Language Net). It shows dramatically better zero‑shot and few‑shot performance. The same text‑to‑text format applies, but the prompts read like natural instructions.
"Please answer the following question. What is the boiling point of water?" → "100°C"
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
prompt = "Answer the following question: Who wrote the book 'The Great Gatsby'?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True)) # "F. Scott Fitzgerald"
Flan‑T5 is available in sizes from small to xxl (11B).
How to pre‑train or adapt with span corruption
You can pre‑train T5 from scratch (or continue pre‑training) using the span corruption objective. Below is a simplified data collation example for masking spans:
import random

def span_corruption(text, mask_rate=0.15, span_length=3):
    tokens = text.split()  # word-level split for illustration; T5 uses SentencePiece
    num_spans = max(1, round(len(tokens) * mask_rate / span_length))
    starts = sorted(random.sample(range(len(tokens) - span_length), num_spans))
    corrupted, target, i, sid = [], [], 0, 0
    for s in starts:
        if s < i:  # skip spans that would overlap the previous one
            continue
        corrupted += tokens[i:s] + [f"<extra_id_{sid}>"]
        target += [f"<extra_id_{sid}>"] + tokens[s:s + span_length]
        i, sid = s + span_length, sid + 1
    corrupted += tokens[i:]
    return " ".join(corrupted), " ".join(target)
Hugging Face’s T5 pre‑training example scripts include a FlaxDataCollatorForT5MLM that implements this masking; the core transformers library does not ship a span‑corruption collator, so in practice you adapt that example.
Performance & tradeoffs
T5 achieves state‑of‑the‑art results on many benchmarks after fine‑tuning, but the encoder‑decoder architecture is computationally heavier than encoder‑only models. For generation tasks such as summarization and translation it is competitive with BART. The 11B variant came within about a point of the human baseline on SuperGLUE.
Use model.generate(..., num_beams=4, early_stopping=True) for better quality, or greedy decoding for speed. Half‑precision (fp16) inference can also cut memory usage nearly in half.
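For example, the weights can be loaded directly in half precision. A sketch; fp16 inference is normally run on a GPU, so we move the model there when one is available:

```python
import torch
from transformers import T5ForConditionalGeneration

# Loading in fp16 roughly halves the memory footprint of the weights.
model = T5ForConditionalGeneration.from_pretrained(
    "t5-small", torch_dtype=torch.float16
)
if torch.cuda.is_available():
    model = model.to("cuda")  # fp16 matmuls are fast on modern GPUs
```

Generation calls then work exactly as before; only the memory use and speed change.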
Conclusion & next steps
T5’s unified framework revolutionized transfer learning in NLP. Today, its descendants (Flan‑T5, mT5, T5 v1.1) are widely used in research and production, and the same everything‑as‑text paradigm underlies modern instruction‑following models such as GPT‑4 and PaLM 2, which likewise treat every task as text generation.
To dive deeper, experiment with fine‑tuning T5 on your own custom dataset using Hugging Face Trainer. Also check out ELECTRA, another efficient pre‑training approach, which we cover next.