T5 (Text-to-Text Transfer Transformer)
A unified framework where every NLP problem is cast as Text-to-Text.
Developed by Google, T5 reframed how NLP tasks are approached. Instead of maintaining separate models with task-specific heads for classification, translation, and summarization, T5 casts every single NLP task into a text-to-text format.
The model, introduced in the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" (Raffel et al., 2019), treats every problem as feeding the model text and training it to output text. This includes translation, question answering, regression (by formatting numbers as text), and even classification.
A Unified Input/Output Strategy
With T5, you prepend a task prefix to the input, and the model outputs plain text.
Translation
Input: "translate English to German: That is good."
Output: "Das ist gut."
Classification (CoLA)
Input: "cola sentence: The course is jumping well."
Output: "not acceptable"
Summarization
Input: "summarize: [Long Article]"
Output: "[Short Summary]"
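The pattern is uniform enough to capture in a few lines. The helper and prefix table below are our own illustration, but the prefix strings match the ones T5 was trained with:

```python
TASK_PREFIXES = {
    "translate_en_de": "translate English to German",
    "cola": "cola sentence",
    "summarize": "summarize",
}

def make_t5_input(task: str, text: str) -> str:
    # Every task is serialized as "<prefix>: <raw input>" -- one string in,
    # one string out, regardless of what the task actually is.
    return f"{TASK_PREFIXES[task]}: {text}"

print(make_t5_input("cola", "The course is jumping well."))
# cola sentence: The course is jumping well.
```

No architectural change is needed to add a task: you just pick a new prefix and supply (input, target) text pairs.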
Because it uses both the encoder and the decoder, T5 is highly versatile and remains a popular base model for custom fine-tuning (Flan-T5, T5 v1.1, etc.).
Architecture & Variants
T5 follows the original Transformer encoder-decoder architecture, but with a few key modifications: a simplified layer normalization (activations are only rescaled, with no additive bias) is applied outside the residual path, and relative position biases replace absolute position embeddings. This helps the model generalize to longer sequences.
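Schematically, the modified block looks like this. This is a NumPy sketch of the structure, not the actual implementation; the point is that the norm rescales without centering and sits before the sublayer, so the skip connection carries the input through untouched:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # T5-style layer norm: rescale only, no mean subtraction and no bias.
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def pre_norm_block(x, sublayer):
    # Normalization is applied before the sublayer, outside the residual
    # path, so the residual stream passes through unchanged.
    return x + sublayer(rms_norm(x))
```

With a zero sublayer the block reduces to the identity, which is part of what makes deep stacks of such blocks stable to train.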
Model sizes (Small through 11B):
- T5-Small: 60 million parameters
- T5-Base: 220 million
- T5-Large: 770 million
- T5-3B: 3 billion
- T5-11B: 11 billion
Later variants include Flan-T5 (fine-tuned on a massive collection of tasks described via instructions), which improves zero-shot and few-shot capabilities, and mT5 (multilingual T5) trained on mC4 covering 101 languages.
Pre-training: Colossal Clean Crawled Corpus (C4)
T5 was pre-trained on C4, a massive dataset of cleaned English web text. The cleaning process removed duplicates, boilerplate, and low-quality content, resulting in about 750 GB of text.
The pre-training objective is span corruption: random contiguous spans of the input are replaced with sentinel tokens (<extra_id_0>, <extra_id_1>, ...), and the model reconstructs only the dropped-out spans.
Original: Thank you for inviting me to your party last week .
Input (corrupted): Thank you <extra_id_0> me to your party <extra_id_1> week .
Target: <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
This forces the model to understand long-range dependencies and generate fluent text.
The model is trained with a standard maximum likelihood objective (teacher forcing) to predict the target spans.
Fine-tuning and Adapters
Fine-tuning T5 on a downstream task is straightforward: use the same text-to-text format. For tasks like GLUE (cola, sst2, mrpc), the output is a single word ("positive", "negative", "entailment", etc.). For regression (STS-B), the model is trained to output a number string like "3.8".
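For STS-B this works because scores are rounded to the nearest increment of 0.2 before being rendered as text, so only 26 distinct number strings ever appear as targets. A minimal sketch (the helper name is ours):

```python
def stsb_target(score: float) -> str:
    # Round to the nearest 0.2 and format as text, as done in the T5 paper,
    # turning regression into ordinary text generation.
    return f"{round(score * 5) / 5:.1f}"

print(stsb_target(3.79))  # "3.8"
```

At evaluation time the generated string is parsed back into a float and scored with the usual correlation metrics.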
Parameter‑efficient fine‑tuning (PEFT)
With large T5 variants (e.g., 11B), full fine‑tuning is expensive. Methods like LoRA (Low‑Rank Adaptation) and adapters can be applied to T5. The popular Flan‑T5 models are often used with PEFT for instruction‑tuning on custom datasets.
Hands‑on with T5 (Hugging Face)
Below are realistic examples showing how to use T5 for translation, summarization, and classification. We use the transformers library.
from transformers import T5Tokenizer, T5ForConditionalGeneration
model_name = "t5-base"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)
input_text = "translate English to French: The house is wonderful."
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)  # e.g. "La maison est merveilleuse."
long_text = """
The T5 model is a transformer-based architecture that converts all NLP tasks into a text-to-text format.
It was introduced by Google in 2019 and has since become a foundation for many modern language models.
T5 is pretrained on the colossal C4 dataset using a span corruption objective.
It can be fine-tuned for tasks like translation, summarization, question answering, and even classification.
""".strip()
input_text = "summarize: " + long_text
inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)
summary_ids = model.generate(**inputs, max_length=50, min_length=10, num_beams=4, length_penalty=2.0)  # length_penalty only takes effect with beam search
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)  # a short abstractive summary of the passage
# Fine-tuned T5 on CoLA can be used like this:
input_text = "cola sentence: The course is jumping well."
inputs = tokenizer(input_text, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(out[0], skip_special_tokens=True))  # "not acceptable"
input_text = "cola sentence: The dog barked."
inputs = tokenizer(input_text, return_tensors="pt")  # re-tokenize the new sentence
out = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(out[0], skip_special_tokens=True))  # "acceptable"
Note: For demonstration we use the base model without fine‑tuning on CoLA, so the actual output may differ. For production, use a checkpoint that has actually been fine‑tuned on CoLA.
Flan-T5: Instruction Tuning
Flan-T5 is an improved version of T5 fine‑tuned on a massive collection of tasks phrased as instructions (FLAN: Finetuned Language Net). It shows dramatically better zero‑shot and few‑shot performance. The same text‑to‑text format applies, but the prompts read like natural instructions.
"Please answer the following question. What is the boiling point of water?" → "100°C"
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
prompt = "Answer the following question: Who wrote the book 'The Great Gatsby'?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True)) # "F. Scott Fitzgerald"
Flan‑T5 is available in sizes from small to xxl (11B).
How to pre‑train or adapt with span corruption
You can pre‑train T5 from scratch (or continue pre‑training) using the span corruption objective. Below is a simplified data collation example for masking spans:
import random

def span_corruption(text, mask_rate=0.15, span_length=3):
    tokens = text.split()  # word-level split for illustration; T5 uses SentencePiece
    num_spans = max(1, round(len(tokens) * mask_rate / span_length))
    starts = sorted(random.sample(range(len(tokens) - span_length), num_spans))
    corrupted, target, i, sid = [], [], 0, 0
    for s in starts:
        if s < i:  # skip spans that would overlap the previous one
            continue
        corrupted += tokens[i:s] + [f"<extra_id_{sid}>"]
        target += [f"<extra_id_{sid}>"] + tokens[s:s + span_length]
        i, sid = s + span_length, sid + 1
    corrupted += tokens[i:]
    return " ".join(corrupted), " ".join(target)
Hugging Face’s T5 pre‑training example scripts include a FlaxDataCollatorForT5MLM that implements this masking; the core transformers library does not ship a span‑corruption collator, so in practice you adapt that example.
Performance & tradeoffs
T5 achieves state‑of‑the‑art results on many benchmarks after fine‑tuning, but the encoder‑decoder architecture is computationally heavier than encoder‑only models. For generation tasks such as summarization and translation it is competitive with BART. The 11B variant came within about a point of the human baseline on SuperGLUE.
Use model.generate(..., num_beams=4, early_stopping=True) for better quality, or greedy decoding for speed. Half‑precision (fp16) inference can also cut memory usage nearly in half.
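For example, the weights can be loaded directly in half precision. A sketch; fp16 inference is normally run on a GPU, so we move the model there when one is available:

```python
import torch
from transformers import T5ForConditionalGeneration

# Loading in fp16 roughly halves the memory footprint of the weights.
model = T5ForConditionalGeneration.from_pretrained(
    "t5-small", torch_dtype=torch.float16
)
if torch.cuda.is_available():
    model = model.to("cuda")  # fp16 matmuls are fast on modern GPUs
```

Generation calls then work exactly as before; only the memory use and speed change.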
Conclusion & next steps
T5’s unified framework revolutionized transfer learning in NLP. Today, its descendants (Flan‑T5, mT5, T5 v1.1) are widely used in research and production, and the same everything‑as‑text paradigm underlies modern instruction‑following models such as GPT‑4 and PaLM 2, which likewise treat every task as text generation.
To dive deeper, experiment with fine‑tuning T5 on your own custom dataset using Hugging Face Trainer. Also check out ELECTRA, another efficient pre‑training approach, which we cover next.