Text Generation Tutorial

Text Generation

Generate creative text with models from Markov to GPT-4.

Natural Language Generation (NLG) focuses on computer systems generating coherent, contextually relevant, and human-like text outputs. It powers creative writing tools, code generation, and automated reporting.

Language Modeling (The Core Theory)

At its heart, text generation usually relies on Autoregressive Language Modeling: calculating the probability distribution of what the next word should be, given all the previous words in the sequence.
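Concretely, an autoregressive model factorizes the probability of a whole sequence with the chain rule: P(w1, ..., wn) = P(w1) × P(w2 | w1) × ... × P(wn | w1..wn-1). A minimal sketch, using a hand-made toy conditional table (the probabilities below are invented purely for illustration):

```python
# Toy autoregressive model: P(next word | previous word).
# These probabilities are made up for illustration only.
cond_prob = {
    ("<s>", "I"): 0.5,
    ("I", "love"): 0.8,
    ("love", "coding"): 0.6,
}

def sequence_probability(words):
    """Chain rule: P(w1..wn) = product of P(wi | context).
    Here the context is truncated to the single previous word."""
    prob = 1.0
    prev = "<s>"  # start-of-sequence marker
    for w in words:
        prob *= cond_prob.get((prev, w), 0.0)
        prev = w
    return prob

print(sequence_probability(["I", "love", "coding"]))
# 0.5 * 0.8 * 0.6 = 0.24
```

Real language models do the same thing, except the conditional distribution comes from a neural network and the context is the entire preceding sequence.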

Decoding Strategies

Once the model scores the possible next words, how do we pick one?

  • Greedy Decoding: Always pick the single highest-probability word. Fast and deterministic, but often produces repetitive, dull text.
  • Beam Search: Keep the top K most probable partial sequences at each step. Produces fluent, logical, but conservative text.
  • Temperature / Top-K / Top-p (Nucleus) Sampling: Inject controlled randomness so the text becomes creative and varied while avoiding pure gibberish.
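The strategies above can be sketched on a toy next-token distribution. The logit values below are invented for illustration; real models produce one score per word in a vocabulary of tens of thousands:

```python
import math
import random

# Toy next-token scores (logits); values are made up for illustration.
logits = {"the": 2.0, "a": 1.5, "banana": 0.5, "quantum": 0.1}

def softmax(scores, temperature=1.0):
    """Turn raw scores into probabilities; temperature < 1 sharpens,
    temperature > 1 flattens the distribution."""
    exps = {w: math.exp(s / temperature) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

# Greedy decoding: always take the argmax.
greedy_pick = max(logits, key=logits.get)

# Temperature sampling: draw from the (re-scaled) distribution.
probs = softmax(logits, temperature=0.7)
sampled_pick = random.choices(list(probs), weights=probs.values())[0]

# Top-k sampling: keep only the k highest-scoring words, then renormalize.
k = 2
top_k = dict(sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:k])
top_k_probs = softmax(top_k)

# Top-p (nucleus) sampling: keep the smallest set of words whose
# cumulative probability reaches p, then sample from that set.
p = 0.9
nucleus, cumulative = {}, 0.0
for w, pr in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
    nucleus[w] = pr
    cumulative += pr
    if cumulative >= p:
        break

print("greedy:", greedy_pick)
print("top-k candidates:", list(top_k))
print("nucleus candidates:", list(nucleus))
```

Note how top-k fixes the number of candidates, while top-p adapts it to the shape of the distribution: a confident model keeps few words, an uncertain one keeps many.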

Level 1 — Classical: Markov Chains

Before deep learning, character or word-level Markov chains were used to generate text by looking up statistical transition probabilities based on the previous N items.

Markov Chain Generator
import random

corpus = ["I", "love", "coding", "in", "Python", "because", "I", "love", "learning", "new", "things"]

# Create a bigram transition model
transitions = {}
for i in range(len(corpus) - 1):
    word = corpus[i]
    next_word = corpus[i+1]
    if word not in transitions:
        transitions[word] = []
    transitions[word].append(next_word)

# Generate Text
current_word = "I"
generated = [current_word]

for _ in range(5):
    if current_word in transitions:
        # Pick a random next word based on historical frequency
        next_word = random.choice(transitions[current_word])
        generated.append(next_word)
        current_word = next_word
    else:
        break

print("Generated Text:", " ".join(generated))
# E.g., "I love learning new things" OR "I love coding in Python because"
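The same idea extends to higher-order chains: conditioning on the previous two words (a trigram model) makes the output more locally coherent, at the cost of needing far more data to observe each context. A minimal sketch on the same toy corpus:

```python
import random

corpus = ["I", "love", "coding", "in", "Python", "because",
          "I", "love", "learning", "new", "things"]

# Trigram transition model: key on the previous TWO words.
transitions = {}
for i in range(len(corpus) - 2):
    key = (corpus[i], corpus[i + 1])
    transitions.setdefault(key, []).append(corpus[i + 2])

state = ("I", "love")
generated = list(state)

for _ in range(4):
    if state not in transitions:
        break
    # Pick a random continuation seen after this two-word context
    next_word = random.choice(transitions[state])
    generated.append(next_word)
    state = (state[1], next_word)  # slide the context window forward

print("Generated Text:", " ".join(generated))
```

On this tiny corpus most contexts have exactly one continuation, so the output closely mirrors the training text; with a large corpus the chain gains genuine variety.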

Level 2 — Transformers (GPT Variants)

The Generative Pre-trained Transformer (GPT) family represents the state-of-the-art in text generation. They use deep self-attention to maintain context across paragraphs and pages of text.

Text Generation with GPT-2
from transformers import pipeline

# Load the open-source GPT-2 model pipeline
generator = pipeline("text-generation", model="gpt2")

# We provide a prompt to autocomplete
prompt = "In a shocking turn of events, scientists have discovered"

# Top-k, top-p, and temperature sampling keep the output creative but coherent
# (do_sample=True is required for these sampling parameters to take effect)
output = generator(prompt, max_length=50, num_return_sequences=1,
                   do_sample=True, top_k=50, top_p=0.95, temperature=0.7)

print(output[0]['generated_text'])
# Output Example: "In a shocking turn of events, scientists have discovered that the deep ocean vents are home to a previously unknown species of bio-luminescent octopuses that communicate via rapid color shifts."