
Reinforcement Learning: Learn by Interaction

Reinforcement Learning is the science of decision making: an agent learns to achieve a goal by interacting with an environment, receiving rewards, and improving its policy. Applications range from classical control to mastering Go and real-world robotics.

MDP: (S, A, P, R, γ)
Bellman Equations: Optimality
Deep RL: DQN, PPO, SAC
OpenAI Gym: Environments

The Reinforcement Learning Framework

RL is formalized as a Markov Decision Process (MDP): the agent observes state sₜ, takes action aₜ, receives reward rₜ₊₁, and transitions to next state sₜ₊₁. Goal: maximize cumulative discounted reward.

[Diagram: agent-environment loop. The agent (policy π) receives state sₜ and emits action aₜ; the environment returns reward rₜ₊₁ and next state sₜ₊₁.]

The agent learns to map states to actions to maximize the return Gₜ = Σₖ₌₀^∞ γᵏ rₜ₊ₖ₊₁ (a small numeric example follows the glossary below).

γ (gamma): discount factor
π (policy): behavior
V(s): state-value
Q(s,a): action-value
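
As a quick illustration, a minimal sketch computing the discounted return for a hypothetical reward sequence (the numbers are made up):

rewards = [1.0, 0.0, 0.0, 5.0]   # hypothetical rewards r₁..r₄
gamma = 0.9

# Gₜ = rₜ₊₁ + γ Gₜ₊₁, so compute backwards from the end of the episode
G = 0.0
for r in reversed(rewards):
    G = r + gamma * G
print(G)  # 1.0 + 0.9·0 + 0.81·0 + 0.729·5 = 4.645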

Bellman Equations & Dynamic Programming

Bellman Expectation Equations

V^π(s) = Σ_a π(a|s) [R(s,a) + γ Σ_s' P(s'|s,a) V^π(s')]

Q^π(s,a) = R(s,a) + γ Σ_s' P(s'|s,a) Σ_a' π(a'|s') Q^π(s',a')

Recursive decomposition of value.

Bellman Optimality Equations

V^*(s) = max_a [R(s,a) + γ Σ_s' P(s'|s,a) V^*(s')]

Q^*(s,a) = R(s,a) + γ Σ_s' P(s'|s,a) max_a' Q^*(s',a')

Optimal values satisfy these fixed-point equations.

Policy Iteration
  1. Evaluate V^π (solve linear system)
  2. Improve π: greedy w.r.t. V^π
  3. Repeat until convergence (a sketch follows this list)
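
A compact policy-iteration sketch, assuming the same array conventions as the value-iteration code below (P and R with shape (n_states, n_actions, n_states)):

import numpy as np

def policy_iteration(P, R, gamma=0.9):
    n_states = P.shape[0]
    policy = np.zeros(n_states, dtype=int)
    while True:
        # 1. Policy evaluation: solve the linear system (I - γ P_π) V = R_π
        P_pi = P[np.arange(n_states), policy]                    # (S, S')
        R_pi = np.sum(P_pi * R[np.arange(n_states), policy], axis=1)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # 2. Policy improvement: act greedily with respect to V
        Q = np.sum(P * (R + gamma * V), axis=2)                  # (S, A)
        new_policy = np.argmax(Q, axis=1)
        # 3. Stop when the policy no longer changes
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy
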
Value Iteration
  1. Initialize V(s)=0
  2. V(s) ← max_a [R(s,a) + γ Σ_s' P(s'|s,a) V(s')]
  3. Converges to V^*
Value Iteration (Gridworld)
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-6):
    # P and R have shape (n_states, n_actions, n_states)
    n_states, n_actions = P.shape[0], P.shape[1]
    V = np.zeros(n_states)

    while True:
        delta = 0.0
        for s in range(n_states):
            v = V[s]
            # Bellman optimality backup
            V[s] = max(
                sum(P[s, a, s2] * (R[s, a, s2] + gamma * V[s2])
                    for s2 in range(n_states))
                for a in range(n_actions)
            )
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            break

    # Extract the greedy policy from the converged V
    policy = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        policy[s] = np.argmax([
            sum(P[s, a, s2] * (R[s, a, s2] + gamma * V[s2])
                for s2 in range(n_states))
            for a in range(n_actions)
        ])
    return policy, V

Model-Free Learning: Monte Carlo & TD

When the dynamics (P, R) are unknown, the agent must learn from sampled experience.

Monte Carlo (MC)

Run complete episodes and average the observed returns.

V(s) ← V(s) + α [Gₜ - V(s)]

High variance, unbiased.

Temporal Difference (TD(0))

Bootstrap: V(s) ← V(s) + α [r + γV(s') - V(s)]

Lower variance, biased.

TD Error: δ = r + γV(s') - V(s)

TD(λ) / Eligibility Traces

Unifies MC and TD: the eligibility trace e(s) spreads credit for the TD error over recently visited states.

V(s) ← V(s) + α δ e(s)
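
A minimal tabular TD(λ) sketch with accumulating traces; the Gymnasium-style env and the policy callable are assumptions:

import numpy as np

def td_lambda_episode(env, V, policy, alpha=0.1, gamma=0.99, lam=0.9):
    # One episode of TD(λ) updates on a tabular value function V
    e = np.zeros_like(V)                  # eligibility trace per state
    state, _ = env.reset()
    done = False
    while not done:
        next_state, reward, terminated, truncated, _ = env.step(policy(state))
        done = terminated or truncated
        # TD error δ = r + γ V(s') - V(s); no bootstrap past terminal states
        delta = reward + gamma * V[next_state] * (not terminated) - V[state]
        e[state] += 1.0                   # accumulate trace for visited state
        V += alpha * delta * e            # update every state by its trace
        e *= gamma * lam                  # decay all traces
        state = next_state
    return V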

Q-Learning & SARSA

Q-Learning (Off-Policy)

Q(s,a) ← Q(s,a) + α [r + γ max_a' Q(s',a') - Q(s,a)]

Learns the optimal Q^* regardless of the behavior policy, because the target bootstraps with the max over next actions.

Exploration: ε-greedy

SARSA (On-Policy)

Q(s,a) ← Q(s,a) + α [r + γ Q(s',a') - Q(s,a)]

Learns Q for behavior policy. More stable, safer for live systems.

Q-Learning for FrozenLake (OpenAI Gym)
import gymnasium as gym
import numpy as np

env = gym.make('FrozenLake-v1', is_slippery=True)
n_states = env.observation_space.n
n_actions = env.action_space.n
Q = np.zeros((n_states, n_actions))

alpha = 0.1
gamma = 0.99
epsilon = 0.1
episodes = 10000

for episode in range(episodes):
    state, _ = env.reset()
    done = False
    
    while not done:
        # ε-greedy
        if np.random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q[state])
        
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        
        # Q-Learning update (bootstrap only if the episode did not terminate)
        best_next = np.max(Q[next_state])
        td_target = reward + gamma * best_next * (1 - terminated)
        td_error = td_target - Q[state, action]
        Q[state, action] += alpha * td_error
        
        state = next_state

# Evaluate
state, _ = env.reset()
done = False
total_reward = 0
while not done:
    action = np.argmax(Q[state])
    state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
    total_reward += reward
print(f"Test reward: {total_reward}")

Deep Q-Networks (DQN)

When state space is continuous/high-dimensional, use neural networks as Q-function approximators.

DQN Innovations
  • Experience Replay: store transitions (s,a,r,s') in a buffer and sample minibatches uniformly; this breaks the temporal correlation between consecutive samples (a minimal buffer sketch follows the training loop below).
  • Target Network: a periodically synced copy Q_target supplies the TD target, stabilizing training.
  • Huber Loss: bounds the TD-error gradient, with an effect similar to gradient clipping.
DQN Training Loop (PyTorch)
import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim)
        )
    
    def forward(self, x):
        return self.net(x)

# Training step
def optimize_dqn():
    if len(replay_buffer) < batch_size:
        return
    states, actions, rewards, next_states, dones = replay_buffer.sample(batch_size)
    
    # Compute Q(s, a)
    q_values = policy_net(states).gather(1, actions)
    
    # Compute target: r + γ max_a' Q_target(s', a')
    with torch.no_grad():
        next_q_values = target_net(next_states).max(1, keepdim=True)[0]
        targets = rewards + gamma * next_q_values * (1 - dones)
    
    loss = nn.HuberLoss()(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
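
The loop above assumes a replay_buffer, a target_net, and a sync schedule; a minimal sketch of those pieces (the names and the capacity are assumptions):

import random
from collections import deque
import torch

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions fall off

    def push(self, s, a, r, s2, done):
        self.buffer.append((s, a, r, s2, done))

    def __len__(self):
        return len(self.buffer)

    def sample(self, batch_size):
        s, a, r, s2, d = zip(*random.sample(self.buffer, batch_size))
        return (torch.as_tensor(s, dtype=torch.float32),
                torch.as_tensor(a, dtype=torch.int64).unsqueeze(1),
                torch.as_tensor(r, dtype=torch.float32).unsqueeze(1),
                torch.as_tensor(s2, dtype=torch.float32),
                torch.as_tensor(d, dtype=torch.float32).unsqueeze(1))

# Periodic hard sync of the target network (every N optimize steps):
# if step % target_update_interval == 0:
#     target_net.load_state_dict(policy_net.state_dict())
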
DQN Variants: Double DQN (reduce overestimation), Dueling DQN (separate V and advantage), PER (prioritized replay), Rainbow (combines all).

Policy Gradient: REINFORCE

Directly optimize policy π(a|s; θ) using gradient ascent on expected return.

Policy Gradient Theorem

∇_θ J(θ) = E_π [∇_θ log π(a|s; θ) · Q^π(s,a)]

REINFORCE: Monte Carlo estimate of Q^π using Gₜ.

# REINFORCE update for one episode; log_probs[t] = log π(aₜ|sₜ; θ)
# is collected during the rollout and rewards[t] = rₜ₊₁
returns, G = [], 0.0
for r in reversed(rewards):              # compute all Gₜ backwards in O(T)
    G = r + gamma * G
    returns.insert(0, G)
loss = -sum(lp * G for lp, G in zip(log_probs, returns))
optimizer.zero_grad()
loss.backward()
optimizer.step()
Advantage: Reduce Variance

Subtract a baseline b(s): ∇log π · (Gₜ − b(s)). A common choice is the state-value V(s).

A(s,a) = Q(s,a) − V(s) is the advantage function.

Actor-Critic Methods

Combine policy-based (actor) and value-based (critic) learning. Actor updates policy in direction suggested by critic.

A2C / A3C

Actor: ∇log π(a|s) * A(s,a)

Critic: TD error δ = r + γV(s') - V(s)

A3C runs asynchronous parallel workers; A2C is the synchronous variant.
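
A one-step actor-critic update in PyTorch as a sketch; actor, critic, optimizer, and the tensor shapes are assumptions:

import torch

# δ = r + γ V(s') - V(s); the critic regresses on it, the actor ascends it
value = critic(state)                       # V(s)
with torch.no_grad():
    next_value = critic(next_state)         # V(s'), treated as a constant
td_error = reward + gamma * next_value * (1 - done) - value

dist = torch.distributions.Categorical(logits=actor(state))
actor_loss = -dist.log_prob(action) * td_error.detach()   # advantage ≈ δ
critic_loss = td_error.pow(2)

loss = (actor_loss + 0.5 * critic_loss).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()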

PPO – Proximal Policy Optimization

Clipped surrogate objective prevents destructively large policy updates.

L^CLIP(θ) = E[min(r(θ) A, clip(r(θ), 1-ε, 1+ε) A)], where r(θ) = π_θ(a|s) / π_θ_old(a|s) is the probability ratio.

A default choice at OpenAI and DeepMind.
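
A minimal sketch of the clipped surrogate in PyTorch (the function name and arguments are assumptions):

import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    # r(θ) = exp(log π_new - log π_old), the probability ratio
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # negate: minimizing this loss maximizes the clipped objective
    return -torch.min(unclipped, clipped).mean()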

SAC – Soft Actor-Critic

Maximize reward + entropy → better exploration.

J(π) = Σₜ E[rₜ + α H(π(·|sₜ))]

State-of-the-art for continuous control.
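
The key difference from DQN-style targets is the entropy bonus. A sketch of the soft Bellman target, where actor.sample, the twin target critics, and α are assumptions standing in for a full SAC implementation:

with torch.no_grad():
    next_action, next_log_prob = actor.sample(next_states)
    # twin critics: take the minimum to curb overestimation
    q_next = torch.min(q1_target(next_states, next_action),
                       q2_target(next_states, next_action))
    # soft target: subtracting α log π adds the entropy bonus
    target = rewards + gamma * (1 - dones) * (q_next - alpha * next_log_prob)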

DDPG / TD3

Deterministic policy gradients for continuous actions. TD3 = DDPG + twin critics + delayed policy updates + target policy smoothing.

Practical RL with Stable-Baselines3

Industry-standard library for RL. Provides tested implementations of PPO, SAC, DQN, etc.

PPO with Stable-Baselines3
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# Create environment
env = make_vec_env('CartPole-v1', n_envs=4)

# Initialize PPO
model = PPO(
    policy='MlpPolicy',
    env=env,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    verbose=1
)

# Train
model.learn(total_timesteps=100000)

# Save and load
model.save("ppo_cartpole")
model = PPO.load("ppo_cartpole")

# Evaluate
obs = env.reset()
for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
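
SB3 also ships an evaluation helper that averages episode returns:

from stable_baselines3.common.evaluation import evaluate_policy

mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)
print(f"mean reward: {mean_reward:.1f} +/- {std_reward:.1f}")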

Multi-Agent & Advanced RL

MARL

Multiple agents: cooperative, competitive, or mixed.

VDN, QMIX, MADDPG.

Inverse RL

Infer reward function from expert demonstrations.

Hierarchical RL

Options, temporal abstraction.

RL Algorithm Comparison

Algorithm   | Type         | Action Space | Policy     | Stability | Sample Efficiency
Q-Learning  | Value        | Discrete     | Off-policy | ⭐⭐        | ⭐⭐
DQN         | Value        | Discrete     | Off-policy | ⭐⭐⭐       | ⭐⭐⭐
REINFORCE   | Policy       | Both         | On-policy  | –         | –
A2C/A3C     | Actor-Critic | Both         | On-policy  | ⭐⭐⭐       | ⭐⭐
PPO         | Actor-Critic | Both         | On-policy  | ⭐⭐⭐⭐      | ⭐⭐⭐
SAC         | Actor-Critic | Continuous   | Off-policy | ⭐⭐⭐⭐      | ⭐⭐⭐⭐
TD3         | Actor-Critic | Continuous   | Off-policy | ⭐⭐⭐⭐      | ⭐⭐⭐⭐

RL in the Wild

Games

AlphaGo, OpenAI Five (Dota 2), AlphaStar (StarCraft II)

Robotics

Manipulation, locomotion

Drug Discovery

Molecule generation

Finance

Portfolio optimization

OpenAI Gym · Gymnasium · MuJoCo · PyBullet · Unity ML-Agents

Reinforcement Learning Cheatsheet

MDP: (S, A, P, R, γ)
V(s): state value
Q(s,a): action value
TD: bootstrap from V(s')
DQN: replay buffer + target network
PPO: clipped surrogate
SAC: maximum entropy
A2C: advantage actor-critic