Reinforcement Learning

RL fundamentals, agents, rewards, and deep reinforcement learning concepts.

Reinforcement Learning: Learn by Interaction

The Reinforcement Learning Framework

RL is formalized as a Markov Decision Process (MDP): at each step t the agent observes state sₜ, takes action aₜ, receives reward rₜ₊₁, and transitions to the next state sₜ₊₁. The goal is to maximize the cumulative discounted reward.

State sₜ → Agent (Policy π) → Action aₜ → Environment → Reward rₜ₊₁, Next state sₜ₊₁

The agent learns to map states to actions so as to maximize the return Gₜ = Σₖ γᵏ rₜ₊ₖ₊₁ (k = 0, 1, 2, ...), where 0 ≤ γ ≤ 1.

γ (gamma): discount factor
π (policy): behavior
V(s): state-value
Q(s,a): action-value
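
As a small illustration of the return Gₜ defined above, the following sketch computes the discounted return for every step of a finished episode by working backwards through the rewards. The function name `discounted_returns` and the sample rewards are illustrative assumptions, not part of any library.

```python
def discounted_returns(rewards, gamma=0.9):
    """Return [G_0, G_1, ...] where G_t = sum_k gamma**k * rewards[t + k].

    rewards[t] is the reward received after the action taken at step t.
    """
    returns = [0.0] * len(rewards)
    g = 0.0
    # Work backwards using G_t = r + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


# A reward of 1 at the last step is worth less the earlier we look at it
print(discounted_returns([0.0, 0.0, 1.0], gamma=0.5))  # [0.25, 0.5, 1.0]
```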

Bellman Equations & Dynamic Programming

Bellman Expectation Equations

V^π(s) = Σ_a π(a|s) [R(s,a) + γ Σ_s' P(s'|s,a) V^π(s')]

Q^π(s,a) = R(s,a) + γ Σ_s' P(s'|s,a) Σ_a' π(a'|s') Q^π(s',a')

Each equation decomposes a value recursively: the immediate reward plus the discounted value of the successor state.
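
One way to read the expectation equation is as an update rule: apply its right-hand side repeatedly until V stops changing. The sketch below does this for a tabular MDP. The array layout (P shaped (S, A, S), R and pi shaped (S, A)) and the function names are assumptions chosen for this example.

```python
import numpy as np


def bellman_expectation_backup(V, P, R, pi, gamma=0.9):
    """Apply the right-hand side of the Bellman expectation equation once."""
    # Q[s, a] = R(s, a) + gamma * sum_s2 P(s2 | s, a) * V(s2)
    Q = R + gamma * (P @ V)
    # V[s] = sum_a pi(a | s) * Q[s, a]
    return np.sum(pi * Q, axis=1)


def iterative_policy_evaluation(P, R, pi, gamma=0.9, tol=1e-10):
    """Repeat the backup until the largest change in V falls below tol."""
    V = np.zeros(P.shape[0])
    while True:
        V_new = bellman_expectation_backup(V, P, R, pi, gamma)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```

For a single state that loops onto itself with reward 1 and γ = 0.5, this converges to V = 1 / (1 - 0.5) = 2, which is the fixed point of the equation.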

Bellman Optimality Equations

V^*(s) = max_a [R(s,a) + γ Σ_s' P(s'|s,a) V^*(s')]

Q^*(s,a) = R(s,a) + γ Σ_s' P(s'|s,a) max_a' Q^*(s',a')

Optimal values satisfy these fixed-point equations.

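Policy Iteration
  1. Evaluate V^π (solve the linear Bellman expectation system)
  2. Improve π: act greedily with respect to V^π
  3. Repeat until the policy stops changing

Value Iteration
  1. Initialize V(s) = 0
  2. V(s) ← max_a [R(s,a) + γ Σ_s' P(s'|s,a) V(s')]
  3. Repeat until the largest change is below a threshold; V converges to V^*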
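
The policy iteration steps listed above can be sketched for a small tabular MDP as follows. This is a minimal illustration, assuming P is an (S, A, S) array of transition probabilities and R is an (S, A) array of expected rewards; the function name `policy_iteration` is hypothetical.

```python
import numpy as np


def policy_iteration(P, R, gamma=0.9):
    """Tabular policy iteration.

    P: array (S, A, S) with P[s, a, s2] = P(s2 | s, a)
    R: array (S, A) with R[s, a] = expected immediate reward
    Returns a deterministic policy (action index per state) and its values.
    """
    n_states = P.shape[0]
    idx = np.arange(n_states)
    policy = np.zeros(n_states, dtype=int)
    while True:
        # 1. Evaluate: solve (I - gamma * P_pi) V = R_pi exactly
        P_pi = P[idx, policy]          # shape (S, S)
        R_pi = R[idx, policy]          # shape (S,)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # 2. Improve: act greedily with respect to V
        Q = R + gamma * (P @ V)        # shape (S, A)
        new_policy = np.argmax(Q, axis=1)
        # 3. Stop once the greedy policy no longer changes
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy
```

The exact solve in step 1 is practical only for small state spaces; larger problems usually replace it with a few sweeps of iterative evaluation.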
Value Iteration (Gridworld)
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-6):
    # P[s, a, s2]: transition probability; R[s, a, s2]: reward. Both have shape (S, A, S).
    n_states = P.shape[0]
    n_actions = P.shape[1]
    V = np.zeros(n_states)

    def q_value(s, a):
        # Expected one-step return of taking action a in state s, then following V
        return sum(P[s, a, s2] * (R[s, a, s2] + gamma * V[s2])
                   for s2 in range(n_states))

    while True:
        delta = 0
        for s in range(n_states):
            v = V[s]
            # Bellman optimality backup
            V[s] = max(q_value(s, a) for a in range(n_actions))
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            break
    # Extract the greedy policy from the converged values
    policy = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        policy[s] = np.argmax([q_value(s, a) for a in range(n_actions)])
    return policy, V

Model-Free Learning: Monte Carlo & TD

When dynamics (P,R) are unknown, learn from experience.

Monte Carlo (MC)

Wait for complete episodes, then average the observed returns.

V(s) ← V(s) + α [Gₜ - V(s)]

Unbiased but high variance; updates are only possible once an episode has ended.

Temporal Difference, TD(0)

Bootstrap: V(s) ← V(s) + α [r + γV(s') - V(s)]

Lower variance, but biased because the target relies on the current estimate V(s').

TD Error: δ = r + γV(s') - V(s)
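
The two update rules differ only in their target: Monte Carlo uses the full return Gₜ, while TD(0) uses the one-step bootstrapped target. A minimal sketch, assuming V is a plain list indexed by integer state ids (the function names are illustrative):

```python
def mc_update(V, s, G, alpha=0.1):
    """Monte Carlo: move V[s] toward the full observed return G."""
    V[s] += alpha * (G - V[s])


def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9, terminal=False):
    """TD(0): move V[s] toward the bootstrapped target r + gamma * V[s_next]."""
    # No bootstrapping from a terminal state: its value is zero by definition
    target = r if terminal else r + gamma * V[s_next]
    delta = target - V[s]          # TD error
    V[s] += alpha * delta
    return delta
```

`td0_update` can be called after every environment step, whereas `mc_update` needs the return, which is only known at the end of the episode.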

TD(λ) / Eligibility Traces

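Unifies MC and TD: λ = 0 recovers TD(0), and λ = 1 approaches Monte Carlo. The eligibility trace e(s) records how recently and how often each state was visited, so a single TD error δ assigns credit across multiple earlier steps.

V(s) ← V(s) + α δ e(s), applied to all states s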
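
A minimal sketch of this update with accumulating traces is shown below. It assumes integer state ids, a list-based value table, and an episode given as (state, reward, next_state, terminal) tuples; these conventions and the name `td_lambda_episode` are assumptions for illustration.

```python
def td_lambda_episode(V, transitions, alpha=0.1, gamma=0.9, lam=0.8):
    """Run backward-view TD(lambda) over one episode, updating V in place.

    V:           list of state values indexed by state id
    transitions: list of (state, reward, next_state, terminal) tuples
    """
    e = [0.0] * len(V)                       # one eligibility trace per state
    for s, r, s_next, terminal in transitions:
        target = r if terminal else r + gamma * V[s_next]
        delta = target - V[s]                # TD error for this step
        e[s] += 1.0                          # accumulate trace for the visited state
        for i in range(len(V)):
            V[i] += alpha * delta * e[i]     # credit every recently visited state
            e[i] *= gamma * lam              # decay all traces
    return V
```

With `lam=0.0` only the state just visited is updated, which matches TD(0); with larger values earlier states also receive part of each TD error.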

Q-Learning & SARSA

Q-Learning (Off-Policy)

Q(s,a) ← Q(s,a) + α [r + γ max_a' Q(s',a') - Q(s,a)]

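Learns the optimal Q* regardless of the behavior policy, provided every state-action pair keeps being visited. The max over next actions makes the target independent of the action actually taken next.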

Exploration: ε-greedy
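
A minimal sketch of ε-greedy action selection, assuming the Q-values for the current state are given as a plain list. The function name and the optional `rng` argument are illustrative choices.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a uniformly random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))      # explore
    best = max(q_values)
    return q_values.index(best)                  # exploit; ties go to the lowest index
```

In practice ε is often decayed over training so that the agent explores widely at first and exploits more later.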

SARSA (On-Policy)

Q(s,a) ← Q(s,a) + α [r + γ Q(s',a') - Q(s,a)]

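Learns Q for the policy the agent actually follows, exploration included. This tends to produce more conservative behavior, which can be safer when learning on a live system.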
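
The difference between the two methods comes down to the bootstrap term in the target. The sketch below places them side by side, assuming Q is a nested list indexed as Q[state][action]; the function names are illustrative.

```python
def q_learning_target(Q, r, s_next, gamma, terminal=False):
    """Off-policy target: bootstrap from the best action in the next state."""
    return r if terminal else r + gamma * max(Q[s_next])


def sarsa_target(Q, r, s_next, a_next, gamma, terminal=False):
    """On-policy target: bootstrap from the action actually taken next."""
    return r if terminal else r + gamma * Q[s_next][a_next]


def apply_update(Q, s, a, target, alpha):
    """Shared update rule: move Q[s][a] a fraction alpha toward the target."""
    Q[s][a] += alpha * (target - Q[s][a])
```

When the next action happens to be the greedy one, both targets coincide; they differ only on exploratory steps.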

Q-Learning for FrozenLake (Gymnasium)
import gymnasium as gym
import numpy as np

env = gym.make('FrozenLake-v1', is_slippery=True)
n_states = env.observation_space.n
n_actions = env.action_space.n
Q = np.zeros((n_states, n_actions))

alpha = 0.1
gamma = 0.99
epsilon = 0.1
episodes = 10000

for episode in range(episodes):
    state, _ = env.reset()
    done = False
    
    while not done:
        # ε-greedy
        if np.random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q[state])
        
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        
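        # Q-Learning update. Bootstrap unless the episode truly terminated;
        # a time-limit truncation should still bootstrap from next_state.
        best_next = np.max(Q[next_state])
        td_target = reward + gamma * best_next * (not terminated)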
        td_error = td_target - Q[state, action]
        Q[state, action] += alpha * td_error
        
        state = next_state

# Evaluate
state, _ = env.reset()
done = False
total_reward = 0
while not done:
    action = np.argmax(Q[state])
    state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
    total_reward += reward
print(f"Test reward: {total_reward}")

Deep Q-Networks (DQN)

When the state space is continuous or high-dimensional, a Q-table becomes impractical, so a neural network is used to approximate the Q-function.

DQN Innovations
  • Experience Replay: Store transitions (s,a,r,s') in buffer, sample randomly. Breaks correlation.
  • Target Network: Fixed Q_target for TD target. Updated periodically.
  • Error Clipping: Huber loss bounds the gradient from large TD errors, improving stability.
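
The experience replay idea from the list above can be sketched with the standard library alone. The class below is a minimal illustration, not the buffer from any particular library; the names `ReplayBuffer`, `push`, and `sample` are assumptions.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (s, a, r, s', done) transitions."""

    def __init__(self, capacity):
        # A bounded deque drops the oldest transition once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks temporal correlation
        batch = random.sample(list(self.buffer), batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```

A training loop would typically convert the sampled tuples to batched tensors before passing them to the network.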
DQN Training Loop (PyTorch)
import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim)
        )
    
    def forward(self, x):
        return self.net(x)

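# Training step. Assumes policy_net, target_net, optimizer, replay_buffer, batch_size,
# and gamma are defined elsewhere, and that sample() returns batched tensors:
# actions as int64 of shape (B, 1); rewards and dones as float of shape (B, 1).
def optimize_dqn():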
    if len(replay_buffer) < batch_size:
        return
    states, actions, rewards, next_states, dones = replay_buffer.sample(batch_size)
    
    # Compute Q(s, a)
    q_values = policy_net(states).gather(1, actions)
    
    # Compute target: r + γ max_a' Q_target(s', a')
    with torch.no_grad():
        next_q_values = target_net(next_states).max(1, keepdim=True)[0]
        targets = rewards + gamma * next_q_values * (1 - dones)
    
    loss = nn.HuberLoss()(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

DQN Variants: Double DQN (reduces overestimation), Dueling DQN (separate value and advantage streams), PER (prioritized experience replay), Rainbow (combines these with further improvements).
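
To make the Double DQN change concrete, the sketch below compares the two targets using NumPy arrays in place of network outputs. The inputs `next_q_online` and `next_q_target` stand for the Q-values that the online and target networks would produce for a batch of next states; these names and shapes are assumptions for illustration.

```python
import numpy as np


def dqn_target(rewards, next_q_target, dones, gamma=0.99):
    """Standard DQN: the target network both selects and evaluates a'."""
    return rewards + gamma * next_q_target.max(axis=1) * (1.0 - dones)


def double_dqn_target(rewards, next_q_online, next_q_target, dones, gamma=0.99):
    """Double DQN: the online network selects a', the target network evaluates it."""
    best_actions = next_q_online.argmax(axis=1)
    rows = np.arange(len(best_actions))
    evaluated = next_q_target[rows, best_actions]
    return rewards + gamma * evaluated * (1.0 - dones)
```

Because the action is chosen by one network and scored by the other, the Double DQN target is never larger than the standard one for the same inputs.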

Policy Gradient: REINFORCE

Directly optimize policy π(a|s; θ) using gradient ascent on expected return.

Policy Gradient Theorem

∇J(θ) = E_π [∇log π(a|s; θ) · Q^π(s,a)]

REINFORCE: Monte Carlo estimate of Q^π using Gₜ.

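# REINFORCE update (PyTorch sketch). Assumes log_probs[t] = log π(a[t]|s[t]; θ) as scalar
# tensors, r[t] as floats, and an existing optimizer and gamma.
returns, G = [], 0.0
for reward in reversed(r):          # G_t = r[t] + gamma * G_{t+1}, built backwards
    G = reward + gamma * G
    returns.insert(0, G)
loss = -(torch.stack(log_probs) * torch.tensor(returns)).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()
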
Advantage: Reduce Variance

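Subtract a baseline b(s) from the return: ∇log π · (Gₜ - b(s)). The gradient estimate stays unbiased because b(s) does not depend on the action. A common choice is the state value V(s).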

A(s,a) = Q(s,a) - V(s) = advantage function.

Actor-Critic Methods

Combine policy-based (actor) and value-based (critic) learning. The actor updates the policy in the direction suggested by the critic.

A2C / A3C

Actor: ∇log π(a|s) * A(s,a)

Critic: TD error δ = r + γV(s') - V(s), which also serves as a one-step estimate of A(s,a)

A3C: Asynchronous parallel workers. A2C: synchronous.
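
The actor and critic terms above can be written out for a single transition as plain arithmetic. This sketch assumes scalar inputs (the log-probability of the chosen action and the critic's two value estimates); the function name is hypothetical.

```python
def one_step_actor_critic_losses(log_prob, value, next_value, reward,
                                 gamma=0.99, terminal=False):
    """Return (actor_loss, critic_loss, td_error) for a single transition."""
    bootstrap = 0.0 if terminal else gamma * next_value
    td_error = reward + bootstrap - value     # delta, used as the advantage estimate
    actor_loss = -log_prob * td_error         # minimizing this ascends log pi * advantage
    critic_loss = td_error ** 2               # pushes V(s) toward r + gamma * V(s')
    return actor_loss, critic_loss, td_error
```

In an autodiff framework the TD error inside the actor loss is normally treated as a constant, so that the actor's gradient does not flow into the critic.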

PPO – Proximal Policy Optimization

A clipped surrogate objective prevents excessively large policy updates.

L^CLIP(θ) = E[min(r(θ) A, clip(r(θ), 1-ε, 1+ε) A)], where r(θ) = π_θ(a|s) / π_θ_old(a|s) is the probability ratio between the new and old policies.
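
A minimal NumPy sketch of this objective follows, assuming the probability ratio r(θ) is computed from log-probabilities under the new and old policies. The function name and the default `clip_eps=0.2` are illustrative choices.

```python
import numpy as np


def ppo_clip_objective(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (a quantity to maximize)."""
    ratio = np.exp(log_probs_new - log_probs_old)      # r(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # The elementwise minimum gives a pessimistic bound on the improvement
    return np.mean(np.minimum(unclipped, clipped))
```

With a positive advantage, a ratio above 1 + ε earns no extra credit; with a negative advantage, a ratio below 1 - ε earns no extra credit either, so the policy has no incentive to move far from the old one in a single update.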

A common default choice in practice; OpenAI adopted it as its default RL algorithm.

SAC – Soft Actor-Critic

Maximize reward + entropy → better exploration.

J(π) = Σₜ E[r(sₜ,aₜ) + α H(π(·|sₜ))], where the temperature α trades off reward against entropy.
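
To show what the entropy bonus adds, the sketch below computes it for a discrete action distribution. SAC is usually applied to continuous actions, so the discrete case is used here only to keep the arithmetic readable; the function names and the default `alpha=0.2` are assumptions.

```python
import math


def policy_entropy(probs):
    """Shannon entropy H = -sum_a pi(a|s) * log pi(a|s) of a discrete policy."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def soft_reward(reward, probs, alpha=0.2):
    """Entropy-augmented reward r + alpha * H(pi(.|s)) for a single step."""
    return reward + alpha * policy_entropy(probs)
```

A uniform policy receives the largest bonus and a deterministic one receives none, which is what pushes the agent to keep exploring.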

A strong, widely used baseline for continuous control.

DDPG / TD3

Deterministic policy gradients for continuous actions. TD3 = DDPG + twin critics + delayed policy updates + target policy smoothing.
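
The twin-critic and target-smoothing ideas both show up in how the TD3 target is formed. The sketch below assumes the target critics are passed in as callables that map an action to a Q-value for a fixed next state; that interface, the function name, and the default noise settings are assumptions for illustration.

```python
import numpy as np


def td3_target(reward, done, target_action, target_q1, target_q2,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=1.0, rng=None):
    """Compute r + gamma * (1 - done) * min(Q1', Q2') at a smoothed target action."""
    if rng is None:
        rng = np.random.default_rng()
    # Target policy smoothing: add clipped Gaussian noise to the target action
    noise = rng.normal(0.0, noise_std, size=np.shape(target_action))
    noise = np.clip(noise, -noise_clip, noise_clip)
    smoothed = np.clip(target_action + noise, -max_action, max_action)
    # Twin critics: use the smaller estimate to curb overestimation
    q_min = np.minimum(target_q1(smoothed), target_q2(smoothed))
    return reward + gamma * (1.0 - done) * q_min
```

Delayed policy updates are not shown here: they simply mean the actor and the target networks are updated less often than the critics.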

Practical RL with Stable-Baselines3

A widely used RL library that provides tested implementations of PPO, SAC, DQN, and other algorithms.

PPO with Stable-Baselines3
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# Create environment
env = make_vec_env('CartPole-v1', n_envs=4)

# Initialize PPO
model = PPO(
    policy='MlpPolicy',
    env=env,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    verbose=1
)

# Train
model.learn(total_timesteps=100000)

# Save and load
model.save("ppo_cartpole")
model = PPO.load("ppo_cartpole")

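# Evaluate. Note the vectorized env API: reset() returns only obs, step() returns
# four values, and finished sub-environments are reset automatically.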
obs = env.reset()
for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)

Multi-Agent & Advanced RL

MARL

Multiple agents: cooperative, competitive, or mixed.

VDN, QMIX, MADDPG.

Inverse RL

Infer reward function from expert demonstrations.

Hierarchical RL

Uses options (temporally extended actions) to achieve temporal abstraction.

RL Algorithm Comparison

Algorithm | Type | Action Space | Policy | Stability + Sample Efficiency (stars combined)
Q-Learning | Value | Discrete | Off-policy | ⭐⭐⭐⭐
DQN | Value | Discrete | Off-policy | ⭐⭐⭐⭐⭐⭐
REINFORCE | Policy | Both | On-policy | ⭐⭐
A2C/A3C | Actor-Critic | Both | On-policy | ⭐⭐⭐⭐⭐
PPO | Actor-Critic | Both | On-policy | ⭐⭐⭐⭐⭐⭐⭐
SAC | Actor-Critic | Continuous | Off-policy | ⭐⭐⭐⭐⭐⭐⭐⭐
TD3 | Actor-Critic | Continuous | Off-policy | ⭐⭐⭐⭐⭐⭐⭐⭐

RL in the Wild

Games

AlphaGo, Dota 2, StarCraft II

Robotics

Manipulation, locomotion

Drug Discovery

Molecule generation

Finance

Portfolio optimization

Tools: OpenAI Gym, Gymnasium, MuJoCo, PyBullet, Unity ML-Agents

Reinforcement Learning Cheatsheet

MDP (S,A,P,R,γ)
V(s) state value
Q(s,a) action value
TD bootstrap
DQN replay + target
PPO clipped surrogate
SAC max entropy
A2C advantage