Reinforcement Learning
RL fundamentals, agents, rewards, and deep reinforcement learning concepts.
Reinforcement Learning: Learn by Interaction
The Reinforcement Learning Framework
RL is formalized as a Markov Decision Process (MDP): the agent observes state sₜ, takes action aₜ, receives reward rₜ₊₁, and transitions to the next state sₜ₊₁. Goal: maximize cumulative discounted reward.
The agent learns to map states to actions to maximize the return Gₜ = Σₖ γᵏ rₜ₊ₖ₊₁ (summed over k ≥ 0), where γ ∈ [0, 1] is the discount factor.
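As a small illustration of the return Gₜ, the sketch below computes the discounted sum for one finite episode. The function name `discounted_return` and the sample rewards are made up for this example.

```python
def discounted_return(rewards, gamma=0.99):
    """Return G_0 = sum_k gamma**k * rewards[k] for one finite episode."""
    g = 0.0
    # Work backwards so each step is one multiply-add: G_t = r_{t+1} + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

A smaller γ makes the agent care less about rewards that arrive later in the episode.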
Bellman Equations & Dynamic Programming
Bellman Expectation Equations
V^π(s) = Σ π(a|s) [R(s,a) + γ Σ P(s'|s,a) V^π(s')]
Q^π(s,a) = R(s,a) + γ Σ P(s'|s,a) Σ π(a'|s') Q^π(s',a')
Recursive decomposition of value.
Bellman Optimality Equations
V^*(s) = max_a [R(s,a) + γ Σ P(s'|s,a) V^*(s')]
Q^*(s,a) = R(s,a) + γ Σ P(s'|s,a) max_a' Q^*(s',a')
Optimal values satisfy these fixed-point equations.
Policy Iteration
- Evaluate V^π (solve linear system)
- Improve π: greedy wrt V^π
- Repeat until convergence
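The three steps above can be sketched for a small tabular MDP as follows. This is a minimal sketch, not a reference implementation: it assumes `P` has shape (states, actions, states) and `R` has shape (states, actions), matching R(s,a) in the equations, and the two-state MDP at the bottom is invented for illustration.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """P[s, a, s2]: transition probabilities; R[s, a]: expected reward (assumed shapes)."""
    n_states = P.shape[0]
    idx = np.arange(n_states)
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly
        P_pi = P[idx, policy]            # shape (n_states, n_states)
        R_pi = R[idx, policy]            # shape (n_states,)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to V
        Q = R + gamma * (P @ V)          # shape (n_states, n_actions)
        new_policy = np.argmax(Q, axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy

# Toy MDP (hypothetical): state 1 is absorbing and pays 2 per step under action 0
P = np.zeros((2, 2, 2))
P[0, 0, 0] = 1.0   # state 0, action 0: stay in state 0
P[0, 1, 1] = 1.0   # state 0, action 1: move to state 1
P[1, :, 1] = 1.0   # state 1: stay in state 1 under both actions
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
policy, V = policy_iteration(P, R, gamma=0.9)
print(policy, V)   # policy is [1, 0]; V is approximately [19, 20]
```

The exact linear solve is only practical for small state spaces; for larger ones, evaluation is usually done iteratively.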
Value Iteration
- Initialize V(s)=0
- V(s) ← max_a [R(s,a) + γ Σ P(s'|s,a) V(s')]
- Converges to V^*
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-6):
    # P[s, a, s_next]: transition probabilities; R[s, a, s_next]: rewards
    n_states = P.shape[0]
    n_actions = P.shape[1]
    V = np.zeros(n_states)
    while True:
        delta = 0
        for s in range(n_states):
            v = V[s]
            # Bellman optimality backup
            V[s] = max([sum([P[s, a, s_next] * (R[s, a, s_next] + gamma * V[s_next])
                             for s_next in range(n_states)]) for a in range(n_actions)])
            delta = max(delta, abs(v - V[s]))
        # Stop once the largest change in one sweep is below theta
        if delta < theta:
            break
    # Extract the greedy policy
    policy = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        policy[s] = np.argmax([sum([P[s, a, s_next] * (R[s, a, s_next] + gamma * V[s_next])
                                    for s_next in range(n_states)]) for a in range(n_actions)])
    return policy, V
Model-Free Learning: Monte Carlo & TD
When dynamics (P,R) are unknown, learn from experience.
Monte Carlo (MC)
Complete episodes, average returns.
V(s) ← V(s) + α [Gₜ - V(s)]
High variance, unbiased.
Temporal Difference (TD(0))
Bootstrap: V(s) ← V(s) + α [r + γV(s') - V(s)]
Lower variance, biased.
TD Error: δ = r + γV(s') - V(s)
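To make the difference between the two updates concrete, here is a minimal tabular sketch. The function names, the dict-based value table, and the sample numbers are assumptions made for illustration.

```python
def mc_update(V, state, G, alpha=0.1):
    """Monte Carlo: move V[state] toward the observed full return G."""
    V[state] += alpha * (G - V[state])

def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.99, terminal=False):
    """TD(0): move V[state] toward the bootstrapped target r + gamma * V[next_state]."""
    target = reward + (0.0 if terminal else gamma * V[next_state])
    delta = target - V[state]          # TD error
    V[state] += alpha * delta
    return delta

V = {"A": 0.0, "B": 1.0}
delta = td0_update(V, "A", reward=0.5, next_state="B", alpha=0.5, gamma=0.9)
print(delta, V["A"])   # TD error is about 1.4, so V["A"] moves to about 0.7
```

The MC update has to wait for the episode to finish before G is known, while the TD(0) update can be applied after every single step.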
TD(λ) / Eligibility Traces
Unify MC and TD. Credit assignment over multiple steps.
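V(s) ← V(s) + α δ e(s) for every state s, where the trace e(s) is incremented when s is visited and decays by γλ each step.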
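A minimal sketch of the backward view with accumulating traces is shown below. The transition tuple format and the function name `td_lambda_episode` are assumptions for this example, not part of any library.

```python
def td_lambda_episode(V, transitions, alpha=0.1, gamma=0.99, lam=0.9):
    """Backward-view TD(lambda) with accumulating traces over one episode.

    transitions: list of (state, reward, next_state, terminal) tuples (assumed format).
    V: dict mapping state -> value estimate, updated in place.
    """
    e = {s: 0.0 for s in V}                      # eligibility trace per state
    for state, reward, next_state, terminal in transitions:
        target = reward + (0.0 if terminal else gamma * V[next_state])
        delta = target - V[state]                # TD error
        e[state] += 1.0                          # mark the visited state
        for s in V:
            V[s] += alpha * delta * e[s]         # credit recently visited states
            e[s] *= gamma * lam                  # decay all traces
    return V

episode = [("A", 0.0, "B", False), ("B", 1.0, "C", True)]
print(td_lambda_episode({"A": 0.0, "B": 0.0, "C": 0.0}, episode, alpha=0.5, gamma=1.0, lam=1.0))
print(td_lambda_episode({"A": 0.0, "B": 0.0, "C": 0.0}, episode, alpha=0.5, gamma=1.0, lam=0.0))
```

With λ = 1 the final reward is credited to both A and B, as in Monte Carlo; with λ = 0 only B, the state visited just before the reward, is updated, as in TD(0).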
Q-Learning & SARSA
Q-Learning (Off-Policy)
Q(s,a) ← Q(s,a) + α [r + γ max_a' Q(s',a') - Q(s,a)]
Learns the optimal Q* regardless of the behavior policy, provided every state-action pair keeps being visited. Uses max.
Exploration: ε-greedy
SARSA (On-Policy)
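Q(s,a) ← Q(s,a) + α [r + γ Q(s',a') - Q(s,a)], where a' is the action the agent actually takes in s'.
Learns Q for the behavior policy, exploration included, so it tends to be more conservative near costly states. Often more stable and safer for live systems.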
import gymnasium as gym
import numpy as np

env = gym.make('FrozenLake-v1', is_slippery=True)
n_states = env.observation_space.n
n_actions = env.action_space.n
Q = np.zeros((n_states, n_actions))
alpha = 0.1
gamma = 0.99
epsilon = 0.1
episodes = 10000

for episode in range(episodes):
    state, _ = env.reset()
    done = False
    while not done:
        # ε-greedy
        if np.random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Q-Learning update. Bootstrap unless the episode truly terminated;
        # a time-limit truncation is not a terminal state.
        best_next = np.max(Q[next_state])
        td_target = reward + gamma * best_next * (not terminated)
        td_error = td_target - Q[state, action]
        Q[state, action] += alpha * td_error
        state = next_state

# Evaluate the greedy policy for one episode
state, _ = env.reset()
done = False
total_reward = 0
while not done:
    action = int(np.argmax(Q[state]))
    state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
    total_reward += reward
print(f"Test reward: {total_reward}")
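The loop above uses the Q-learning target. For comparison, the sketch below puts the SARSA and Q-learning updates side by side on a plain list-of-lists table. The helper names and the toy Q values are assumptions for illustration.

```python
import random

def epsilon_greedy(Q, state, n_actions, epsilon, rng=random):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[state][a])

def sarsa_update(Q, s, a, r, s_next, a_next, terminated, alpha=0.1, gamma=0.99):
    """On-policy target: uses the action a_next actually chosen in s_next."""
    target = r + (0.0 if terminated else gamma * Q[s_next][a_next])
    Q[s][a] += alpha * (target - Q[s][a])

def q_learning_update(Q, s, a, r, s_next, terminated, alpha=0.1, gamma=0.99):
    """Off-policy target: uses the greedy value max over actions in s_next."""
    target = r + (0.0 if terminated else gamma * max(Q[s_next]))
    Q[s][a] += alpha * (target - Q[s][a])

Q_sarsa = [[0.0, 0.0], [1.0, 3.0]]
Q_qlearn = [[0.0, 0.0], [1.0, 3.0]]
# Suppose exploration picked the worse action 0 in the next state
sarsa_update(Q_sarsa, 0, 0, 0.0, 1, 0, False, alpha=0.5, gamma=1.0)
q_learning_update(Q_qlearn, 0, 0, 0.0, 1, False, alpha=0.5, gamma=1.0)
print(Q_sarsa[0][0], Q_qlearn[0][0])   # 0.5 1.5
```

In a full SARSA loop, `a_next` would be drawn with `epsilon_greedy` before the update and then reused as the next action, so the learned values reflect the exploratory behavior.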
Deep Q-Networks (DQN)
When state space is continuous/high-dimensional, use neural networks as Q-function approximators.
DQN Innovations
- Experience Replay: Store transitions (s,a,r,s') in buffer, sample randomly. Breaks correlation.
- Target Network: Fixed Q_target for TD target. Updated periodically.
- Gradient Clipping: Huber loss is quadratic for small TD errors and linear for large ones, which bounds the gradient magnitude for stability.
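The training step in this section calls `replay_buffer.sample(batch_size)` without showing the buffer itself. A minimal sketch of such a buffer, using only the standard library, might look like this. The class name and method names are assumptions, and converting the sampled tuples to tensors is left out.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO store of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks temporal correlation
        batch = random.sample(list(self.buffer), batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=3)
for t in range(5):
    buf.push(state=t, action=0, reward=1.0, next_state=t + 1, done=False)
print(len(buf))          # 3: only the newest transitions are kept
states, actions, rewards, next_states, dones = buf.sample(2)
```

A deque with `maxlen` keeps the example short; large buffers are more often backed by preallocated arrays.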
import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim)
        )

    def forward(self, x):
        return self.net(x)

# Training step. Assumes policy_net, target_net, optimizer, replay_buffer,
# batch_size, and gamma are defined elsewhere, and that replay_buffer.sample
# returns tensors with actions (int64), rewards, and dones (float) shaped (batch_size, 1).
def optimize_dqn():
    if len(replay_buffer) < batch_size:
        return
    states, actions, rewards, next_states, dones = replay_buffer.sample(batch_size)
    # Compute Q(s, a) for the actions that were taken
    q_values = policy_net(states).gather(1, actions)
    # Compute target: r + γ max_a' Q_target(s', a'), with no bootstrap after terminal states
    with torch.no_grad():
        next_q_values = target_net(next_states).max(1, keepdim=True)[0]
        targets = rewards + gamma * next_q_values * (1 - dones)
    loss = nn.HuberLoss()(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
Policy Gradient: REINFORCE
Directly optimize policy π(a|s; θ) using gradient ascent on expected return.
Policy Gradient Theorem
∇J(θ) = E_π [∇log π(a|s; θ) · Q^π(s,a)]
REINFORCE: Monte Carlo estimate of Q^π using Gₜ.
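# REINFORCE update for one episode. Assumes rewards[t] (floats),
# log_probs[t] = log π(a[t]|s[t]; θ) (scalar tensors), gamma, and optimizer exist.
G, returns = 0.0, []
for r in reversed(rewards):          # discounted return-to-go Gₜ
    G = r + gamma * G
    returns.insert(0, G)
loss = -sum(lp * G_t for lp, G_t in zip(log_probs, returns))
optimizer.zero_grad()
loss.backward()
optimizer.step()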
Advantage: Reduce Variance
Use baseline b(s): ∇log π · (Gₜ - b(s)). Common: state-value V(s).
A(s,a) = Q(s,a) - V(s) = advantage function.
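As a rough illustration of baseline subtraction, the sketch below estimates advantages as return-to-go minus a value estimate. The `values` list stands in for a learned V(s), and the numbers are invented.

```python
def returns_to_go(rewards, gamma=0.99):
    """G_t for every step t of one episode, computed backwards."""
    out, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]

def advantages(rewards, values, gamma=0.99):
    """A_t estimated as G_t - V(s_t), with values[t] a baseline estimate of V(s_t)."""
    return [g - v for g, v in zip(returns_to_go(rewards, gamma), values)]

adv = advantages([1.0, 1.0, 1.0], values=[2.5, 2.0, 0.5], gamma=1.0)
print(adv)   # [0.5, 0.0, 0.5]
```

Steps that did about as well as the baseline predicted get an advantage near zero, so they contribute little to the gradient, which is where the variance reduction comes from.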
Actor-Critic Methods
Combine policy-based (actor) and value-based (critic) learning. Actor updates policy in direction suggested by critic.
A2C / A3C
Actor: ∇log π(a|s) * A(s,a)
Critic: TD error δ = r + γV(s') - V(s)
A3C: Asynchronous parallel workers. A2C: synchronous.
PPO – Proximal Policy Optimization
Clipped surrogate objective prevents too large policy updates.
L^CLIP(θ) = E[min(r(θ) A, clip(r(θ), 1-ε, 1+ε) A)]
Widely used as a default choice for on-policy RL, including at OpenAI.
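The clipped objective can be checked by hand on a tiny batch. The sketch below is plain Python rather than a training-ready loss. The function name and the sample ratios and advantages are assumptions, with r(θ) taken as the probability ratio between the new and old policy.

```python
def ppo_clip_objective(ratios, advantages, eps=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over a batch."""
    total = 0.0
    for r, adv in zip(ratios, advantages):
        clipped = max(1.0 - eps, min(r, 1.0 + eps))
        total += min(r * adv, clipped * adv)
    return total / len(ratios)

# r=1.5, A=+1: gain capped at 1.2; r=0.5, A=-1: pessimistic value -0.8 is kept
value = ppo_clip_objective([1.5, 0.5, 1.0], [1.0, -1.0, 2.0], eps=0.2)
print(value)   # approximately 0.8
```

Taking the minimum means the objective never rewards moving the ratio further outside [1-ε, 1+ε] in the direction the advantage favors.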
SAC – Soft Actor-Critic
Maximize reward + entropy → better exploration.
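J(π) = Σₜ E[r(sₜ,aₜ) + α H(π(·|sₜ))], where H is the policy entropy and the temperature α trades off reward against entropy.
A strong, widely used baseline for continuous control.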
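To show what the entropy bonus does, the sketch below computes the entropy-augmented reward for a discrete action distribution. SAC itself is usually applied to continuous actions, so the discrete case and the helper names here are simplifying assumptions.

```python
import math

def entropy(probs):
    """Shannon entropy H(pi) = -sum_a pi(a) log pi(a) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def soft_reward(reward, probs, alpha=0.2):
    """Entropy-augmented reward r + alpha * H(pi(.|s))."""
    return reward + alpha * entropy(probs)

uniform = soft_reward(1.0, [0.5, 0.5], alpha=0.2)        # 1 + 0.2 * log(2)
deterministic = soft_reward(1.0, [1.0, 0.0], alpha=0.2)  # no entropy bonus
print(uniform, deterministic)
```

For the same environment reward, the more random policy scores higher, which is what keeps the agent exploring.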
DDPG / TD3
Deterministic policy gradients for continuous actions. TD3 = DDPG + twin critics (clipped double Q-learning) + delayed policy updates + target policy smoothing.
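Twin critics and target policy smoothing both act on the TD target, which the following sketch computes for a scalar action. The callables `actor_target`, `q1_target`, and `q2_target` are hypothetical stand-ins for target networks, and the noise settings are illustrative defaults.

```python
import random

def td3_target(reward, next_state, done, actor_target, q1_target, q2_target,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=1.0, rng=random):
    """TD3 target for a scalar action (sketch with plain callables)."""
    # Target policy smoothing: add clipped Gaussian noise to the target action
    noise = max(-noise_clip, min(rng.gauss(0.0, noise_std), noise_clip))
    action = max(-max_action, min(actor_target(next_state) + noise, max_action))
    # Twin critics: take the smaller estimate to curb overestimation
    q_next = min(q1_target(next_state, action), q2_target(next_state, action))
    return reward + gamma * (1.0 - float(done)) * q_next

y = td3_target(
    reward=1.0, next_state=0.0, done=False,
    actor_target=lambda s: 0.5,
    q1_target=lambda s, a: 10.0,
    q2_target=lambda s, a: 8.0,
    gamma=0.5,
)
print(y)   # 1 + 0.5 * min(10, 8) = 5.0
```

Both critics are then regressed toward this single target, while the actor is updated less often than the critics.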
Practical RL with Stable-Baselines3
Widely used library for RL. Provides tested implementations of PPO, SAC, DQN, etc.
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# Create environment
env = make_vec_env('CartPole-v1', n_envs=4)

# Initialize PPO
model = PPO(
    policy='MlpPolicy',
    env=env,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    verbose=1
)

# Train
model.learn(total_timesteps=100000)

# Save and load
model.save("ppo_cartpole")
model = PPO.load("ppo_cartpole")

# Evaluate (the vectorized env resets finished sub-environments automatically)
obs = env.reset()
for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
Multi-Agent & Advanced RL
MARL
Multiple agents: cooperative, competitive, or mixed.
VDN, QMIX, MADDPG.
Inverse RL
Infer reward function from expert demonstrations.
Hierarchical RL
Options, temporal abstraction.
RL Algorithm Comparison
| Algorithm | Type | Action Space | Policy | Stability | Sample Efficiency |
|---|---|---|---|---|---|
| Q-Learning | Value | Discrete | Off-policy | ★★ | ★★ |
| DQN | Value | Discrete | Off-policy | ★★★ | ★★★ |
| REINFORCE | Policy | Both | On-policy | ★ | ★ |
| A2C/A3C | Actor-Critic | Both | On-policy | ★★★ | ★★ |
| PPO | Actor-Critic | Both | On-policy | ★★★★ | ★★★ |
| SAC | Actor-Critic | Continuous | Off-policy | ★★★★ | ★★★★ |
| TD3 | Actor-Critic | Continuous | Off-policy | ★★★★ | ★★★★ |
RL in the Wild
Games
AlphaGo, Dota 2 (OpenAI Five), StarCraft II (AlphaStar)
Robotics
Manipulation, locomotion
Drug Discovery
Molecule generation
Finance
Portfolio optimization