Reinforcement Learning: Interview Q&A
20 core questions and short answers on RL: agents, environments, rewards, policies, value functions, and exploration.
Tags: Agent · Environment · Reward · Policy
1
What is reinforcement learning (RL)?
⚡ Beginner
Answer: RL is a learning paradigm where an agent learns by interacting with an environment to maximize cumulative reward over time.
2
What are the main components of an RL problem?
⚡ Beginner
Answer: Key components: states, actions, rewards, policy, value functions and environment dynamics.
3
What is a policy in RL?
⚡ Beginner
Answer: A policy is a mapping from states to actions (deterministic or stochastic) that defines the agent’s behavior.
4
What is the return in RL?
📊 Intermediate
Answer: Return is the cumulative reward from a time step onward, often a discounted sum \(G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k}\).
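The discounted sum above can be computed for a finite episode by folding from the last reward backward; the reward list and \(\gamma\) here are made-up illustration values.

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G_t = sum_k gamma^k * R_{t+k} for a finite reward list."""
    g = 0.0
    for r in reversed(rewards):  # fold backward: G_t = R_t + gamma * G_{t+1}
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

Folding backward avoids recomputing powers of \(\gamma\) and matches the recursive definition \(G_t = R_t + \gamma G_{t+1}\).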
5
What is the difference between value function and Q-function?
📊 Intermediate
Answer: The value function \(V^\pi(s)\) gives expected return from a state; the Q-function \(Q^\pi(s,a)\) gives expected return from state–action pairs.
6
What is the exploration–exploitation trade-off?
⚡ Beginner
Answer: The agent must explore new actions to discover good rewards while also exploiting known good actions to maximize reward.
7
What is epsilon-greedy exploration?
⚡ Beginner
Answer: With probability \(1-\epsilon\) choose the best-known action, and with probability \(\epsilon\) choose a random action.
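The rule above is a few lines of code; the Q-values passed in are hypothetical.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """Greedy action with prob 1-epsilon, uniform random action with prob epsilon."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

# With epsilon=0 the choice is always the greedy one:
print(epsilon_greedy([0.1, 0.7, 0.3], epsilon=0.0))  # 1
```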
8
What is Q-learning in one sentence?
📊 Intermediate
Answer: Q-learning is an off-policy temporal-difference algorithm that learns the optimal Q-function by bootstrapping from next-state estimates.
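The tabular update behind that sentence is \(Q(s,a) \leftarrow Q(s,a) + \alpha\,[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]\); a minimal sketch with made-up state and action names:

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step on a dict keyed by (state, action)."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    td_target = r + gamma * best_next          # bootstrap from next state
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (td_target - old)
    return Q

Q = q_learning_update({}, "s0", "left", 1.0, "s1", ["left", "right"])
print(Q[("s0", "left")])  # 0.0 + 0.1 * (1.0 + 0.9*0.0 - 0.0) = 0.1
```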
9
What is the Bellman equation for the optimal Q-function (informally)?
🔥 Advanced
Answer: It states that the optimal Q satisfies \(Q^*(s,a) = \mathbb{E}[R + \gamma \max_{a'} Q^*(s',a')]\), relating values of successive states.
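Applying that equation as a repeated backup is Q-value iteration; a sketch on a hypothetical two-state deterministic MDP (state "A", absorbing terminal "T"):

```python
def q_value_iteration(states, actions, step, gamma=0.9, sweeps=100):
    """Repeated Bellman-optimality backups on a deterministic MDP."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(sweeps):
        for s in states:
            for a in actions:
                r, s_next = step(s, a)  # deterministic transition
                Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    return Q

def step(s, a):
    if s == "T":
        return 0.0, "T"   # absorbing terminal state
    if a == "go":
        return 1.0, "T"   # reward 1 for reaching the goal
    return 0.0, "A"       # "stay" loops with no reward

Q = q_value_iteration(["A", "T"], ["go", "stay"], step)
print(round(Q[("A", "go")], 3), round(Q[("A", "stay")], 3))  # 1.0 0.9
```

Note how \(Q^*(A, \text{stay}) = \gamma \max_{a'} Q^*(A, a') = 0.9\): waiting one step discounts the eventual reward.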
10
What is the discount factor \(\gamma\) and why is it used?
📊 Intermediate
Answer: \(\gamma \in [0,1)\) weights future rewards less than immediate ones (exponentially by delay), keeping returns finite on infinite horizons and modeling time preference.
11
What is the difference between on-policy and off-policy learning?
🔥 Advanced
Answer: On-policy learns the value of the policy being executed; off-policy learns about a different target policy while possibly following another behavior policy.
12
What is a Markov Decision Process (MDP)?
🔥 Advanced
Answer: An MDP formalizes RL with a set of states, actions, transition probabilities, reward function and discount factor, satisfying the Markov property.
13
What is policy gradient in RL?
🔥 Advanced
Answer: Policy gradient methods directly optimize a parameterized policy by following the gradient of expected return with respect to its parameters.
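A minimal REINFORCE sketch (the simplest policy-gradient method, no baseline) on a hypothetical two-armed bandit with a softmax policy; arm means, learning rate, and step count are made up for illustration.

```python
import math, random

def softmax(theta):
    m = max(theta)
    e = [math.exp(t - m) for t in theta]
    z = sum(e)
    return [x / z for x in e]

def reinforce(arm_means, steps=5000, lr=0.1, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]                       # policy parameters
    for _ in range(steps):
        pi = softmax(theta)
        a = 0 if rng.random() < pi[0] else 1  # sample an action from pi
        r = rng.gauss(arm_means[a], 0.1)      # observe a noisy reward
        # theta += lr * r * grad log pi(a); for softmax,
        # d/dtheta_i log pi(a) = 1[i == a] - pi_i
        for i in range(2):
            theta[i] += lr * r * ((1.0 if i == a else 0.0) - pi[i])
    return softmax(theta)

pi = reinforce([0.0, 1.0])  # arm 1 pays more, so pi[1] should approach 1
```

Because the gradient is weighted by the sampled return, probability mass shifts toward the higher-paying arm without ever estimating a value function.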
14
What is the role of a replay buffer in deep RL (e.g., DQN)?
🔥 Advanced
Answer: A replay buffer stores past transitions so the agent can sample mini-batches i.i.d., reducing correlation and improving stability.
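A minimal replay buffer is a bounded deque plus uniform sampling; the transition tuples below are hypothetical.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, s_next, done) transitions."""
    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)  # oldest entries evicted first
        self.rng = random.Random(seed)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return self.rng.sample(list(self.buffer), batch_size)

buf = ReplayBuffer(capacity=2)
for t in [("s0", 0, 1.0, "s1", False),
          ("s1", 1, 0.0, "s2", True),
          ("s2", 0, 0.5, "s3", False)]:
    buf.push(t)
print(len(buf.buffer))  # 2 -- the oldest transition was evicted
```

Sampling uniformly from this buffer breaks the temporal correlation between consecutive transitions, which is the stability benefit the answer refers to.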
15
Why is RL often more sample-inefficient than supervised learning?
🔥 Advanced
Answer: Because the agent must actively explore, rewards are sparse or delayed, and the data distribution depends on the evolving policy.
16
Give some real-world applications of RL.
⚡ Beginner
Answer: Examples: game playing (AlphaGo), robotics control, recommendation systems, dynamic pricing.
17
When would you avoid using RL in practice?
📊 Intermediate
Answer: When you have plenty of labeled data and no sequential decision aspect, supervised/unsupervised methods are usually simpler and more reliable.
18
What is reward shaping and why is it tricky?
🔥 Advanced
Answer: Reward shaping modifies the reward signal to help learning, but poor shaping can change the optimal policy or encourage unintended behavior.
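One safe form is potential-based shaping, \(F(s, s') = \gamma\,\phi(s') - \phi(s)\), which provably preserves the optimal policy (Ng, Harada, and Russell, 1999); the "distance to goal" potential below is a hypothetical example.

```python
def shaped_reward(r, s, s_next, phi, gamma=0.9):
    """Potential-based shaping: r + gamma*phi(s') - phi(s)."""
    return r + gamma * phi(s_next) - phi(s)

# Hypothetical 1-D chain with the goal at position 3:
phi = lambda s: -abs(s - 3)           # potential rises as we near the goal
print(shaped_reward(0.0, 0, 1, phi))  # 0 + 0.9*(-2) - (-3) = 1.2
```

Arbitrary shaping terms do not have this guarantee, which is exactly why shaping is tricky.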
19
What is the key difference between model-free and model-based RL?
🔥 Advanced
Answer: Model-free methods learn values/policies directly from experience; model-based methods learn a transition/reward model and plan with it.
20
What is the key message to remember about RL?
⚡ Beginner
Answer: RL is about trial-and-error learning to make sequential decisions; mastering states, actions, rewards and value/policy functions gives a solid foundation.
Quick Recap: Reinforcement Learning
Think in loops: observe state, act, get reward, update—this mental model helps you explain and design RL solutions clearly.
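That loop can be sketched end to end on a hypothetical one-state, two-action bandit, combining epsilon-greedy action selection with an incremental value update; all rewards and hyperparameters are made up.

```python
import random

def run_loop(steps=2000, epsilon=0.1, alpha=0.1, seed=0):
    rng = random.Random(seed)
    Q = [0.0, 0.0]  # one state, two actions
    for _ in range(steps):
        # act: epsilon-greedy over current estimates
        a = rng.randrange(2) if rng.random() < epsilon else Q.index(max(Q))
        # observe: action 1 pays ~1.0, action 0 pays ~0.0
        r = rng.gauss(1.0 if a == 1 else 0.0, 0.1)
        # update: move the estimate toward the observed reward
        Q[a] += alpha * (r - Q[a])
    return Q

Q = run_loop()  # Q[1] should climb toward 1.0
```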