Reinforcement Learning Q&A: 20 Core Questions
Interview Prep


Short questions and answers on RL: agents, environments, rewards, policies, value functions and exploration.

1 What is reinforcement learning (RL)? ⚡ Beginner
Answer: RL is a learning paradigm where an agent learns by interacting with an environment to maximize cumulative reward over time.
2 What are the main components of an RL problem? ⚡ Beginner
Answer: Key components: states, actions, rewards, policy, value functions and environment dynamics.
3 What is a policy in RL? ⚡ Beginner
Answer: A policy is a mapping from states to actions (deterministic or stochastic) that defines the agent’s behavior.
4 What is the return in RL? 📊 Intermediate
Answer: Return is the cumulative reward from a time step onward, often a discounted sum \(G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}\).
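A minimal sketch of computing that discounted sum for a finite reward sequence (the backward recursion \(G_t = R_{t+1} + \gamma G_{t+1}\); the function name is illustrative):

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted return of a finite reward sequence, computed backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # G_t = R_{t+1} + gamma * G_{t+1}
    return g

# e.g. rewards [1, 1, 1] with gamma = 0.5 give 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```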
5 What is the difference between value function and Q-function? 📊 Intermediate
Answer: The value function \(V^\pi(s)\) gives expected return from a state; the Q-function \(Q^\pi(s,a)\) gives expected return from state–action pairs.
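The two are linked by \(V^\pi(s) = \sum_a \pi(a\mid s)\, Q^\pi(s,a)\), which a one-liner can illustrate (function name is illustrative):

```python
def state_value(q_row, policy_probs):
    """V^pi(s) as the policy-weighted average of Q^pi(s, a) over actions."""
    return sum(p * q for p, q in zip(policy_probs, q_row))

# Q-values [1.0, 3.0] under a uniform policy give V = 0.5*1 + 0.5*3 = 2.0
print(state_value([1.0, 3.0], [0.5, 0.5]))
```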
6 What is the exploration–exploitation trade-off? ⚡ Beginner
Answer: The agent must explore new actions to discover good rewards while also exploiting known good actions to maximize reward.
7 What is epsilon-greedy exploration? ⚡ Beginner
Answer: With probability \(1-\epsilon\) choose the best-known action, and with probability \(\epsilon\) choose a random action.
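A minimal sketch of that rule (function name is illustrative):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """With prob epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))          # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit
```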
8 What is Q-learning in one sentence? 📊 Intermediate
Answer: Q-learning is an off-policy temporal-difference algorithm that learns the optimal Q-function by bootstrapping from next-state estimates.
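A tabular sketch of the Q-learning update \(Q(s,a) \leftarrow Q(s,a) + \alpha\,[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]\) on a hypothetical 2-state toy MDP (states, rewards and exploration rate are made up for illustration):

```python
import numpy as np

# Toy MDP: from state 0, action 1 reaches terminal state 1 with reward 1;
# action 0 stays in state 0 with reward 0.
gamma, alpha = 0.9, 0.5
Q = np.zeros((2, 2))  # Q[state, action]
rng = np.random.default_rng(1)

for _ in range(200):  # episodes
    s = 0
    while s == 0:
        # epsilon-greedy behavior policy (off-policy: target is greedy)
        a = rng.integers(2) if rng.random() < 0.3 else int(Q[s].argmax())
        if a == 1:
            s_next, r, done = 1, 1.0, True
        else:
            s_next, r, done = 0, 0.0, False
        target = r + (0.0 if done else gamma * Q[s_next].max())
        Q[s, a] += alpha * (target - Q[s, a])  # TD update
        s = s_next
```

After training, `Q[0, 1]` approaches 1 (the immediate terminal reward) and dominates `Q[0, 0]`, so the greedy policy exits immediately.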
9 What is the Bellman equation for the optimal Q-function (informally)? 🔥 Advanced
Answer: It states that the optimal Q satisfies \(Q^*(s,a) = \mathbb{E}[R + \gamma \max_{a'} Q^*(s',a')]\), relating values of successive states.
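Repeatedly applying that equation as an update (value iteration on Q) converges to \(Q^*\). A sketch on a hypothetical deterministic 2-state MDP (transitions and rewards chosen arbitrarily for illustration):

```python
import numpy as np

# next_state[s, a] and reward[s, a] for a deterministic toy MDP:
# taking action 1 in state 0 yields reward 1 and moves to state 1.
next_state = np.array([[0, 1], [1, 0]])
reward = np.array([[0.0, 1.0], [0.0, 0.0]])
gamma = 0.9

Q = np.zeros((2, 2))
for _ in range(300):
    # Bellman optimality backup: Q(s,a) = r(s,a) + gamma * max_a' Q(s',a')
    Q = reward + gamma * Q[next_state].max(axis=2)
```

At the fixed point, cycling 0 → 1 → 0 gives \(Q^*(0,1) = 1/(1-\gamma^2) \approx 5.26\).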
10 What is the discount factor \(\gamma\) and why is it used? 📊 Intermediate
Answer: \(\gamma \in [0,1)\) makes future rewards worth less than immediate ones, ensuring finite returns on infinite horizons and modeling time preference.
11 What is the difference between on-policy and off-policy learning? 🔥 Advanced
Answer: On-policy learns the value of the policy being executed; off-policy learns about a different target policy while possibly following another behavior policy.
12 What is a Markov Decision Process (MDP)? 🔥 Advanced
Answer: An MDP formalizes RL with a set of states, actions, transition probabilities, reward function and discount factor, satisfying the Markov property.
13 What is policy gradient in RL? 🔥 Advanced
Answer: Policy gradient methods directly optimize a parameterized policy by following the gradient of expected return with respect to its parameters.
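A minimal REINFORCE-style sketch on a hypothetical two-armed bandit with a softmax policy (the reward function and hyperparameters are made up; for softmax, \(\nabla_\theta \log \pi(a) = \mathbf{1}_a - \pi\)):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()  # stabilized softmax
    e = np.exp(z)
    return e / e.sum()

theta = np.zeros(2)            # policy parameters (logits)
alpha = 0.1                    # learning rate
rng = np.random.default_rng(0)

for _ in range(500):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)             # sample action from policy
    reward = 1.0 if a == 1 else 0.0        # toy reward: action 1 is better
    grad_log = -probs                      # grad of log pi(a) wrt theta ...
    grad_log[a] += 1.0                     # ... equals one-hot(a) - probs
    theta += alpha * reward * grad_log     # ascend expected reward

probs = softmax(theta)  # policy concentrates on the rewarding action
```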
14 What is the role of a replay buffer in deep RL (e.g., DQN)? 🔥 Advanced
Answer: A replay buffer stores past transitions so the agent can sample mini-batches that are approximately i.i.d., breaking the temporal correlation of consecutive experiences and improving training stability.
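A minimal sketch of such a buffer using a bounded deque (class and method names are illustrative, not the DQN paper's API):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        # deque with maxlen evicts the oldest transition when full
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform sampling without replacement decorrelates the batch
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```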
15 Why is RL often more sample-inefficient than supervised learning? 🔥 Advanced
Answer: Because the agent must actively explore, rewards are sparse or delayed, and the data distribution depends on the evolving policy.
16 Give some real-world applications of RL. ⚡ Beginner
Answer: Examples: game playing (AlphaGo), robotics control, recommendation systems, dynamic pricing.
17 When would you avoid using RL in practice? 📊 Intermediate
Answer: When you have plenty of labeled data and no sequential decision aspect, supervised/unsupervised methods are usually simpler and more reliable.
18 What is reward shaping and why is it tricky? 🔥 Advanced
Answer: Reward shaping modifies the reward signal to help learning, but poor shaping can change the optimal policy or encourage unintended behavior.
19 What is the key difference between model-free and model-based RL? 🔥 Advanced
Answer: Model-free methods learn values/policies directly from experience; model-based methods learn a transition/reward model and plan with it.
20 What is the key message to remember about RL? ⚡ Beginner
Answer: RL is about trial-and-error learning to make sequential decisions; mastering states, actions, rewards and value/policy functions gives a solid foundation.

Quick Recap: Reinforcement Learning

Think in loops: observe state, act, get reward, update—this mental model helps you explain and design RL solutions clearly.