Reinforcement Learning Q&A: 20 Core Questions
Interview Prep


Short questions and answers on RL: agents, environments, rewards, policies, value functions and exploration.

1 What is reinforcement learning (RL)? ⚡ Beginner
Answer: RL is a learning paradigm where an agent learns by interacting with an environment to maximize cumulative reward over time.
2 What are the main components of an RL problem? ⚡ Beginner
Answer: Key components: states, actions, rewards, policy, value functions and environment dynamics.
3 What is a policy in RL? ⚡ Beginner
Answer: A policy is a mapping from states to actions (deterministic or stochastic) that defines the agent’s behavior.
4 What is the return in RL? 📊 Intermediate
Answer: Return is the cumulative reward from a time step onward, often a discounted sum \(G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}\).
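A minimal sketch of computing that discounted sum for a finite reward sequence (the backward recursion \(G_t = R_{t+1} + \gamma G_{t+1}\); the function name is illustrative):

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted return of a finite reward sequence, computed backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # G_t = R_{t+1} + gamma * G_{t+1}
    return g

# e.g. rewards [1, 1, 1] with gamma = 0.5 give 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```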
5 What is the difference between value function and Q-function? 📊 Intermediate
Answer: The value function \(V^\pi(s)\) gives expected return from a state; the Q-function \(Q^\pi(s,a)\) gives expected return from state–action pairs.
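The two are linked by \(V^\pi(s) = \sum_a \pi(a\mid s)\, Q^\pi(s,a)\), which a one-liner can illustrate (function name is illustrative):

```python
def state_value(q_row, policy_probs):
    """V^pi(s) as the policy-weighted average of Q^pi(s, a) over actions."""
    return sum(p * q for p, q in zip(policy_probs, q_row))

# Q-values [1.0, 3.0] under a uniform policy give V = 0.5*1 + 0.5*3 = 2.0
print(state_value([1.0, 3.0], [0.5, 0.5]))
```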
6 What is the exploration–exploitation trade-off? ⚡ Beginner
Answer: The agent must explore new actions to discover good rewards while also exploiting known good actions to maximize reward.
7 What is epsilon-greedy exploration? ⚡ Beginner
Answer: With probability \(1-\epsilon\) choose the best-known action, and with probability \(\epsilon\) choose a random action.
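A minimal sketch of that rule (function name is illustrative):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """With prob epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))          # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit
```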
8 What is Q-learning in one sentence? 📊 Intermediate
Answer: Q-learning is an off-policy temporal-difference algorithm that learns the optimal Q-function by bootstrapping from next-state estimates.
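A tabular sketch of the Q-learning update \(Q(s,a) \leftarrow Q(s,a) + \alpha\,[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]\) on a hypothetical 2-state toy MDP (states, rewards and exploration rate are made up for illustration):

```python
import numpy as np

# Toy MDP: from state 0, action 1 reaches terminal state 1 with reward 1;
# action 0 stays in state 0 with reward 0.
gamma, alpha = 0.9, 0.5
Q = np.zeros((2, 2))  # Q[state, action]
rng = np.random.default_rng(1)

for _ in range(200):  # episodes
    s = 0
    while s == 0:
        # epsilon-greedy behavior policy (off-policy: target is greedy)
        a = rng.integers(2) if rng.random() < 0.3 else int(Q[s].argmax())
        if a == 1:
            s_next, r, done = 1, 1.0, True
        else:
            s_next, r, done = 0, 0.0, False
        target = r + (0.0 if done else gamma * Q[s_next].max())
        Q[s, a] += alpha * (target - Q[s, a])  # TD update
        s = s_next
```

After training, `Q[0, 1]` approaches 1 (the immediate terminal reward) and dominates `Q[0, 0]`, so the greedy policy exits immediately.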
9 What is the Bellman equation for the optimal Q-function (informally)? 🔥 Advanced
Answer: It states that the optimal Q satisfies \(Q^*(s,a) = \mathbb{E}[R + \gamma \max_{a'} Q^*(s',a')]\), relating values of successive states.
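Repeatedly applying that equation as an update (value iteration on Q) converges to \(Q^*\). A sketch on a hypothetical deterministic 2-state MDP (transitions and rewards chosen arbitrarily for illustration):

```python
import numpy as np

# next_state[s, a] and reward[s, a] for a deterministic toy MDP:
# taking action 1 in state 0 yields reward 1 and moves to state 1.
next_state = np.array([[0, 1], [1, 0]])
reward = np.array([[0.0, 1.0], [0.0, 0.0]])
gamma = 0.9

Q = np.zeros((2, 2))
for _ in range(300):
    # Bellman optimality backup: Q(s,a) = r(s,a) + gamma * max_a' Q(s',a')
    Q = reward + gamma * Q[next_state].max(axis=2)
```

At the fixed point, cycling 0 → 1 → 0 gives \(Q^*(0,1) = 1/(1-\gamma^2) \approx 5.26\).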
10 What is the discount factor \(\gamma\) and why is it used? 📊 Intermediate
Answer: \(\gamma \in [0,1)\) makes future rewards worth less than immediate ones, ensuring finite returns on infinite horizons and modeling time preference.
11 What is the difference between on-policy and off-policy learning? 🔥 Advanced
Answer: On-policy learns the value of the policy being executed; off-policy learns about a different target policy while possibly following another behavior policy.
12 What is a Markov Decision Process (MDP)? 🔥 Advanced
Answer: An MDP formalizes RL with a set of states, actions, transition probabilities, reward function and discount factor, satisfying the Markov property.
13 What is policy gradient in RL? 🔥 Advanced
Answer: Policy gradient methods directly optimize a parameterized policy by following the gradient of expected return with respect to its parameters.
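A minimal REINFORCE-style sketch on a hypothetical two-armed bandit with a softmax policy (the reward function and hyperparameters are made up; for softmax, \(\nabla_\theta \log \pi(a) = \mathbf{1}_a - \pi\)):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()  # stabilized softmax
    e = np.exp(z)
    return e / e.sum()

theta = np.zeros(2)            # policy parameters (logits)
alpha = 0.1                    # learning rate
rng = np.random.default_rng(0)

for _ in range(500):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)             # sample action from policy
    reward = 1.0 if a == 1 else 0.0        # toy reward: action 1 is better
    grad_log = -probs                      # grad of log pi(a) wrt theta ...
    grad_log[a] += 1.0                     # ... equals one-hot(a) - probs
    theta += alpha * reward * grad_log     # ascend expected reward

probs = softmax(theta)  # policy concentrates on the rewarding action
```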
14 What is the role of a replay buffer in deep RL (e.g., DQN)? 🔥 Advanced
Answer: A replay buffer stores past transitions so the agent can sample mini-batches that are approximately i.i.d., breaking the temporal correlation of consecutive experiences and improving training stability.
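A minimal sketch of such a buffer using a bounded deque (class and method names are illustrative, not the DQN paper's API):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        # deque with maxlen evicts the oldest transition when full
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform sampling without replacement decorrelates the batch
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```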
15 Why is RL often more sample-inefficient than supervised learning? 🔥 Advanced
Answer: Because the agent must actively explore, rewards are sparse or delayed, and the data distribution depends on the evolving policy.
16 Give some real-world applications of RL. ⚡ Beginner
Answer: Examples: game playing (AlphaGo), robotics control, recommendation systems, dynamic pricing.
17 When would you avoid using RL in practice? 📊 Intermediate
Answer: When you have plenty of labeled data and no sequential decision aspect, supervised/unsupervised methods are usually simpler and more reliable.
18 What is reward shaping and why is it tricky? 🔥 Advanced
Answer: Reward shaping modifies the reward signal to help learning, but poor shaping can change the optimal policy or encourage unintended behavior.
19 What is the key difference between model-free and model-based RL? 🔥 Advanced
Answer: Model-free methods learn values/policies directly from experience; model-based methods learn a transition/reward model and plan with it.
20 What is the key message to remember about RL? ⚡ Beginner
Answer: RL is about trial-and-error learning to make sequential decisions; mastering states, actions, rewards and value/policy functions gives a solid foundation.

Quick Recap: Reinforcement Learning

Think in loops: observe state, act, get reward, update—this mental model helps you explain and design RL solutions clearly.