Apache Spark Q&A
1. What is Spark?
Answer: A distributed processing engine for batch, SQL, machine learning (ML), and streaming workloads.
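2. Why is Spark fast?
Answer: It keeps intermediate data in memory where possible and optimizes query plans before running them (the Catalyst optimizer for DataFrame and SQL code).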
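3. RDD vs. DataFrame?
Answer: An RDD is the low-level API for distributed collections of arbitrary objects; a DataFrame is a higher-level structured API with named columns that Spark can optimize.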
4. What is Spark SQL?
Answer: The Spark module for running SQL queries over structured data.
5. What is lazy evaluation?
Answer: Transformations only build an execution plan; Spark runs the plan when an action is triggered.
6. What is partitioning?
Answer: Dividing data into chunks (partitions) that executors process in parallel.
7. What causes a shuffle?
Answer: Operations such as groupBy and join that must redistribute data across partitions.
8. How do you optimize a Spark job?
Answer: Tune the number of partitions, cache only data that is reused, avoid unnecessary shuffles, and use broadcast joins when one side of the join is small.
9. What is Spark Structured Streaming?
Answer: A high-level streaming API built on the DataFrame abstraction.
10. One-line summary?
Answer: Spark is a scalable unified engine for large-scale data processing.
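The partitioning and shuffle answers above can be made concrete with a small plain-Python model. This is only a sketch of the idea, not Spark's API or its actual partitioner: the helper names (partition_of, shuffle_by_key, sum_by_key) and the use of CRC32 as the hash are assumptions chosen for illustration. It shows why a groupBy-style aggregation needs a shuffle: records with the same key start out in different partitions and must be moved together before they can be summed.

```python
import zlib
from collections import defaultdict


def partition_of(key, num_partitions):
    # Hypothetical partitioner: a deterministic hash, so the same key
    # always maps to the same target partition.
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def shuffle_by_key(partitions, num_partitions):
    """Redistribute (key, value) records so equal keys share a partition."""
    out = [[] for _ in range(num_partitions)]
    for part in partitions:
        for key, value in part:
            out[partition_of(key, num_partitions)].append((key, value))
    return out


def sum_by_key(partitions, num_partitions=2):
    """Model of a groupBy-sum: shuffle first, then aggregate each partition locally."""
    totals = {}
    for part in shuffle_by_key(partitions, num_partitions):
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        # Safe to merge: after the shuffle, each key lives in exactly one partition.
        totals.update(local)
    return totals


# Input is split across three partitions; key "a" appears in all of them.
input_partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5), ("a", 6)],
]
result = sum_by_key(input_partitions)
```

In this model every record crosses a partition boundary during the shuffle, which is the cost the optimization answer is trying to reduce.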
Spark interview context
Explain Spark by its execution model: lazy transformations, actions that trigger jobs, and distributed execution through partitions and executors.
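A toy model can help when explaining lazy transformations and actions out loud. The sketch below is plain Python, not Spark code; the LazyDataset class and its method names are hypothetical and only mimic the shape of the idea. Transformations (map, filter) record steps in a plan, and nothing runs until an action (collect, count) is called.

```python
class LazyDataset:
    """Toy dataset: transformations record steps, actions execute them."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: returns a new plan, does no work yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: returns a new plan, does no work yet.
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _run(self):
        # Execute the recorded plan one item at a time.
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    def collect(self):
        # Action: triggers execution and returns all results.
        return list(self._run())

    def count(self):
        # Action: triggers execution and returns the number of results.
        return sum(1 for _ in self._run())


calls = []  # records every time the map function actually runs


def double(x):
    calls.append(x)
    return x * 2


plan = LazyDataset(range(5)).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # still 0: only a plan exists so far
collected = plan.collect()        # the action runs the whole plan
```

Calling a second action such as plan.count() reruns the plan from the source in this model, which is the motivation for caching data that several actions reuse.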
Strong answers often include optimization examples such as reducing shuffles, using broadcast joins for small lookup tables, and caching only reused DataFrames.
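The broadcast-join example can be sketched in plain Python as follows. This is a simplified model under stated assumptions, not Spark's implementation: broadcast_join and the sample data are hypothetical. The small lookup table is copied to every partition of the large dataset, so each partition joins locally and no rows from the large side have to be shuffled. In PySpark, the corresponding hint is the broadcast function in pyspark.sql.functions.

```python
def broadcast_join(large_partitions, small_table):
    """Inner-join each large partition against a full copy of the small table."""
    # The "broadcast" copy: every partition gets the whole lookup table.
    lookup = dict(small_table)
    joined = []
    for part in large_partitions:
        # Each partition is joined locally; rows never leave their partition.
        joined.append([(k, v, lookup[k]) for k, v in part if k in lookup])
    return joined


# Large side: order amounts keyed by country code, split across two partitions.
orders = [
    [("us", 100), ("de", 40)],
    [("fr", 25), ("us", 60)],
]
# Small side: a lookup table that fits comfortably in memory.
countries = [("us", "United States"), ("de", "Germany")]

joined = broadcast_join(orders, countries)
```

The output keeps the same number of partitions as the large input, which is the point of the technique: only the small table is copied around.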