Apache Spark Basics
Learn how Spark processes large datasets in memory using RDDs and DataFrames, and see a few simple PySpark examples.
What is Apache Spark?
Apache Spark is a distributed, general-purpose engine for large-scale data processing. A single engine covers batch processing, streaming (Structured Streaming), machine learning (MLlib), and SQL workloads (Spark SQL).
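- Runs on clusters managed by YARN, Kubernetes, or Spark's own standalone manager (Mesos support is deprecated since Spark 3.2), and also in local mode on a single machine.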
- APIs in Scala, Python (PySpark), Java, R.
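- Keeps intermediate data in memory across operations where possible, which avoids repeated disk reads in iterative and interactive workloads.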
Simple PySpark Example
Create SparkSession and DataFrame
# Run this in a PySpark environment (or Jupyter with pyspark installed)
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder \
    .appName("SimpleExample") \
    .getOrCreate()

data = [
    ("Alice", 25, 50000),
    ("Bob", 30, 60000),
    ("Charlie", 35, 70000)
]
columns = ["name", "age", "salary"]
df = spark.createDataFrame(data, columns)
df.show()
# Filter: keep only rows with salary above 55000 (Bob and Charlie)
df_filtered = df.filter(col("salary") > 55000)
df_filtered.show()