Data Science Exercises – Topic-wise Practice

Short, focused exercises to practice statistics, data wrangling, visualization, feature engineering, and model evaluation with Python.

1. Descriptive Statistics & Probability

Exercise 1.1 – Explore a numeric column

Topic: Descriptive stats | Level: Easy

Given a CSV with at least one numeric column (e.g., price, age, salary), compute the mean, median, standard deviation, minimum, and maximum, then plot a histogram. Write 3–4 bullet points summarizing what you see (for example skew, outliers, or more than one peak).
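One possible starting point is sketched below. The inline CSV, the column name price, and the output file name price_hist.png are made up for illustration; replace them with your own file and column.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data standing in for your own file; use pd.read_csv("your_file.csv") instead.
CSV_TEXT = """price
10
12
12
15
18
20
95
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))
col = df["price"]

summary = {
    "mean": col.mean(),
    "median": col.median(),
    "std": col.std(),  # sample standard deviation (ddof=1), the pandas default
    "min": col.min(),
    "max": col.max(),
}
for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")

# Histogram of the column, saved to disk rather than shown interactively.
fig, ax = plt.subplots()
ax.hist(col, bins=10)
ax.set_xlabel("price")
ax.set_ylabel("count")
ax.set_title("Distribution of price")
fig.savefig("price_hist.png")
plt.close(fig)
```

In this toy data the mean sits well above the median because of the single large value, which is the kind of observation the bullet points should capture.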

Exercise 1.2 – Confidence interval sketch

Topic: Confidence intervals | Level: Intermediate

For the same numeric column, compute a 95% confidence interval for the mean, assuming the values are approximately normally distributed. Explain in plain language what this interval means to a non-technical stakeholder.
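A minimal sketch of one common approach, the t-based interval (sample mean plus or minus a t critical value times the standard error), is shown below. The helper name mean_confidence_interval and the sample values are hypothetical, and SciPy is assumed to be installed.

```python
import numpy as np
from scipy import stats


def mean_confidence_interval(values, confidence=0.95):
    """Return (lower, upper) bounds of a t-based confidence interval for the mean."""
    x = np.asarray(values, dtype=float)
    n = x.size
    mean = x.mean()
    sem = x.std(ddof=1) / np.sqrt(n)  # standard error of the mean
    # Two-sided critical value from Student's t with n - 1 degrees of freedom.
    t_crit = stats.t.ppf(0.5 + confidence / 2, df=n - 1)
    return mean - t_crit * sem, mean + t_crit * sem


# Hypothetical sample of ages; replace with your own column, e.g. df["age"].dropna().
sample = [23, 25, 31, 35, 38, 41, 44, 52]
lower, upper = mean_confidence_interval(sample)
print(f"95% CI for the mean: ({lower:.2f}, {upper:.2f})")
```

For the stakeholder explanation, a careful phrasing is that the procedure captures the true average in about 95 out of 100 repeated samples, not that there is a 95% chance the true average lies in this particular interval.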

2. Data Wrangling with Pandas

Exercise 2.1 – Handle missing values

Topic: Missing data | Level: Easy

Load a dataset with missing values. Compute missing-value counts per column, then try at least two strategies: dropping rows that contain missing values versus imputing with the median (numeric columns) or the most frequent value (categorical columns). Compare how many rows each strategy keeps and explain why the difference might matter.
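The two strategies could be compared along the following lines. The small DataFrame and its column names are invented for illustration, and the per-column imputation rule is one reasonable choice among several.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with gaps in both numeric and categorical columns.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40, np.nan, 52],
    "city": ["Pune", "Delhi", None, "Delhi", "Delhi", "Pune"],
    "salary": [50_000, 62_000, 58_000, np.nan, 71_000, 90_000],
})

# Missing-value count per column.
missing_counts = df.isna().sum()
print(missing_counts)

# Strategy 1: drop every row that has at least one missing value.
dropped = df.dropna()
rows_lost = len(df) - len(dropped)
print(f"Dropping rows loses {rows_lost} of {len(df)} rows")

# Strategy 2: impute, using the median for numeric columns and the mode otherwise.
imputed = df.copy()
for column in imputed.columns:
    if pd.api.types.is_numeric_dtype(imputed[column]):
        imputed[column] = imputed[column].fillna(imputed[column].median())
    else:
        imputed[column] = imputed[column].fillna(imputed[column].mode().iloc[0])
print(f"Imputation keeps all {len(imputed)} rows")
```

Here dropping discards most of the rows even though each column is missing only one or two values, which is a typical argument for imputation on small datasets.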

Exercise 2.2 – Groupby and aggregation

Topic: Groupby | Level: Easy

Using a sales-like dataset, compute metrics such as total revenue by region, average order value by customer segment, and number of orders per month. Present the results as a small summary table.
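As a rough sketch, the three metrics might be computed as follows. The orders table and its column names (order_date, region, segment, revenue) are hypothetical, and average order value is taken here to be the mean revenue per order.

```python
import pandas as pd

# Hypothetical sales-like data; one row per order.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5, 6],
    "order_date": pd.to_datetime([
        "2024-01-05", "2024-01-20", "2024-02-02",
        "2024-02-14", "2024-02-28", "2024-03-09",
    ]),
    "region": ["North", "South", "North", "East", "South", "North"],
    "segment": ["Consumer", "Corporate", "Consumer",
                "Consumer", "Corporate", "Corporate"],
    "revenue": [120.0, 300.0, 80.0, 150.0, 250.0, 400.0],
})

# Total revenue by region.
revenue_by_region = orders.groupby("region")["revenue"].sum()

# Average order value by customer segment.
aov_by_segment = orders.groupby("segment")["revenue"].mean()

# Number of orders per calendar month.
month = orders["order_date"].dt.to_period("M")
orders_per_month = orders.groupby(month)["order_id"].count()

# Print each result as a small one-column table.
print(revenue_by_region.to_frame("total_revenue"))
print(aov_by_segment.round(2).to_frame("avg_order_value"))
print(orders_per_month.to_frame("n_orders"))
```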

3. Visualization

Exercise 3.1 – Compare distributions visually

Topic: Plots | Level: Easy

Create side‑by‑side box plots or violin plots for the same numeric variable across at least two categories (e.g., salary by department). Write a short interpretation of which group has the higher median and which shows the wider spread.
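A minimal sketch with matplotlib box plots follows. The data are synthetic (randomly generated salaries for two made-up departments), and the output file name is an assumption; the numeric summary is printed alongside the plot to support the written interpretation.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic salaries for two hypothetical departments.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "department": ["Engineering"] * 50 + ["Sales"] * 50,
    "salary": np.concatenate([
        rng.normal(90_000, 8_000, 50),
        rng.normal(70_000, 15_000, 50),
    ]),
})

# One array of salaries per department, in a stable (sorted) order.
groups = {name: g["salary"].to_numpy() for name, g in df.groupby("department")}

fig, ax = plt.subplots()
ax.boxplot(list(groups.values()))  # boxes are drawn at x = 1, 2, ...
ax.set_xticks(range(1, len(groups) + 1))
ax.set_xticklabels(list(groups.keys()))
ax.set_ylabel("salary")
ax.set_title("Salary by department")
fig.savefig("salary_by_department.png")
plt.close(fig)

# Numbers to back up the visual reading: median and interquartile range per group.
summary = df.groupby("department")["salary"].agg(
    median="median",
    iqr=lambda s: s.quantile(0.75) - s.quantile(0.25),
)
print(summary.round(0))
```

Swapping ax.boxplot for ax.violinplot gives the violin variant with the same grouped input.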

4. Modeling & Evaluation

Exercise 4.1 – Train/test split and baseline

Topic: Supervised learning | Level: Easy

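Pick a small classification dataset. Split it into train and test sets, fit a simple logistic regression or decision tree, and print the accuracy plus a confusion matrix. Write 2–3 sentences about which kinds of errors the model makes most often (for example, false positives versus false negatives).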
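One way this could look with scikit-learn is sketched below. The dataset (the bundled breast cancer data), the 75/25 split, the scaling step, and the majority-class reference model are all choices made for this illustration, not requirements of the exercise.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small binary classification dataset that ships with scikit-learn.
X, y = load_breast_cancer(return_X_y=True)

# Stratified split keeps the class balance similar in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Reference point: always predict the most frequent class.
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_accuracy = accuracy_score(y_test, dummy.predict(X_test))

# Simple model: standardize features, then logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows = true class, columns = predicted class

print(f"Majority-class accuracy: {baseline_accuracy:.3f}")
print(f"Logistic regression accuracy: {accuracy:.3f}")
print(cm)
```

The off-diagonal cells of the confusion matrix are the two error types to discuss: cm[0, 1] counts true class 0 predicted as 1, and cm[1, 0] the reverse.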

Exercise 4.2 – Cross-validation comparison

Topic: Model evaluation | Level: Intermediate

Compare two models (e.g., logistic regression vs random forest) using k‑fold cross‑validation on the same dataset, with the same folds for both. Report the mean and standard deviation of the F1 score across folds for each model.
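A sketch of such a comparison with scikit-learn's cross_val_score is given below. The dataset, the choice of 5 stratified folds, and the model settings are assumptions for illustration; the same splitter object is passed to both models so they are scored on identical folds.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Fixed random_state so both models see exactly the same 5 folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# One array of per-fold F1 scores per model.
results = {
    name: cross_val_score(model, X, y, cv=cv, scoring="f1")
    for name, model in models.items()
}

for name, scores in results.items():
    print(f"{name}: F1 = {scores.mean():.3f} +/- {scores.std():.3f}")
```

Putting the scaler inside the pipeline means it is refit on each training fold, which avoids leaking information from the held-out fold into preprocessing.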