Data Science Exercises – Topic-wise Practice
Short, focused exercises to practice statistics, data wrangling, visualization, feature engineering, and model evaluation with Python.
1. Descriptive Statistics & Probability
Given a CSV with at least one numeric column (e.g., price, age, salary), compute the mean, median, standard deviation, minimum, and maximum, then plot a histogram. Write 3–4 bullet points summarizing what you see (shape, skew, outliers).
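A possible starting point for this exercise is sketched below. The inline CSV text and the column name price are placeholders, not part of the exercise; swap in your own file with pd.read_csv("your_file.csv").

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for a real CSV file.
csv_text = "price\n10\n12\n11\n15\n40\n13\n12\n14\n"
df = pd.read_csv(io.StringIO(csv_text))

col = df["price"]

# pandas' std uses the sample formula (ddof=1) by default.
summary = col.agg(["mean", "median", "std", "min", "max"])
print(summary)

# Histogram of the same column, saved to disk for the write-up.
fig, ax = plt.subplots()
ax.hist(col, bins=5)
ax.set_xlabel("price")
ax.set_ylabel("count")
ax.set_title("Distribution of price")
fig.savefig("price_hist.png")
```

With this toy data the mean sits well above the median, which is the kind of observation (a single large value pulling the mean up) your bullet points should call out.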
For the same numeric column, compute a 95% confidence interval for the mean assuming normality. Explain in plain language what this interval means for a non-technical stakeholder.
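One way to compute the interval is with the t-distribution, which is the usual choice when the population standard deviation is unknown. The sketch below assumes a small hypothetical sample held in a NumPy array; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Hypothetical sample; in the exercise this would be df["your_column"].dropna().
values = np.array([10, 12, 11, 15, 40, 13, 12, 14], dtype=float)

n = values.size
mean = values.mean()
sem = values.std(ddof=1) / np.sqrt(n)   # standard error of the mean
t_crit = stats.t.ppf(0.975, df=n - 1)   # two-sided 95% critical value

ci_low = mean - t_crit * sem
ci_high = mean + t_crit * sem
print(f"mean = {mean:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

For the stakeholder explanation, avoid saying "there is a 95% chance the true mean is in this range"; a safer phrasing is that the method used to build the range captures the true mean in about 95 out of 100 repeated samples.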
2. Data Wrangling with Pandas
Load a dataset with missing values. Compute missing-value counts per column, then try at least two strategies: dropping rows with missing values versus imputing with the median (numeric columns) or the most frequent value (categorical columns). Compare how many rows each strategy keeps and explain why the difference might matter.
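The two strategies can be compared side by side as in the sketch below. The toy DataFrame and its column names (age, city) are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in a numeric and a categorical column.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40, np.nan, 29],
    "city": ["Pune", "Delhi", None, "Delhi", "Delhi", "Pune"],
})

# Missing-value count per column.
missing_counts = df.isna().sum()
print(missing_counts)

# Strategy 1: drop every row that has at least one missing value.
dropped = df.dropna()

# Strategy 2: impute the median for numeric data and the mode for categorical data.
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())
imputed["city"] = imputed["city"].fillna(imputed["city"].mode().iloc[0])

print(f"rows: original={len(df)}, dropped={len(dropped)}, imputed={len(imputed)}")
```

Here dropping rows discards half of the data even though only three cells are missing, while imputation keeps every row at the cost of inventing values.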
Using a sales-like dataset, compute metrics such as total revenue by region, average order value by customer segment, and number of orders per month. Present the results as a small summary table.
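A minimal sketch of these aggregations with groupby follows. The orders table and its columns (order_date, region, segment, revenue) are assumptions standing in for whatever sales-like dataset you pick.

```python
import pandas as pd

# Hypothetical orders table, one row per order.
orders = pd.DataFrame({
    "order_date": pd.to_datetime(
        ["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-14", "2024-02-28"]
    ),
    "region": ["North", "South", "North", "South", "North"],
    "segment": ["Retail", "Retail", "Corporate", "Corporate", "Retail"],
    "revenue": [100.0, 250.0, 400.0, 150.0, 50.0],
})

# Total revenue by region.
revenue_by_region = orders.groupby("region")["revenue"].sum()

# Average order value by customer segment (one row per order, so a plain mean).
aov_by_segment = orders.groupby("segment")["revenue"].mean()

# Number of orders per calendar month.
orders_per_month = orders.groupby(orders["order_date"].dt.to_period("M")).size()

print(revenue_by_region, aov_by_segment, orders_per_month, sep="\n\n")
```

If your dataset has several line items per order, aggregate to one row per order first; otherwise the "average order value" is really an average line-item value.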
3. Visualization
Create side‑by‑side box plots or violin plots for the same numeric variable across at least two categories (e.g., salary by department). Write a short interpretation of which group has the higher median and which shows more spread (for example, a wider interquartile range).
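The plot can be drafted with plain matplotlib as sketched below. The salary figures are randomly generated placeholders, and the department names are invented for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data: two departments with different centers and spreads.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "department": ["Engineering"] * 50 + ["Sales"] * 50,
    "salary": np.concatenate([
        rng.normal(90_000, 8_000, 50),
        rng.normal(70_000, 15_000, 50),
    ]),
})

# One array of salaries per department, in sorted group order.
groups = {name: g["salary"].to_numpy() for name, g in df.groupby("department")}

fig, ax = plt.subplots()
ax.boxplot(list(groups.values()))  # boxes are drawn at positions 1..n
ax.set_xticks(range(1, len(groups) + 1))
ax.set_xticklabels(list(groups.keys()))
ax.set_ylabel("salary")
ax.set_title("Salary by department")
fig.savefig("salary_boxplot.png")

# Numbers to back up the written interpretation.
stats_table = df.groupby("department")["salary"].agg(["median", "std"])
print(stats_table)
```

Replacing ax.boxplot with ax.violinplot gives the violin version; the numeric table helps you check that what you read off the plot matches the data.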
4. Modeling & Evaluation
Pick a small classification dataset. Split it into train and test sets, fit a simple logistic regression or decision tree, and print the test accuracy plus a confusion matrix. Write 2–3 sentences about the most common error types, for example whether false positives or false negatives dominate.
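As one possible setup, the sketch below uses scikit-learn's built-in breast cancer dataset, which is an assumption; any small classification dataset works. The split ratio and random seed are arbitrary choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Stratify so both classes keep the same proportions in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Scaling inside the pipeline is fitted on the training data only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows = true class, columns = predicted class

print(f"accuracy: {acc:.3f}")
print(cm)
```

The off-diagonal cells of the matrix are the errors; which cell is larger tells you which mistake the model makes more often.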
Compare two models (e.g., logistic regression vs random forest) using k‑fold cross‑validation on the same dataset, with the same folds for both. Report the mean and standard deviation of the F1 score across folds for each model.
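A sketch of the comparison follows, again assuming scikit-learn's breast cancer dataset and 5 folds; both are illustrative choices. Sharing one StratifiedKFold object means both models are scored on identical splits.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Fixed seed so both models see exactly the same folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: F1 = {scores.mean():.3f} +/- {scores.std():.3f}")
```

When the two means differ by less than roughly one standard deviation, treat the models as comparable on this dataset instead of declaring a winner.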