Machine Learning Exercises
Use these exercises to reinforce your understanding of ML theory, algorithms and implementation details.
Topic‑1: ML Basics & Theory
- Define supervised, unsupervised and reinforcement learning. For each, give two real‑world examples.
- Explain the bias‑variance trade‑off. For three different models (linear model, decision tree, deep network), describe where each typically sits on this spectrum.
- List at least five common sources of data leakage in ML projects and propose a mitigation for each.
- For classification, compare Logistic Regression, k‑NN and Decision Trees in terms of interpretability, training speed and robustness to noise.
Topic‑2: Regression
- On a housing dataset, split into train/validation. Train Linear Regression, Ridge and Lasso; compare RMSE and discuss which features are shrunk or removed.
- Implement gradient descent for univariate Linear Regression in NumPy and verify that the solution matches the closed‑form solution.
- Create polynomial features (degree 2, 3) for a synthetic 1D dataset and show how training/validation error changes with degree and regularization.
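As a possible starting point for the gradient‑descent exercise above, here is a minimal sketch. The synthetic data, the learning rate and the step count are assumptions chosen for illustration, and the function names are hypothetical.

```python
import numpy as np

# Synthetic data (hypothetical): y = 3x + 2 plus Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = 3.0 * x + 2.0 + rng.normal(scale=0.1, size=200)


def fit_gradient_descent(x, y, lr=0.1, n_steps=2000):
    """Minimize mean squared error for y = w * x + b by batch gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(n_steps):
        residual = w * x + b - y
        grad_w = 2.0 * np.mean(residual * x)  # d(MSE)/dw
        grad_b = 2.0 * np.mean(residual)      # d(MSE)/db
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def fit_closed_form(x, y):
    """Ordinary least squares: slope = cov(x, y) / var(x)."""
    x_mean, y_mean = x.mean(), y.mean()
    w = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    b = y_mean - w * x_mean
    return w, b


w_gd, b_gd = fit_gradient_descent(x, y)
w_cf, b_cf = fit_closed_form(x, y)
print(f"gradient descent: w={w_gd:.4f}, b={b_gd:.4f}")
print(f"closed form:      w={w_cf:.4f}, b={b_cf:.4f}")
```

If the two solutions disagree, the usual suspects are a learning rate that is too large (divergence) or too few steps (incomplete convergence).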
Topic‑3: Classification & Metrics
- Write a function that, given y_true and y_pred, computes accuracy, precision, recall, F1‑score and confusion matrix (without using sklearn.metrics).
- On the Titanic or a similar dataset, train at least three classifiers (Logistic Regression, Random Forest, SVM) and compare ROC‑AUC and PR‑AUC.
- Plot ROC and Precision‑Recall curves for an imbalanced dataset and explain when PR‑AUC is more informative than ROC‑AUC.
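For the metrics exercise above, one way to structure the function is sketched below for the binary case only. The function name, the `positive` parameter and the sample labels are hypothetical. Returning 0.0 when a denominator is zero is a design choice, not the only reasonable convention.

```python
def binary_classification_report(y_true, y_pred, positive=1):
    """Compute confusion-matrix counts and derived metrics for binary labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    total = tp + fp + fn + tn
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
        "confusion_matrix": [[tn, fp], [fn, tp]],
    }


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
report = binary_classification_report(y_true, y_pred)
print(report)
```

Extending this to the multi‑class case means replacing the four counters with a full K×K confusion matrix and choosing an averaging scheme (macro, micro or weighted).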
Topic‑4: Data Preprocessing & Feature Engineering
- Given a mixed‑type tabular dataset, design a preprocessing pipeline that imputes missing values, scales numeric features and encodes categoricals. Implement it with ColumnTransformer + Pipeline.
- Create at least five new domain‑inspired features for the House Price dataset and show how they impact model performance.
- Demonstrate the effect of feature scaling on k‑NN and SVM by training with and without scaling and comparing results.
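The preprocessing‑pipeline exercise above could be laid out roughly as follows. This is a sketch under assumptions: the toy DataFrame, its column names and the choice of median / most‑frequent imputation and Logistic Regression are all illustrative, not prescribed by the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; column names and values are invented for illustration.
df = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0, 54.0, 2.0, 27.0, np.nan],
    "fare": [7.25, 71.3, 7.9, 53.1, 51.9, 21.1, 11.1, 30.0],
    "embarked": ["S", "C", "S", np.nan, "S", "Q", "S", "C"],
    "target": [0, 1, 1, 1, 0, 0, 1, 0],
})
numeric_cols = ["age", "fare"]
categorical_cols = ["embarked"]

# Numeric branch: fill missing values, then standardize.
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical branch: fill missing values, then one-hot encode.
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

X, y = df.drop(columns="target"), df["target"]
model.fit(X, y)
preds = model.predict(X)
print(preds)
```

Keeping every preprocessing step inside the pipeline means the imputation and scaling statistics are learned only from the data passed to `fit`, which helps avoid leakage when the pipeline is later cross‑validated.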
Topic‑5: Time Series
- Take a univariate time series (e.g., daily sales). Create lag and rolling‑window features and train a tree‑based regressor for one‑step‑ahead forecasting using proper time‑based splits.
- Implement a naive, seasonal naive and simple moving‑average forecast and compare them as baselines against your ML model.
- Perform a train/validation backtest with a rolling window (e.g., 3 folds) and compute MAE/RMSE for each fold.
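The baseline and backtesting exercises above might be combined along these lines. The synthetic series, the weekly season length, and the fold and window sizes are assumptions for illustration. The forecaster and `backtest` names are hypothetical.

```python
import numpy as np

# Hypothetical daily series: weekly seasonality plus a mild trend and noise.
rng = np.random.default_rng(42)
t = np.arange(140)
series = 100 + 0.2 * t + 10 * np.sin(2 * np.pi * t / 7) + rng.normal(scale=2.0, size=t.size)


def naive_forecast(history):
    """Predict the last observed value."""
    return history[-1]


def seasonal_naive_forecast(history, season=7):
    """Predict the value observed one season ago."""
    return history[-season]


def moving_average_forecast(history, window=7):
    """Predict the mean of the last `window` observations."""
    return history[-window:].mean()


def backtest(series, forecaster, n_folds=3, fold_size=14, train_size=84):
    """Rolling-window backtest with one-step-ahead forecasts; returns MAE per fold."""
    maes = []
    for fold in range(n_folds):
        start = train_size + fold * fold_size
        errors = []
        for i in range(start, start + fold_size):
            # Only the `train_size` points before index i are visible to the forecaster.
            history = series[i - train_size:i]
            errors.append(abs(series[i] - forecaster(history)))
        maes.append(float(np.mean(errors)))
    return maes


for name, forecaster in [("naive", naive_forecast),
                         ("seasonal naive", seasonal_naive_forecast),
                         ("moving average", moving_average_forecast)]:
    print(name, [round(m, 2) for m in backtest(series, forecaster)])
```

An ML model can be plugged in by writing a forecaster that fits on `history` (for example on lag features) and returns a single prediction. It is then scored on exactly the same folds as the baselines.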
Topic‑6: NLP
- Build a simple spam/ham SMS classifier using bag‑of‑words + Naive Bayes; then upgrade to TF‑IDF and compare metrics.
- Given a small text corpus, experiment with different tokenization strategies (word, subword, character) and discuss pros/cons.
- Use a pre‑trained transformer (e.g., BERT via Hugging Face) and fine‑tune it for sentiment analysis on a small dataset; measure improvement over classical models.
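To make the bag‑of‑words + Naive Bayes exercise above concrete, here is a small from‑scratch sketch of multinomial Naive Bayes with Laplace smoothing. The six‑message corpus is invented for illustration and the function names are hypothetical. In practice a library implementation and a real labeled SMS dataset would be used.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus; a real exercise would use a labeled SMS dataset.
train = [
    ("win a free prize now", "spam"),
    ("free cash claim your prize", "spam"),
    ("urgent call now to win", "spam"),
    ("are we still meeting for lunch", "ham"),
    ("call me when you get home", "ham"),
    ("see you at the meeting tomorrow", "ham"),
]


def tokenize(text):
    """Whitespace word tokenization on lowercased text."""
    return text.lower().split()


def train_naive_bayes(examples):
    """Collect per-class word counts and class frequencies (bag-of-words)."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for text, label in examples:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        class_counts[label] += 1
        vocab.update(tokens)
    return word_counts, class_counts, vocab


def predict(text, word_counts, class_counts, vocab, alpha=1.0):
    """Return the class with the highest log-posterior under Laplace smoothing."""
    n_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / n_docs)  # log prior
        for token in tokenize(text):
            if token not in vocab:
                continue  # ignore words never seen in training
            count = word_counts[label][token]
            score += math.log((count + alpha) / (total_words + alpha * len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


word_counts, class_counts, vocab = train_naive_bayes(train)
print(predict("free prize now", word_counts, class_counts, vocab))
print(predict("lunch meeting tomorrow", word_counts, class_counts, vocab))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters once messages get longer than this toy example.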
Topic‑7: Neural Networks & Deep Learning
- Implement a fully‑connected neural network for MNIST using a deep learning framework of your choice; experiment with different activations and regularization (dropout, weight decay).
- Plot training and validation loss curves; identify and fix overfitting using early stopping and data augmentation.
- Re‑implement forward and backward passes for a simple 2‑layer network in pure NumPy to solidify your understanding of backpropagation.
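For the pure‑NumPy backpropagation exercise above, a finite‑difference gradient check is a useful way to confirm that the backward pass is correct. The sketch below assumes a tanh hidden layer, a mean‑squared‑error loss and random toy data. These are illustrative choices and the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical tiny regression problem: 16 samples, 3 inputs, 1 output.
X = rng.normal(size=(16, 3))
y = rng.normal(size=(16, 1))
params = {
    "W1": rng.normal(scale=0.5, size=(3, 4)), "b1": np.zeros((1, 4)),
    "W2": rng.normal(scale=0.5, size=(4, 1)), "b2": np.zeros((1, 1)),
}


def forward(params, X, y):
    """Forward pass: linear -> tanh -> linear, with mean squared error loss."""
    z1 = X @ params["W1"] + params["b1"]
    h = np.tanh(z1)
    y_hat = h @ params["W2"] + params["b2"]
    loss = np.mean((y_hat - y) ** 2)
    return loss, (h, y_hat)


def backward(params, X, y, cache):
    """Backward pass: apply the chain rule layer by layer."""
    h, y_hat = cache
    n = X.shape[0]
    d_yhat = 2.0 * (y_hat - y) / n          # dL/dy_hat
    d_W2 = h.T @ d_yhat
    d_b2 = d_yhat.sum(axis=0, keepdims=True)
    d_h = d_yhat @ params["W2"].T
    d_z1 = d_h * (1.0 - h ** 2)             # tanh'(z) = 1 - tanh(z)^2
    d_W1 = X.T @ d_z1
    d_b1 = d_z1.sum(axis=0, keepdims=True)
    return {"W1": d_W1, "b1": d_b1, "W2": d_W2, "b2": d_b2}


def numerical_grad(params, X, y, name, eps=1e-6):
    """Central finite differences for one parameter array."""
    grad = np.zeros_like(params[name])
    for idx in np.ndindex(grad.shape):
        original = params[name][idx]
        params[name][idx] = original + eps
        loss_plus, _ = forward(params, X, y)
        params[name][idx] = original - eps
        loss_minus, _ = forward(params, X, y)
        params[name][idx] = original  # restore the parameter
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad


loss, cache = forward(params, X, y)
grads = backward(params, X, y, cache)
for name in params:
    diff = np.max(np.abs(grads[name] - numerical_grad(params, X, y, name)))
    print(f"{name}: max abs difference = {diff:.2e}")
```

Differences that are not close to zero usually point to a missing transpose, a forgotten activation derivative, or a mismatch between the loss normalization in the forward and backward passes.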
Topic‑8: Pandas, NumPy & Scikit‑Learn
- Using NumPy only, implement standardization and min‑max scaling functions and verify against sklearn’s StandardScaler and MinMaxScaler.
- With pandas, load a raw CSV, perform exploratory analysis (missing values, distributions, correlations) and summarize key data quality issues.
- Build a complete sklearn Pipeline (preprocessing + model), wrap it in GridSearchCV and report the best configuration and scores.
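For the NumPy scaling exercise above, the two functions could look roughly like this. The function names and the sample matrix are hypothetical. The population standard deviation (`ddof=0`) is used because that is what StandardScaler computes. Mapping constant columns to zero is an assumption made here to avoid division by zero.

```python
import numpy as np


def standardize(X):
    """Column-wise z-scores using the population standard deviation (ddof=0)."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # constant columns become all zeros
    return (X - mean) / std


def min_max_scale(X):
    """Column-wise rescaling to the [0, 1] range."""
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range = np.where(col_range == 0, 1.0, col_range)  # constant columns become all zeros
    return (X - col_min) / col_range


X = np.array([[1.0, 10.0, 5.0],
              [2.0, 20.0, 5.0],
              [3.0, 60.0, 5.0]])
print(standardize(X))
print(min_max_scale(X))
```

To complete the exercise, compare these outputs with `StandardScaler().fit_transform(X)` and `MinMaxScaler().fit_transform(X)` using `np.allclose`.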
Topic‑9: Mini Projects & MLOps
- Build a small REST API (FastAPI or Flask) that serves predictions from a trained model, including basic input validation and logging.
- Create a notebook that benchmarks multiple models on the same dataset with clear visualizations and a short written report of conclusions.
- Take an existing Kaggle notebook, refactor it into reusable functions/modules, and add at least two improvements (better features, tuning, or evaluation).