Machine Learning Tutorial
Master Machine Learning from the fundamentals of data preprocessing to advanced algorithms such as Regression, Classification, Clustering, and Ensemble methods, with practical implementations in Scikit-Learn.
- Regression: Linear, Polynomial, Ridge
- Classification: SVM, Decision Trees, Naive Bayes
- Clustering: K-Means, DBSCAN
- Ensemble: Random Forest, XGBoost
Introduction to Machine Learning
Machine Learning is a subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed. It focuses on developing computer programs that can access data and use it to learn for themselves.
Evolution of Machine Learning
- 1950s: Alan Turing proposes the Turing Test
- 1957: Perceptron (Frank Rosenblatt)
- 1967: Nearest Neighbor algorithm
- 1980s: Decision Trees & Expert Systems
- 1995: Support Vector Machines (SVM)
- 2000s: Ensemble Methods & Big Data
- 2010s+: Deep Learning & AutoML
Why Machine Learning?
- Automate decision-making processes
- Uncover patterns and insights from data
- Adapt to new data independently
- Power recommendation systems and predictions
- Essential for modern data-driven applications
Types of Machine Learning
Supervised Learning: Learn from labeled data (Regression, Classification).
Unsupervised Learning: Find hidden patterns in unlabeled data (Clustering).
Reinforcement Learning: Learn through trial and error (Game AI).
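The difference between the first two paradigms shows up directly in code: a supervised model is fit on features and labels, while an unsupervised model is fit on features alone. A minimal sketch using scikit-learn's built-in iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: the model sees both features X and labels y
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X[:3]))  # predicted class labels

# Unsupervised: the model sees only the features X
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(km.labels_[:3])      # cluster assignments, no labels used
```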
Data Preprocessing & Feature Engineering
Data preprocessing is one of the most important steps in the ML pipeline. Real-world data is often messy and must be cleaned and transformed before it can be used for training.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer
# Load dataset
df = pd.read_csv('data.csv')
# Handle missing values
imputer = SimpleImputer(strategy='mean')
df[['numeric_column']] = imputer.fit_transform(df[['numeric_column']])
# Encode categorical variables
le = LabelEncoder()
df['category'] = le.fit_transform(df['category'])
# Feature scaling
scaler = StandardScaler()
features = ['feature1', 'feature2', 'feature3']
df[features] = scaler.fit_transform(df[features])
# Split data
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
Regression Algorithms
Regression algorithms predict a continuous output value based on input features.
Linear Regression
Models the relationship between dependent and independent variables by fitting a linear equation.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_}")
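On noise-free synthetic data the fitted attributes recover the generating equation exactly, which makes coef_ and intercept_ easy to interpret. A quick sketch (the data here is made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a known relationship: y = 3*x1 - 2*x2 + 5
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 5

model = LinearRegression().fit(X, y)
print(model.coef_)       # close to [3, -2]
print(model.intercept_)  # close to 5
```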
Polynomial & Ridge Regression
Polynomial features capture non-linear relationships, while Ridge regularization helps prevent overfitting.
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
# Polynomial features
poly = PolynomialFeatures(degree=3)
X_poly = poly.fit_transform(X_train)
# Ridge regression (L2 regularization)
ridge = Ridge(alpha=1.0)
ridge.fit(X_poly, y_train)
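One detail the snippet above leaves implicit: test data must go through the already-fitted transformer with transform, not fit_transform. A self-contained sketch on synthetic quadratic data (the data generation is an assumption for illustration):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X_train = rng.uniform(-2, 2, size=(80, 1))
X_test = rng.uniform(-2, 2, size=(20, 1))
y_train = X_train[:, 0] ** 2 + rng.normal(scale=0.1, size=80)

poly = PolynomialFeatures(degree=3)
X_poly_train = poly.fit_transform(X_train)  # fit on the training data...
X_poly_test = poly.transform(X_test)        # ...only transform the test data

ridge = Ridge(alpha=1.0).fit(X_poly_train, y_train)
preds = ridge.predict(X_poly_test)
```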
Classification Algorithms
Classification algorithms predict categorical class labels.
Logistic Regression
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
probs = clf.predict_proba(X_test)
Best for: Binary classification problems
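For binary problems, predict_proba returns one probability per class, which is useful when a decision threshold other than 0.5 is needed. A runnable sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
probs = clf.predict_proba(X_test)
# Each row holds P(class 0) and P(class 1), and the two sum to 1
print(probs[0])
```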
Decision Trees
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(max_depth=5)
dt.fit(X_train, y_train)
Best for: Interpretable models
SVM
from sklearn.svm import SVC
svm = SVC(kernel='rbf', C=1.0)
svm.fit(X_train, y_train)
Best for: High-dimensional spaces
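Because the RBF kernel is distance-based, SVMs are sensitive to feature scales; in practice the SVC is usually wrapped in a pipeline with a scaler. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Scale features before the distance-based RBF kernel sees them
svm = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0))
svm.fit(X_train, y_train)
acc = svm.score(X_test, y_test)
```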
Clustering Algorithms (Unsupervised)
Clustering groups similar data points together without using predefined labels.
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# Find optimal k using elbow method
inertias = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)
plt.plot(range(1, 11), inertias, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')
plt.show()
# Fit final model
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(X)
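DBSCAN, mentioned in the overview, takes a density-based approach: it needs no preset cluster count and marks low-density points as noise (label -1). It handles non-spherical shapes where K-Means struggles, as a sketch on scikit-learn's two-moons toy data shows (the eps and min_samples values here are illustrative choices):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-circles: a shape K-Means cannot separate well
X, _ = make_moons(n_samples=200, noise=0.05, random_state=42)

db = DBSCAN(eps=0.3, min_samples=5).fit(X)
labels = db.labels_  # -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters)
```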
Ensemble Methods
Ensemble methods combine multiple models to produce better results than any single model.
Random Forest
An ensemble of decision trees trained on different subsets of data.
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    random_state=42
)
rf.fit(X_train, y_train)
# Feature importance
importances = rf.feature_importances_
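The importances sum to 1, so sorting them gives a quick ranking of which features the forest relies on. A self-contained sketch on the iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(data.data, data.target)

# Rank features from most to least informative
ranked = sorted(zip(data.feature_names, rf.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```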
Gradient Boosting (XGBoost)
Sequentially builds models that correct errors of previous ones.
from xgboost import XGBClassifier
xgb = XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3
)
xgb.fit(X_train, y_train)
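Note that xgboost is a separate third-party package (installed with pip install xgboost). When it is not available, scikit-learn's own GradientBoostingClassifier follows the same idea of sequentially correcting residual errors, as a sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

# Same hyperparameter names as the XGBoost example above
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
gb.fit(X_train, y_train)
acc = gb.score(X_test, y_test)
```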
Model Evaluation & Validation
Proper evaluation is critical to ensure your model generalizes well to unseen data.
Classification Metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
Regression Metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
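A single train/test split can give a noisy estimate; cross-validation averages the metric over several splits and is a common way to check how well a model generalizes. A sketch using 5-fold cross-validation on the iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold CV: each fold serves once as the held-out evaluation set
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())
```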
Essential ML Tools & Libraries
Python
Core programming language
Pandas
Data manipulation
NumPy
Numerical computing
Matplotlib
Data visualization
Machine Learning Applications
Finance
- Credit Scoring
- Fraud Detection
- Algorithmic Trading
Healthcare
- Disease Diagnosis
- Drug Discovery
- Patient Risk Stratification
E-commerce
- Recommendation Systems
- Customer Segmentation
- Demand Forecasting