Machine Learning Notes & Cheatsheet

ML Fundamentals

ML Concepts & Types

# Types of Machine Learning

# Supervised Learning - Labeled data

- Classification: Predict categorical labels

- Regression: Predict continuous values

# Unsupervised Learning - No labels

- Clustering: Group similar instances

- Dimensionality Reduction: Simplify data

- Anomaly Detection: Find unusual instances

# Semi-supervised Learning - Some labels

- Combination of labeled and unlabeled data

# Reinforcement Learning - Learn from feedback

- Agent learns to make decisions

- Rewards and punishments

# Key ML Concepts

- Features: Input variables (X)

- Labels: Output variables (y) - supervised learning

- Training set: Data used to train the model

- Test set: Data used to evaluate the model

- Validation set: Data used to tune hyperparameters

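- Overfitting: Model fits noise in the training data and generalizes poorly to new data

- Underfitting: Model is too simple to capture the underlying patterns

- Bias: Error from overly simple or incorrect assumptions (tends to cause underfitting)

- Variance: Error from sensitivity to small fluctuations in the training data (tends to cause overfitting)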
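
The overfitting and underfitting entries above can be illustrated with a small experiment. The sketch below is only an illustration under assumed conditions: the noisy sine data, the every-third-point hold-out, and the helper name `fit_and_score` are all hypothetical and not part of the original notes. It fits polynomials of increasing degree and compares training and test error.

```python
import numpy as np

# Assumed toy data: a noisy sine wave (hypothetical, for illustration only)
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 30))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.shape)

# Hold out every third point as a simple test set
test_mask = np.arange(len(x)) % 3 == 0
x_train, y_train = x[~test_mask], y[~test_mask]
x_test, y_test = x[test_mask], y[test_mask]


def fit_and_score(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse


for degree in (1, 3, 9):
    train_mse, test_mse = fit_and_score(degree)
    print(f"degree={degree}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
```

Training error can only go down as the degree grows, because each higher-degree model contains the lower-degree ones. A degree-1 fit that scores poorly on both splits suggests underfitting (high bias), while a high-degree fit whose test error is well above its training error suggests overfitting (high variance).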

# Model Evaluation Concepts

- Accuracy: Percentage of correct predictions

- Precision: True positives / (True positives + False positives)

- Recall: True positives / (True positives + False negatives)

- F1 Score: Harmonic mean of precision and recall

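- ROC Curve: True positive rate vs. false positive rate across classification thresholds

- AUC: Area Under the ROC Curve (1.0 = perfect ranking, 0.5 = random guessing)

- Cross-validation: Evaluate on several train/validation splits (folds) and average the scores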
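
As a worked example of the formulas above, the following sketch computes accuracy, precision, recall, and F1 by hand. The confusion-matrix counts are made up for illustration and do not come from any real model.

```python
# Hypothetical confusion-matrix counts for a binary classifier
tp, fp, fn, tn = 40, 10, 20, 30

accuracy = (tp + tn) / (tp + fp + fn + tn)           # 70 / 100 = 0.7
precision = tp / (tp + fp)                           # 40 / 50  = 0.8
recall = tp / (tp + fn)                              # 40 / 60  ~ 0.667
f1 = 2 * precision * recall / (precision + recall)   # 80 / 110 ~ 0.727

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Note that accuracy (0.7) hides the gap between precision and recall here, which is one reason F1 is often reported alongside it on imbalanced data.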

Data Preprocessing

# Import libraries

import numpy as np

import pandas as pd

from sklearn.preprocessing import StandardScaler, LabelEncoder

from sklearn.impute import SimpleImputer

from sklearn.model_selection import train_test_split

# Load data

df = pd.read_csv('data.csv')

X = df.drop('target', axis=1)

y = df['target']

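# Handle missing values (choose one option)

# Option 1: Remove rows with missing values (do this before creating X and y)

df = df.dropna()

# Option 2: Impute missing values ('mean' and 'median' need numeric columns)

imputer = SimpleImputer(strategy='mean')  # or 'median', 'most_frequent'

X_imputed = imputer.fit_transform(X)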

# Encode categorical variables

# Label encoding (for target labels y; use OrdinalEncoder for ordinal features)

label_encoder = LabelEncoder()

y_encoded = label_encoder.fit_transform(y)

# One-hot encoding (for nominal data)

X_encoded = pd.get_dummies(X, columns=['categorical_column'])

# Feature scaling (in practice, fit the scaler on the training split only to avoid data leakage)

# Standardization (mean=0, std=1)

scaler = StandardScaler()

X_scaled = scaler.fit_transform(X)

# Normalization (min=0, max=1)

from sklearn.preprocessing import MinMaxScaler

minmax_scaler = MinMaxScaler()

X_normalized = minmax_scaler.fit_transform(X)

# Train-test split

X_train, X_test, y_train, y_test = train_test_split(

    X, y, test_size=0.2, random_state=42, stratify=y

)

# Feature selection

from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(score_func=f_classif, k=10)

X_selected = selector.fit_transform(X, y)
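
One common way to keep preprocessing from seeing test data is to wrap the steps in a scikit-learn `Pipeline`, so the scaler is fitted only on whatever data `fit` receives. The sketch below is a minimal illustration under stated assumptions: it uses a synthetic dataset from `make_classification` as a stand-in for `data.csv`, and the names `X_demo`, `y_demo`, and `pipe` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for data.csv (an assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, random_state=42
)

# Split first, so the test rows never influence preprocessing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The scaler's mean and std are learned from the training split only
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)

# At prediction time the stored training statistics are reused
print(f"Test accuracy: {pipe.score(X_te, y_te):.3f}")
```

The same pipeline object can be passed to cross-validation utilities, in which case the scaler is refitted inside each fold.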

Scikit-Learn

Classification Algorithms

# Import classifiers

from sklearn.linear_model import LogisticRegression

from sklearn.tree import DecisionTreeClassifier

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

from sklearn.svm import SVC

from sklearn.neighbors import KNeighborsClassifier

from sklearn.naive_bayes import GaussianNB

# Logistic Regression

log_reg = LogisticRegression(random_state=42)

log_reg.fit(X_train, y_train)

y_pred = log_reg.predict(X_test)

# Decision Tree

tree_clf = DecisionTreeClassifier(max_depth=5, random_state=42)

tree_clf.fit(X_train, y_train)

# Random Forest

forest_clf = RandomForestClassifier(

    n_estimators=100, max_depth=5, random_state=42

)

forest_clf.fit(X_train, y_train)

# Support Vector Machine

svm_clf = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42)

svm_clf.fit(X_train, y_train)

# K-Nearest Neighbors

knn_clf = KNeighborsClassifier(n_neighbors=5)

knn_clf.fit(X_train, y_train)

# Gradient Boosting

gb_clf = GradientBoostingClassifier(

    n_estimators=100, learning_rate=0.1, random_state=42

)

gb_clf.fit(X_train, y_train)

# Naive Bayes

nb_clf = GaussianNB()

nb_clf.fit(X_train, y_train)

# Model evaluation

from sklearn.metrics import (

    accuracy_score, precision_score, recall_score,

    f1_score, confusion_matrix, classification_report

)

accuracy = accuracy_score(y_test, y_pred)

precision = precision_score(y_test, y_pred, average='weighted')

recall = recall_score(y_test, y_pred, average='weighted')

f1 = f1_score(y_test, y_pred, average='weighted')

cm = confusion_matrix(y_test, y_pred)

report = classification_report(y_test, y_pred)
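
A single train/test split can give a noisy estimate of these metrics. The sketch below shows one possible way to get a steadier number with 5-fold stratified cross-validation. The synthetic data and the variable names (`X_demo`, `y_demo`, `cv`, `clf`, `scores`) are assumptions made for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data used purely for illustration
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, random_state=42
)

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
clf = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=42)

# One weighted-F1 score per fold
scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring="f1_weighted")

print(f"F1 per fold: {scores.round(3)}")
print(f"Mean F1: {scores.mean():.3f} (std {scores.std():.3f})")
```

Reporting the mean together with the spread across folds gives a rough sense of how much the score depends on the particular split.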

Regression & Clustering

# Regression algorithms

from sklearn.linear_model import LinearRegression, Ridge, Lasso

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

from sklearn.svm import SVR

# Linear Regression

lin_reg = LinearRegression()

lin_reg.fit(X_train, y_train)

# Ridge Regression (L2 regularization)

ridge_reg = Ridge(alpha=1.0, random_state=42)

ridge_reg.fit(X_train, y_train)

# Lasso Regression (L1 regularization)

lasso_reg = Lasso(alpha=0.1, random_state=42)

lasso_reg.fit(X_train, y_train)

# Random Forest Regressor

forest_reg = RandomForestRegressor(n_estimators=100, random_state=42)

forest_reg.fit(X_train, y_train)

# Support Vector Regression

svr_reg = SVR(kernel='rbf', C=1.0, gamma='scale')

svr_reg.fit(X_train, y_train)

# Regression evaluation metrics

from sklearn.metrics import (

    mean_absolute_error, mean_squared_error, r2_score

)

y_pred = lin_reg.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)

mse = mean_squared_error(y_test, y_pred)

rmse = np.sqrt(mse)

r2 = r2_score(y_test, y_pred)
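
To make the regression metrics above concrete, here is a small hand computation with NumPy. The four target and prediction values are invented for the example, and the names `y_true_demo` and `y_pred_demo` are hypothetical.

```python
import numpy as np

# Made-up targets and predictions, for illustration only
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 5.5, 7.0, 10.0])

errors = y_true_demo - y_pred_demo

mae_demo = np.mean(np.abs(errors))      # mean absolute error
mse_demo = np.mean(errors ** 2)         # mean squared error
rmse_demo = np.sqrt(mse_demo)           # same units as the target

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_demo = 1 - ss_res / ss_tot

print(f"MAE={mae_demo:.3f} MSE={mse_demo:.3f} "
      f"RMSE={rmse_demo:.3f} R2={r2_demo:.3f}")
```

Here MAE is 0.5, MSE is 0.375, and R² is 0.925; the single error of 1.0 contributes most of the MSE, which shows how squaring weights large errors more heavily than MAE does.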

# Clustering algorithms

from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering

from sklearn.mixture import GaussianMixture

# K-Means Clustering

kmeans = KMeans(n_clusters=3, random_state=42)

kmeans.fit(X)

labels = kmeans.labels_

# DBSCAN (Density-Based Clustering)

dbscan = DBSCAN(eps=0.5, min_samples=5)

dbscan.fit(X)

# Hierarchical Clustering

agg_clustering = AgglomerativeClustering(n_clusters=3)

agg_clustering.fit(X)

# Gaussian Mixture Model

gmm = GaussianMixture(n_components=3, random_state=42)

gmm.fit(X)

labels = gmm.predict(X)

# Clustering evaluation metrics

from sklearn.metrics import silhouette_score, calinski_harabasz_score

silhouette_avg = silhouette_score(X, labels)

ch_score = calinski_harabasz_score(X, labels)
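
The silhouette score can also be used to compare candidate cluster counts. The following is a minimal sketch, assuming synthetic blob data from `make_blobs` and a hypothetical search range of 2 to 6 clusters. It is one heuristic among several, not a definitive way to choose `n_clusters`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three generated blobs (an assumption for illustration)
X_blobs, _ = make_blobs(
    n_samples=300, centers=3, cluster_std=0.6, random_state=42
)

# Fit K-Means for each candidate k and record the silhouette score
sil_scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels_k = km.fit_predict(X_blobs)
    sil_scores[k] = silhouette_score(X_blobs, labels_k)

# Pick the k with the highest average silhouette
best_k = max(sil_scores, key=sil_scores.get)

for k, s in sil_scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print(f"Best k by silhouette: {best_k}")
```

Silhouette values range from -1 to 1, with higher values indicating tighter, better-separated clusters. The score needs at least two clusters, which is why the search starts at k=2.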

Neural Networks

TensorFlow & Keras

# Import TensorFlow and Keras

import tensorflow as tf

from tensorflow import keras

from tensorflow.keras import layers

# Define a simple sequential model

model = keras.Sequential([

    # Input layer

    layers.Dense(64, activation='relu', input_shape=(10,)),

    # Hidden layers

    layers.Dense(128, activation='relu'),

    layers.Dropout(0.2),

    layers.Dense(64, activation='relu'),

    # Output layer

    layers.Dense(1, activation='sigmoid')

])

# Compile the model

model.compile(

    optimizer='adam',

    loss='binary_crossentropy',

    metrics=['accuracy']

)

# Display the model architecture

model.summary()

# Train the model

history = model.fit(

    X_train, y_train,

    batch_size=32,

    epochs=50,

    validation_split=0.2,

    verbose=1

)

# Evaluate the model

test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)

print(f'Test accuracy: {test_acc}')

# Make predictions

predictions = model.predict(X_test)

# Save and load models (HDF5 format; newer Keras versions recommend the .keras extension)

model.save('my_model.h5')

loaded_model = keras.models.load_model('my_model.h5')

# Functional API for complex models

inputs = keras.Input(shape=(10,))

x = layers.Dense(64, activation='relu')(inputs)

x = layers.Dense(128, activation='relu')(x)

outputs = layers.Dense(1, activation='sigmoid')(x)

model = keras.Model(inputs=inputs, outputs=outputs)

PyTorch

# Import PyTorch

import torch

import torch.nn as nn

import torch.optim as optim

from torch.utils.data import DataLoader, TensorDataset

# Define a neural network

class NeuralNet(nn.Module):

    def __init__(self, input_size, hidden_size, num_classes):

        super(NeuralNet, self).__init__()

        self.layer1 = nn.Linear(input_size, hidden_size)

        self.relu = nn.ReLU()

        self.layer2 = nn.Linear(hidden_size, hidden_size)

        self.layer3 = nn.Linear(hidden_size, num_classes)

        self.dropout = nn.Dropout(0.2)

    def forward(self, x):

        out = self.layer1(x)

        out = self.relu(out)

        out = self.dropout(out)

        out = self.layer2(out)

        out = self.relu(out)

        out = self.layer3(out)

        return out

# Instantiate the model

model = NeuralNet(input_size=10, hidden_size=128, num_classes=1)

# Define loss function and optimizer

criterion = nn.BCEWithLogitsLoss()

optimizer = optim.Adam(model.parameters(), lr=0.001)

# Convert data to PyTorch tensors (np.asarray also handles pandas DataFrames/Series)

X_train_tensor = torch.tensor(np.asarray(X_train), dtype=torch.float32)

y_train_tensor = torch.tensor(np.asarray(y_train), dtype=torch.float32)

X_test_tensor = torch.tensor(np.asarray(X_test), dtype=torch.float32)

y_test_tensor = torch.tensor(np.asarray(y_test), dtype=torch.float32)

# Create DataLoader

train_dataset = TensorDataset(X_train_tensor, y_train_tensor)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Training loop

num_epochs = 50

for epoch in range(num_epochs):

    for i, (inputs, labels) in enumerate(train_loader):

        # Forward pass

        outputs = model(inputs)

        loss = criterion(outputs, labels.view(-1, 1))

        # Backward and optimize

        optimizer.zero_grad()

        loss.backward()

        optimizer.step()

# Model evaluation

model.eval()

with torch.no_grad():

    outputs = model(X_test_tensor)

    predicted = (torch.sigmoid(outputs) > 0.5).float()

    accuracy = (predicted == y_test_tensor.view(-1, 1)).float().mean()

    print(f'Test Accuracy: {accuracy.item()}')
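
The model above outputs raw logits, and `nn.BCEWithLogitsLoss` applies the sigmoid internally, which is why `torch.sigmoid` only appears at evaluation time. The NumPy sketch below illustrates the idea with a commonly used numerically stable formula. The function names and sample values are hypothetical, and this is not PyTorch's actual implementation.

```python
import numpy as np


def sigmoid(z):
    """Map logits to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed directly from raw logits.

    Uses the stable form max(z, 0) - z*y + log(1 + exp(-|z|)),
    which avoids taking log(0) for very large or very small logits.
    """
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets, dtype=float)
    per_sample = (np.maximum(logits, 0) - logits * targets
                  + np.log1p(np.exp(-np.abs(logits))))
    return per_sample.mean()


# Made-up logits and binary targets
logits = np.array([2.0, -1.0, 0.5])
targets = np.array([1.0, 0.0, 1.0])

# Naive version: sigmoid first, then cross-entropy on probabilities
probs = sigmoid(logits)
naive = -np.mean(targets * np.log(probs) + (1 - targets) * np.log(1 - probs))

print(f"stable={bce_with_logits(logits, targets):.6f} naive={naive:.6f}")
```

For moderate logits the two versions agree. The stable form matters when logits are extreme, where the naive version can produce `inf` or `nan`.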

Additional Resources

Useful Resources & References

# Popular ML Libraries

- Scikit-learn: Machine learning in Python

- TensorFlow: Open source ML framework by Google

- PyTorch: Open source ML framework originally developed by Facebook (now Meta)

- Keras: High-level neural networks API

- XGBoost: Optimized distributed gradient boosting

- LightGBM: Gradient boosting framework by Microsoft

- CatBoost: High-performance gradient boosting

- OpenCV: Computer vision library

- NLTK: Natural Language Toolkit

- spaCy: Industrial-strength NLP

- Hugging Face Transformers: State-of-the-art NLP

# Online Courses & Tutorials

- Coursera: Machine Learning by Andrew Ng

- fast.ai: Practical deep learning for coders

- Kaggle Learn: Hands-on data science courses

- Google Machine Learning Crash Course

- Stanford CS229: Machine Learning course

- MIT Introduction to Deep Learning

# Books

- "Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow" by Aurélien Géron

- "Pattern Recognition and Machine Learning" by Christopher M. Bishop

- "Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville

- "The Hundred-Page Machine Learning Book" by Andriy Burkov

- "Machine Learning Yearning" by Andrew Ng

- "Python Machine Learning" by Sebastian Raschka and Vahid Mirjalili

# Communities & Platforms

- Kaggle: Data science competitions and datasets

- GitHub: Open source ML projects

- arXiv: Latest research papers

- Towards Data Science: ML articles and tutorials

- Stack Overflow: Q&A for ML practitioners

- Reddit: r/MachineLearning, r/datascience

# Dataset Repositories

- UCI Machine Learning Repository

- Kaggle Datasets

- Google Dataset Search

- AWS Open Data Registry

- Microsoft Research Open Data

- Government open data portals

- Academic dataset collections
