ML Fundamentals
ML Concepts & Types
# Supervised Learning - Labeled data
- Classification: Predict categorical labels
- Regression: Predict continuous values
# Unsupervised Learning - No labels
- Clustering: Group similar instances
- Dimensionality Reduction: Simplify data
- Anomaly Detection: Find unusual instances
# Semi-supervised Learning - Some labels
- Combination of labeled and unlabeled data
# Reinforcement Learning - Learn from feedback
- Agent learns to make decisions
- Rewards and punishments
# Key ML Concepts
- Features: Input variables (X)
- Labels: Output variables (y) - supervised learning
- Training set: Data used to train the model
- Test set: Data used to evaluate the model
- Validation set: Data used to tune hyperparameters
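- Overfitting: Model fits noise in the training data and generalizes poorly to new data
- Underfitting: Model is too simple to capture the underlying patterns
- Bias: Error from overly simple or wrong assumptions (tends to cause underfitting)
- Variance: Error from sensitivity to small fluctuations in the training set (tends to cause overfitting)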
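The overfitting and underfitting entries above can be made concrete with a small experiment. The sketch below is illustrative only: the sine-shaped data, the noise level, and the polynomial degrees are assumptions chosen for the demo, not part of the original notes. It fits polynomials of increasing degree with NumPy and reports training and test error.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples from a sine curve (hypothetical toy data)."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)

def mse(y_true, y_hat):
    return float(np.mean((y_true - y_hat) ** 2))

# degree 1 is likely too simple (high bias); degree 9 is flexible enough
# to chase noise in only 20 training points (high variance)
results = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    results[degree] = (
        mse(y_train, np.polyval(coeffs, x_train)),  # training error
        mse(y_test, np.polyval(coeffs, x_test)),    # test error
    )

for degree, (train_err, test_err) in results.items():
    print(f"degree={degree}  train MSE={train_err:.4f}  test MSE={test_err:.4f}")
```

Training error can only go down as the degree grows, so a low training error by itself says little. The gap between training and test error is the usual signal of overfitting.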
# Model Evaluation Concepts
- Accuracy: Percentage of correct predictions
- Precision: True positives / (True positives + False positives)
- Recall: True positives / (True positives + False negatives)
- F1 Score: Harmonic mean of precision and recall
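- ROC Curve: True positive rate vs. false positive rate across decision thresholds
- AUC: Area Under the ROC Curve (1.0 = perfect ranking, 0.5 = random guessing)
- Cross-validation: Average the score over several train/validation splits (e.g. k-fold)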
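The precision, recall, and F1 formulas above can be checked by hand. This is a minimal sketch in plain Python; the helper name `precision_recall_f1` and the counts are hypothetical, picked only to keep the arithmetic easy.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# prints: precision=0.800 recall=0.667 f1=0.727
```

Because F1 is a harmonic mean, it stays closer to the lower of the two values than an arithmetic mean would.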
Data Preprocessing
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
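# Load data
df = pd.read_csv('data.csv')
# Handle missing values (choose one option)
# Option 1: remove rows with missing values before splitting into X and y
df = df.dropna()
X = df.drop('target', axis=1)
y = df['target']
# Option 2: impute missing values instead of dropping rows
# ('mean' and 'median' need numeric columns; 'most_frequent' also works for categorical ones)
imputer = SimpleImputer(strategy='mean')  # or 'median', 'most_frequent'
X_imputed = imputer.fit_transform(X)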
# Encode categorical variables
# Label encoding (intended for the target y; use OrdinalEncoder for ordinal features)
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)
# One-hot encoding (for nominal data)
X_encoded = pd.get_dummies(X, columns=['categorical_column'])
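# Feature scaling
# In practice, fit the scaler on the training split only and apply it to the
# test split with scaler.transform, so test statistics do not leak into training
# Standardization (mean=0, std=1)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Normalization (min=0, max=1)
from sklearn.preprocessing import MinMaxScaler
minmax_scaler = MinMaxScaler()
X_normalized = minmax_scaler.fit_transform(X)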
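To see what the two scalers compute, the same arithmetic can be written directly in NumPy. The matrix `X_demo` below is a hypothetical toy example, not data from the notes.

```python
import numpy as np

# Hypothetical feature matrix: 4 samples, 2 features on very different scales
X_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 500.0],
                   [4.0, 700.0]])

# Standardization: subtract the column mean, divide by the column std
# (population std, ddof=0, which is what StandardScaler uses)
X_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Min-max normalization: map each column onto [0, 1]
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
X_minmax = (X_demo - col_min) / (col_max - col_min)

print("standardized mean/std:", X_std.mean(axis=0), X_std.std(axis=0))
print("min-max min/max:      ", X_minmax.min(axis=0), X_minmax.max(axis=0))
```

After either transform both columns live on a comparable scale, which matters for distance-based and gradient-based models such as k-NN, SVMs, and neural networks.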
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Feature selection
from sklearn.feature_selection import SelectKBest, f_classif
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)
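The preprocessing steps above are often chained so that every statistic (medians, means, category lists) is learned from the training split only. One possible way to do that is scikit-learn's `Pipeline` with a `ColumnTransformer`. The sketch below uses a synthetic DataFrame as a stand-in for `data.csv`; the column names `age`, `income`, and `city` are assumptions made for the demo.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for data.csv (column names are hypothetical)
rng = np.random.default_rng(42)
n = 200
df_demo = pd.DataFrame({
    "age": rng.normal(40, 10, size=n),
    "income": rng.normal(50_000, 15_000, size=n),
    "city": rng.choice(["NY", "LA", "SF"], size=n),
})
df_demo.loc[::10, "age"] = np.nan  # inject some missing values
target = (df_demo["income"] > 50_000).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    df_demo, target, test_size=0.2, random_state=42, stratify=target
)

# Numeric columns: impute, then scale. Categorical column: one-hot encode.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])
clf = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

clf.fit(X_tr, y_tr)  # all preprocessing statistics come from X_tr only
test_accuracy = clf.score(X_te, y_te)
print(f"test accuracy: {test_accuracy:.3f}")
```

Calling `clf.predict` on new rows reuses the fitted imputer, scaler, and encoder, so the same transformations are applied at training and prediction time.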
Scikit-Learn
Classification Algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
# Logistic Regression
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)
# Decision Tree
tree_clf = DecisionTreeClassifier(max_depth=5, random_state=42)
tree_clf.fit(X_train, y_train)
# Random Forest
forest_clf = RandomForestClassifier(
    n_estimators=100, max_depth=5, random_state=42
)
forest_clf.fit(X_train, y_train)
# Support Vector Machine
svm_clf = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42)
svm_clf.fit(X_train, y_train)
# K-Nearest Neighbors
knn_clf = KNeighborsClassifier(n_neighbors=5)
knn_clf.fit(X_train, y_train)
# Gradient Boosting
gb_clf = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, random_state=42
)
gb_clf.fit(X_train, y_train)
# Naive Bayes
nb_clf = GaussianNB()
nb_clf.fit(X_train, y_train)
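With several classifiers available, a common next step is to compare them under cross-validation rather than on a single split. The sketch below assumes a synthetic dataset from `make_classification` and an arbitrary selection of three of the models listed above; the scores it prints say nothing about how these models behave on real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=42
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

# 5-fold cross-validation: each model is trained 5 times on 4/5 of the data
# and scored on the held-out fifth
cv_results = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_results[name] = scores
    print(f"{name:20s} mean={scores.mean():.3f} std={scores.std():.3f}")
```

Reporting the standard deviation next to the mean helps judge whether a difference between two models is larger than the fold-to-fold noise.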
# Model evaluation
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, confusion_matrix, classification_report
)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')
cm = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)
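The ROC curve and AUC listed under Model Evaluation Concepts need scores rather than hard class labels. A minimal sketch, assuming synthetic data and a logistic regression model (any classifier with `predict_proba` would work the same way):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustrative only)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

roc_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Use the predicted probability of the positive class, not predict()
y_score = roc_model.predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, y_score)  # points of the ROC curve
auc = roc_auc_score(y_te, y_score)               # area under that curve
print(f"AUC = {auc:.3f}")
```

Passing the output of `predict` instead of `predict_proba` would collapse the curve to a single threshold and usually understate the AUC.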
Regression & Clustering
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
# Linear Regression
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
# Ridge Regression (L2 regularization)
ridge_reg = Ridge(alpha=1.0, random_state=42)
ridge_reg.fit(X_train, y_train)
# Lasso Regression (L1 regularization)
lasso_reg = Lasso(alpha=0.1, random_state=42)
lasso_reg.fit(X_train, y_train)
# Random Forest Regressor
forest_reg = RandomForestRegressor(n_estimators=100, random_state=42)
forest_reg.fit(X_train, y_train)
# Support Vector Regression
svr_reg = SVR(kernel='rbf', C=1.0, gamma='scale')
svr_reg.fit(X_train, y_train)
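The practical difference between L2 (Ridge) and L1 (Lasso) regularization shows up in the coefficients: L1 can drive some of them exactly to zero. The sketch below assumes synthetic data from `make_regression` in which only 3 of 10 features are informative; the `alpha` values are arbitrary choices for the demo.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 10 features, only 3 of which actually drive the target
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=1.0, random_state=0
)

ridge = Ridge(alpha=1.0).fit(X_demo, y_demo)
lasso = Lasso(alpha=1.0).fit(X_demo, y_demo)

# Count coefficients that are exactly zero in each model
ridge_zeros = int(np.sum(ridge.coef_ == 0))
lasso_zeros = int(np.sum(lasso.coef_ == 0))
print("Ridge coefficients set exactly to zero:", ridge_zeros)
print("Lasso coefficients set exactly to zero:", lasso_zeros)
```

Ridge shrinks all coefficients toward zero without eliminating them, while Lasso doubles as a rough feature selector.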
# Regression evaluation metrics
from sklearn.metrics import (
    mean_absolute_error, mean_squared_error, r2_score
)
y_pred = lin_reg.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
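The regression metrics above reduce to a few lines of arithmetic. The sketch below recomputes them with NumPy on four hypothetical values so each number can be verified by hand.

```python
import numpy as np

# Hypothetical targets and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 8.0, 9.5])

errors = y_true - y_hat                          # [0.5, 0.0, -1.0, -0.5]
mae = np.mean(np.abs(errors))                    # (0.5 + 0 + 1 + 0.5) / 4 = 0.5
mse = np.mean(errors ** 2)                       # (0.25 + 0 + 1 + 0.25) / 4 = 0.375
rmse = np.sqrt(mse)                              # same units as the target
ss_res = np.sum(errors ** 2)                     # residual sum of squares = 1.5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares = 20
r2 = 1.0 - ss_res / ss_tot                       # 1 - 1.5 / 20 = 0.925

print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f} R2={r2:.3f}")
# prints: MAE=0.500 MSE=0.375 RMSE=0.612 R2=0.925
```

MSE penalizes the single error of 1.0 more heavily than MAE does, which is why RMSE (0.612) comes out above MAE (0.500) here.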
# Clustering algorithms
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
# K-Means Clustering
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)
labels = kmeans.labels_
# DBSCAN (Density-Based Clustering)
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan.fit(X)
# Hierarchical Clustering
agg_clustering = AgglomerativeClustering(n_clusters=3)
agg_clustering.fit(X)
# Gaussian Mixture Model
gmm = GaussianMixture(n_components=3, random_state=42)
gmm.fit(X)
labels = gmm.predict(X)
# Clustering evaluation metrics
from sklearn.metrics import silhouette_score, calinski_harabasz_score
silhouette_avg = silhouette_score(X, labels)
ch_score = calinski_harabasz_score(X, labels)
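The silhouette score can also guide the choice of `n_clusters`. The sketch below assumes three well-separated synthetic blobs (the centers and spread are made up for the demo) and tries several values of k. On real data the maximum is rarely this clear-cut.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three hypothetical, well-separated cluster centers
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_demo, _ = make_blobs(
    n_samples=300, centers=centers, cluster_std=0.5, random_state=42
)

# Fit K-Means for several k and record the mean silhouette score of each
silhouette_by_k = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    cluster_labels = km.fit_predict(X_demo)
    silhouette_by_k[k] = silhouette_score(X_demo, cluster_labels)

best_k = max(silhouette_by_k, key=silhouette_by_k.get)
for k, score in silhouette_by_k.items():
    print(f"k={k}  silhouette={score:.3f}")
print("best k by silhouette:", best_k)
```

Silhouette values range from -1 to 1; higher values mean points sit closer to their own cluster than to the nearest other cluster.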
Neural Networks
TensorFlow & Keras
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
# Define a simple sequential model
model = keras.Sequential([
    # Input layer
    layers.Dense(64, activation='relu', input_shape=(10,)),
    # Hidden layers
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(64, activation='relu'),
    # Output layer
    layers.Dense(1, activation='sigmoid')
])
# Compile the model
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)
# Display the model architecture
model.summary()
# Train the model
history = model.fit(
    X_train, y_train,
    batch_size=32,
    epochs=50,
    validation_split=0.2,
    verbose=1
)
# Evaluate the model
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(f'Test accuracy: {test_acc}')
# Make predictions
predictions = model.predict(X_test)
# Save and load models (HDF5 is a legacy format; recent Keras versions recommend a '.keras' filename)
model.save('my_model.h5')
loaded_model = keras.models.load_model('my_model.h5')
# Functional API for complex models
inputs = keras.Input(shape=(10,))
x = layers.Dense(64, activation='relu')(inputs)
x = layers.Dense(128, activation='relu')(x)
outputs = layers.Dense(1, activation='sigmoid')(x)
model = keras.Model(inputs=inputs, outputs=outputs)
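To make the Keras pieces less of a black box, the sketch below reproduces in NumPy what a single `Dense` layer with a sigmoid activation and the `binary_crossentropy` loss compute. It is a conceptual illustration, not Keras code: the helper names and the hand-picked weights are assumptions, and a real model learns its weights during `fit`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense(x, weights, bias, activation=None):
    """One fully connected layer: activation(x @ W + b)."""
    z = x @ weights + bias
    return activation(z) if activation is not None else z

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    return float(-np.mean(
        y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)
    ))

# Hypothetical hand-picked weights so the arithmetic is easy to follow
x = np.array([[1.0, 2.0]])        # one sample, two features
W = np.array([[0.5], [-0.25]])    # two inputs -> one output unit
b = np.array([0.0])

prob = dense(x, W, b, activation=sigmoid)            # logit = 0.5 - 0.5 = 0, sigmoid(0) = 0.5
loss = binary_crossentropy(np.array([[1.0]]), prob)  # -log(0.5), about 0.6931
print(f"probability={prob[0, 0]:.3f} loss={loss:.4f}")
# prints: probability=0.500 loss=0.6931
```

A prediction of 0.5 is the "no information" case for a binary target, so a loss near 0.693 (ln 2) is a handy baseline when reading training logs.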
PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
# Define a neural network
class NeuralNet(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(NeuralNet, self).__init__()
        self.layer1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.layer2 = nn.Linear(hidden_size, hidden_size)
        self.layer3 = nn.Linear(hidden_size, num_classes)
        self.dropout = nn.Dropout(0.2)

    def forward(self, x):
        out = self.layer1(x)
        out = self.relu(out)
        out = self.dropout(out)
        out = self.layer2(out)
        out = self.relu(out)
        out = self.layer3(out)
        return out
# Instantiate the model
model = NeuralNet(input_size=10, hidden_size=128, num_classes=1)
# Define loss function and optimizer
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Convert data to PyTorch tensors (np.asarray also accepts pandas DataFrames/Series)
X_train_tensor = torch.tensor(np.asarray(X_train), dtype=torch.float32)
y_train_tensor = torch.tensor(np.asarray(y_train), dtype=torch.float32)
X_test_tensor = torch.tensor(np.asarray(X_test), dtype=torch.float32)
y_test_tensor = torch.tensor(np.asarray(y_test), dtype=torch.float32)
# Create Dataset and DataLoader
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
# Training loop
num_epochs = 50
for epoch in range(num_epochs):
    for i, (inputs, labels) in enumerate(train_loader):
        # Forward pass
        outputs = model(inputs)
        loss = criterion(outputs, labels.view(-1, 1))
        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
# Model evaluation
model.eval()
with torch.no_grad():
    outputs = model(X_test_tensor)
    predicted = (torch.sigmoid(outputs) > 0.5).float()
    accuracy = (predicted == y_test_tensor.view(-1, 1)).float().mean()
print(f'Test Accuracy: {accuracy.item()}')
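The PyTorch model above returns raw logits and leaves the sigmoid to `BCEWithLogitsLoss`. The reason is numerical stability, which the NumPy sketch below tries to illustrate. It uses a standard stable rewrite of the loss, max(x, 0) - x*y + log(1 + exp(-|x|)); this is mathematically equivalent to sigmoid followed by cross-entropy, but it is not claimed to be PyTorch's exact internal code. The function names and input values are assumptions for the demo.

```python
import numpy as np

def bce_naive(logits, targets):
    """Sigmoid followed by cross-entropy; breaks down for large |logit|."""
    p = 1.0 / (1.0 + np.exp(-logits))
    return -(targets * np.log(p) + (1 - targets) * np.log(1 - p))

def bce_with_logits(logits, targets):
    """Stable form: max(x, 0) - x*y + log(1 + exp(-|x|))."""
    return (np.maximum(logits, 0) - logits * targets
            + np.log1p(np.exp(-np.abs(logits))))

# For moderate logits both versions agree
logits = np.array([-2.0, 0.0, 3.0])
targets = np.array([0.0, 1.0, 1.0])
print(bce_naive(logits, targets))
print(bce_with_logits(logits, targets))

# For an extreme logit the naive version overflows to infinity,
# because sigmoid(100) rounds to exactly 1.0 and log(1 - 1.0) = -inf
extreme_logit = np.array([100.0])
extreme_target = np.array([0.0])
with np.errstate(divide="ignore", invalid="ignore"):
    naive_extreme = bce_naive(extreme_logit, extreme_target)
stable_extreme = bce_with_logits(extreme_logit, extreme_target)
print(naive_extreme, stable_extreme)
```

This is also why the evaluation code applies `torch.sigmoid` explicitly: the network output is a logit, and probabilities are only needed when thresholding predictions.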
Additional Resources
Useful Resources & References
- Scikit-learn: Machine learning in Python
- TensorFlow: Open source ML framework by Google
- PyTorch: Open source ML framework originally developed by Facebook (now Meta)
- Keras: High-level neural networks API
- XGBoost: Optimized distributed gradient boosting
- LightGBM: Gradient boosting framework by Microsoft
- CatBoost: High-performance gradient boosting
- OpenCV: Computer vision library
- NLTK: Natural Language Toolkit
- SpaCy: Industrial-strength NLP
- Hugging Face Transformers: State-of-the-art NLP
# Online Courses & Tutorials
- Coursera: Machine Learning by Andrew Ng
- fast.ai: Practical deep learning for coders
- Kaggle Learn: Hands-on data science courses
- Google Machine Learning Crash Course
- Stanford CS229: Machine Learning course
- MIT Introduction to Deep Learning
# Books
- "Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow" by Aurélien Géron
- "Pattern Recognition and Machine Learning" by Christopher M. Bishop
- "Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
- "The Hundred-Page Machine Learning Book" by Andriy Burkov
- "Machine Learning Yearning" by Andrew Ng
- "Python Machine Learning" by Sebastian Raschka and Vahid Mirjalili
# Communities & Platforms
- Kaggle: Data science competitions and datasets
- GitHub: Open source ML projects
- arXiv: Latest research papers
- Towards Data Science: ML articles and tutorials
- Stack Overflow: Q&A for ML practitioners
- Reddit: r/MachineLearning, r/datascience
# Dataset Repositories
- UCI Machine Learning Repository
- Kaggle Datasets
- Google Dataset Search
- AWS Open Data Registry
- Microsoft Research Open Data
- Government open data portals
- Academic dataset collections
Comprehensive Machine Learning Notes & Cheatsheet Reference
This machine learning cheatsheet on Nikhil Learn Hub collects syntax, commands, and practical snippets for quick revision. It covers machine learning algorithms, models, data preprocessing, and evaluation techniques with simple examples.
Use the reference cards and examples above during coding sessions when you need a quick reminder. Follow the machine learning roadmap when you want structured lessons beyond one-page lookups.
Who Should Use This Cheatsheet
Students, self-taught developers, and professionals who need fast machine learning lookups during labs, debugging, or interview revision should keep this page bookmarked.