Data Science Tutorial
Master Data Science from the fundamentals of Python programming to advanced machine learning models, with practical implementations in pandas, scikit-learn, and TensorFlow.
- Python: programming basics
- Statistics: probability & inference
- Machine Learning: supervised & unsupervised
- Visualization: Matplotlib, Seaborn
Introduction to Data Science
Data Science is an interdisciplinary field that uses scientific methods, algorithms, and systems to extract insights and knowledge from structured and unstructured data. It combines statistics, computer science, and domain expertise.
Evolution of Data Science
- 1960s: Statistical computing
- 1980s: Data mining & databases
- 1990s: Knowledge discovery
- 2000s: Big data era (Hadoop, MapReduce)
- 2010s: Machine learning boom
- 2015+: Deep learning & AI revolution
- 2020+: MLOps & responsible AI
Why Does Data Science Matter?
- Data-driven decision making
- Predict trends and behaviors
- Automate and optimize processes
- Personalize user experiences
- Discover hidden patterns
- High demand, lucrative careers
Data Science Lifecycle (CRISP-DM)
Business Understanding → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment
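The lifecycle above can be sketched as a minimal end-to-end script; the dataset, model, and acceptance threshold here are illustrative assumptions, not part of CRISP-DM itself:

```python
# Minimal CRISP-DM walk-through on a toy dataset (illustrative only)
from sklearn.datasets import load_iris              # Data Understanding
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

iris = load_iris()                                  # Business goal: classify species
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)                                                   # Data Preparation
model = LogisticRegression(max_iter=200)            # Modeling
model.fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))  # Evaluation
print(f"Accuracy: {acc:.2f}")                       # Deploy only if acceptable
```

In a real project each stage is iterative: a poor evaluation result typically sends you back to data preparation or modeling rather than forward to deployment.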
Python for Data Science
Python is the primary programming language for data science, with rich libraries for data manipulation, analysis, and modeling.
# Data structures commonly used in DS
import numpy as np
import pandas as pd
# Lists, tuples, dictionaries
data_list = [10, 20, 30, 40, 50]
data_tuple = (1, 2, 3, 4, 5)
data_dict = {'name': 'Alice', 'age': 25, 'city': 'New York'}
# List comprehensions
squared = [x**2 for x in data_list]
print(f"Squared: {squared}")
# Functions for data processing
def clean_data(df):
    """Basic data cleaning function"""
    # Remove duplicates
    df = df.drop_duplicates()
    # Fill missing values in numeric columns only (plain df.mean() fails on strings)
    df = df.fillna(df.mean(numeric_only=True))
    return df
# Working with NumPy arrays
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(f"Array shape: {arr.shape}")
print(f"Mean: {arr.mean()}, Std: {arr.std()}")
Statistics & Probability
Statistical concepts form the backbone of data analysis and machine learning.
Descriptive Statistics
import numpy as np
import pandas as pd
from scipy import stats
data = np.array([12, 15, 14, 10, 18, 20, 22, 24, 17, 19])
# Central tendency
mean = np.mean(data)
median = np.median(data)
mode = stats.mode(data, keepdims=True).mode[0]
# Dispersion
variance = np.var(data, ddof=1)
std_dev = np.std(data, ddof=1)
iqr = np.percentile(data, 75) - np.percentile(data, 25)
print(f"Mean: {mean:.2f}")
print(f"Median: {median:.2f}")
print(f"Mode: {mode}")
print(f"Variance: {variance:.2f}")
print(f"Std Dev: {std_dev:.2f}")
print(f"IQR: {iqr:.2f}")
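The IQR computed above is commonly used to flag outliers via the 1.5×IQR rule; a quick sketch on the same data with one injected outlier:

```python
import numpy as np

# Same sample as above, plus 95 as an injected outlier
data = np.array([12, 15, 14, 10, 18, 20, 22, 24, 17, 19, 95])
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
# Points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]
print(f"Bounds: [{lower:.2f}, {upper:.2f}], outliers: {outliers}")
```

The 1.5 multiplier is a convention (Tukey's rule), not a theorem; some analyses use 3×IQR for "extreme" outliers.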
Probability Distributions
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# Normal distribution
normal_data = np.random.normal(loc=0, scale=1, size=1000)
# Binomial distribution
binom_data = np.random.binomial(n=10, p=0.5, size=1000)
# Poisson distribution
poisson_data = np.random.poisson(lam=3, size=1000)
# Calculate probabilities
z_score = 1.96
prob = stats.norm.cdf(z_score) - stats.norm.cdf(-z_score)
print(f"Probability within ±1.96σ: {prob:.3f}")
# Hypothesis testing example
t_stat, p_value = stats.ttest_1samp(normal_data, 0)
print(f"T-test p-value: {p_value:.4f}")
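A two-sample test comparing independent groups follows the same pattern; the groups below are synthetic, with an assumed mean shift of 1:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=10, scale=2, size=50)  # e.g. control
group_b = rng.normal(loc=11, scale=2, size=50)  # e.g. treatment
t_stat, p_value = stats.ttest_ind(group_a, group_b)
# Reject H0 (equal means) at alpha = 0.05 only if p_value < 0.05
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```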
Data Manipulation with Pandas
Pandas is the essential library for data wrangling and manipulation in Python.
import pandas as pd
import numpy as np
# Create DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'age': [25, 30, 35, np.nan, 28],
    'salary': [50000, 60000, 75000, 80000, 55000],
    'department': ['HR', 'IT', 'IT', 'Finance', 'HR']
})
# Explore data
print(df.head())
print(df.info())
print(df.describe())
# Handle missing values
df['age'] = df['age'].fillna(df['age'].mean())  # chained inplace fillna is deprecated
# Filtering
it_employees = df[df['department'] == 'IT']
high_earners = df[df['salary'] > 60000]
# Grouping and aggregation
dept_stats = df.groupby('department').agg({
    'salary': ['mean', 'max', 'min'],
    'age': 'mean'
}).round(2)
print("\nDepartment Statistics:")
print(dept_stats)
# Apply functions
df['salary_category'] = df['salary'].apply(
    lambda x: 'High' if x > 60000 else 'Medium' if x > 50000 else 'Low'
)
# Merge and join
df2 = pd.DataFrame({
    'department': ['HR', 'IT', 'Finance'],
    'budget': [100000, 500000, 300000]
})
merged = pd.merge(df, df2, on='department', how='left')
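Pivot tables offer another view of the same grouped data; a quick sketch using a pared-down copy of the employee DataFrame (re-created here so the snippet stands alone):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'salary': [50000, 60000, 75000, 80000, 55000],
    'department': ['HR', 'IT', 'IT', 'Finance', 'HR']
})
# Mean salary and headcount per department
pivot = pd.pivot_table(df, values='salary', index='department',
                       aggfunc=['mean', 'count'])
print(pivot)
```

`pivot_table` is essentially `groupby` plus reshaping; it becomes more useful than plain `groupby` once you add a `columns=` argument for two-way summaries.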
Data Visualization
Visualizing data helps uncover patterns and communicate insights effectively.
Matplotlib
import matplotlib.pyplot as plt
import numpy as np
# Sample data
x = np.linspace(0, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)
# Create figure with subplots
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 6))
# Line plot
ax1.plot(x, y1, 'b-', label='sin(x)', linewidth=2)
ax1.plot(x, y2, 'r--', label='cos(x)', linewidth=2)
ax1.set_title('Trigonometric Functions')
ax1.legend()
ax1.grid(True, alpha=0.3)
# Histogram
data = np.random.normal(0, 1, 1000)
ax2.hist(data, bins=30, alpha=0.7, color='steelblue', edgecolor='black')
ax2.set_title('Histogram of Normal Distribution')
ax2.set_xlabel('Value')
ax2.set_ylabel('Frequency')
plt.tight_layout()
plt.show()
Seaborn
import seaborn as sns
import matplotlib.pyplot as plt
# Load dataset
tips = sns.load_dataset('tips')
# Set style
sns.set_style('darkgrid')
# Create multiple plots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Scatter plot
sns.scatterplot(data=tips, x='total_bill', y='tip',
                hue='time', size='size', ax=axes[0, 0])
# Box plot
sns.boxplot(data=tips, x='day', y='total_bill',
            hue='sex', ax=axes[0, 1])
# Violin plot
sns.violinplot(data=tips, x='day', y='tip',
               hue='sex', split=True, ax=axes[1, 0])
# Heatmap (correlation)
corr = tips.select_dtypes(include=['float64', 'int64']).corr()
sns.heatmap(corr, annot=True, cmap='coolwarm',
            ax=axes[1, 1], fmt='.2f')
plt.tight_layout()
plt.show()
Machine Learning with scikit-learn
Machine learning enables computers to learn from data and make predictions or decisions.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import pandas as pd
# Load data
iris = load_iris()
X, y = iris.data, iris.target
feature_names = iris.feature_names
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)
# Predict
y_pred = model.predict(X_test_scaled)
y_proba = model.predict_proba(X_test_scaled)
# Evaluate
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
# Feature importance
importance = pd.DataFrame({
    'feature': feature_names,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)
print("\nFeature Importance:")
print(importance)
Supervised Learning Algorithms
Linear Regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
# Create and train model (assumes X_train/y_train hold a regression
# dataset with a continuous target, not the iris classification split above)
lr = LinearRegression()
lr.fit(X_train, y_train)
# Predict
y_pred = lr.predict(X_test)
# Metrics
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print(f"R² Score: {r2:.3f}")
print(f"MSE: {mse:.3f}")
print(f"Coefficients: {lr.coef_}")
Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
# Classification model
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
# Predictions
y_pred = log_reg.predict(X_test)
y_proba = log_reg.predict_proba(X_test)
# Metrics
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")
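The imported `roc_auc_score` applies most directly to binary problems; a self-contained sketch on scikit-learn's breast cancer dataset (model choice and `max_iter` are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
clf = LogisticRegression(max_iter=5000)  # raised so unscaled features converge
clf.fit(X_train, y_train)
# ROC AUC needs scores/probabilities for the positive class, not hard labels
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(f"ROC AUC: {auc:.3f}")
```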
Random Forest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=5,
    random_state=42
)
rf.fit(X_train, y_train)
# Feature importance
importances = rf.feature_importances_
print(f"Importances: {importances}")
Unsupervised Learning
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
# Generate sample data
X, y_true = make_blobs(n_samples=300, centers=4,
                       cluster_std=0.60, random_state=0)
# Find optimal clusters using elbow method
inertias = []
K_range = range(1, 10)
for k in K_range:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)
# Plot elbow curve
plt.figure(figsize=(8, 4))
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')
plt.grid(True)
plt.show()
# Apply K-means with optimal k
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
y_kmeans = kmeans.fit_predict(X)
# Visualize clusters
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0],
            kmeans.cluster_centers_[:, 1],
            marker='X', s=200, linewidths=3,
            color='red', label='Centroids')
plt.title('K-Means Clustering Results')
plt.legend()
plt.show()
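The silhouette score is a complementary way to validate the choice of k alongside the elbow method; a sketch on the same `make_blobs` data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Same synthetic data as the elbow-method example
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
scores = {}
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # in [-1, 1]; higher is better
    print(f"k={k}: silhouette = {scores[k]:.3f}")
```

Unlike inertia, which always decreases with k, the silhouette score peaks at a well-separated clustering, so it needs no "elbow" judgment call.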
Model Evaluation & Tuning
Proper evaluation and hyperparameter tuning are crucial for building robust models.
Cross-Validation
from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier
# 5-fold cross-validation (rf, X, y as defined in the sections above)
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(rf, X, y, cv=cv, scoring='accuracy')
print(f"CV Scores: {scores}")
print(f"Mean accuracy: {scores.mean():.3f} (+/- {scores.std()*2:.3f})")
Grid Search
from sklearn.model_selection import GridSearchCV
# Parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10]
}
# Grid search
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")
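After tuning, the refitted best model should be scored exactly once on the held-out test set; a self-contained sketch on iris, with a smaller illustrative grid than the one above:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {'n_estimators': [50, 100], 'max_depth': [None, 5]},
    cv=5, n_jobs=-1
)
grid_search.fit(X_train, y_train)
# best_estimator_ is already refit on the full training split
test_acc = grid_search.best_estimator_.score(X_test, y_test)
print(f"Test accuracy: {test_acc:.3f}")
```

Keeping the test set out of the search prevents the tuned hyperparameters from leaking information about it, so `test_acc` remains an honest generalization estimate.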
Deep Learning with TensorFlow/Keras
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
# Build model
model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(4,)),
    layers.Dropout(0.2),
    layers.Dense(32, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(3, activation='softmax')
])
# Compile model
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
# Train model
history = model.fit(
    X_train, y_train,
    epochs=50,
    batch_size=16,
    validation_split=0.2,
    verbose=0
)
# Evaluate
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(f"Test accuracy: {test_acc:.3f}")
# Plot training history
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(history.history['loss'], label='Training')
plt.plot(history.history['val_loss'], label='Validation')
plt.title('Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.subplot(1, 2, 2)
plt.plot(history.history['accuracy'], label='Training')
plt.plot(history.history['val_accuracy'], label='Validation')
plt.title('Model Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.tight_layout()
plt.show()
Data Science Applications
Fraud Detection
from sklearn.ensemble import IsolationForest
# Anomaly detection for fraud; transaction_data is a placeholder
# for your numeric transaction-feature matrix
iso_forest = IsolationForest(contamination=0.1, random_state=42)
outliers = iso_forest.fit_predict(transaction_data)  # -1 = anomaly, 1 = normal
Customer Churn
# Predict customer churn; customer_features and churn_labels are placeholders
from sklearn.ensemble import GradientBoostingClassifier
churn_model = GradientBoostingClassifier(random_state=42)
churn_model.fit(customer_features, churn_labels)
Sales Forecasting
# Time series forecasting; sales_data is a placeholder series
from statsmodels.tsa.arima.model import ARIMA
model = ARIMA(sales_data, order=(1, 1, 1))
results = model.fit()
forecast = results.forecast(steps=12)  # forecast the next 12 periods
Data Science Tools & Libraries
- Python: core programming
- Pandas: data manipulation
- Matplotlib: visualization
- scikit-learn: machine learning
- NumPy: numerical computing
- Seaborn: statistical visualization
- Statsmodels: statistical modeling
- TensorFlow: deep learning
Why Master Data Science?
- Highest demand in tech industry (Data Scientist, ML Engineer)
- Power modern applications (recommendation, personalization)
- Foundation for AI and Machine Learning
- Cross-domain applications (healthcare, finance, e-commerce)
- Rapidly evolving field with cutting-edge research
- Combine statistics, programming, and domain expertise