Data Science Complete Tutorial
Beginner to Advanced ML & AI Focus

Master Data Science from the fundamentals of Python programming to advanced machine learning models, with practical implementations in pandas, scikit-learn, and TensorFlow.

  • Python: programming basics
  • Statistics: probability & inference
  • Machine Learning: supervised & unsupervised
  • Visualization: Matplotlib, Seaborn

Introduction to Data Science

Data Science is an interdisciplinary field that uses scientific methods, algorithms, and systems to extract insights and knowledge from structured and unstructured data. It combines statistics, computer science, and domain expertise.

Evolution of Data Science
  • 1960s: Statistical computing
  • 1980s: Data mining & databases
  • 1990s: Knowledge discovery
  • 2000s: Big data era (Hadoop, MapReduce)
  • 2010s: Machine learning boom
  • 2015+: Deep learning & AI revolution
  • 2020+: MLOps & responsible AI
Why Data Science Matters
  • Data-driven decision making
  • Predict trends and behaviors
  • Automate and optimize processes
  • Personalize user experiences
  • Discover hidden patterns
  • High demand, lucrative careers

Data Science Lifecycle (CRISP-DM)

Business Understanding → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment
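The stages above can be sketched as a simple ordered list (an illustrative snippet only; CRISP-DM is a process model, not a library):

```python
# CRISP-DM stages, in order (illustrative only)
CRISP_DM_STAGES = [
    "Business Understanding",
    "Data Understanding",
    "Data Preparation",
    "Modeling",
    "Evaluation",
    "Deployment",
]

# Print the lifecycle as a numbered checklist
for i, stage in enumerate(CRISP_DM_STAGES, start=1):
    print(f"{i}. {stage}")
```

In practice the process is iterative: evaluation results often send you back to data preparation or modeling before deployment.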

Python for Data Science

Python is the primary programming language for data science, with rich libraries for data manipulation, analysis, and modeling.

Python Fundamentals for DS
# Data structures commonly used in DS
import numpy as np
import pandas as pd

# Lists, tuples, dictionaries
data_list = [10, 20, 30, 40, 50]
data_tuple = (1, 2, 3, 4, 5)
data_dict = {'name': 'Alice', 'age': 25, 'city': 'New York'}

# List comprehensions
squared = [x**2 for x in data_list]
print(f"Squared: {squared}")

# Functions for data processing
def clean_data(df):
    """Basic data cleaning function"""
    # Remove duplicates
    df = df.drop_duplicates()
    # Fill missing numeric values with column means
    # (numeric_only=True avoids a TypeError on string columns in pandas 2.x)
    df = df.fillna(df.mean(numeric_only=True))
    return df

# Working with NumPy arrays
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(f"Array shape: {arr.shape}")
print(f"Mean: {arr.mean()}, Std: {arr.std()}")

Statistics & Probability

Statistical concepts form the backbone of data analysis and machine learning.

Descriptive Statistics
import numpy as np
import pandas as pd
from scipy import stats

data = np.array([12, 15, 14, 10, 18, 20, 22, 24, 17, 19])

# Central tendency
mean = np.mean(data)
median = np.median(data)
mode = stats.mode(data, keepdims=True).mode[0]

# Dispersion
variance = np.var(data, ddof=1)
std_dev = np.std(data, ddof=1)
iqr = np.percentile(data, 75) - np.percentile(data, 25)

print(f"Mean: {mean:.2f}")
print(f"Median: {median:.2f}")
print(f"Mode: {mode}")
print(f"Variance: {variance:.2f}")
print(f"Std Dev: {std_dev:.2f}")
print(f"IQR: {iqr:.2f}")
Probability Distributions
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Normal distribution
normal_data = np.random.normal(loc=0, scale=1, size=1000)

# Binomial distribution
binom_data = np.random.binomial(n=10, p=0.5, size=1000)

# Poisson distribution
poisson_data = np.random.poisson(lam=3, size=1000)

# Calculate probabilities
z_score = 1.96
prob = stats.norm.cdf(z_score) - stats.norm.cdf(-z_score)
print(f"Probability within ±1.96σ: {prob:.3f}")

# Hypothesis testing example
t_stat, p_value = stats.ttest_1samp(normal_data, 0)
print(f"T-test p-value: {p_value:.4f}")
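The one-sample t-test above can be extended into an explicit decision rule; α = 0.05 is an assumed significance level, and the sample here is freshly generated standard-normal data:

```python
import numpy as np
from scipy import stats

# Fresh standard-normal sample (true mean is 0)
rng = np.random.default_rng(42)
sample = rng.normal(loc=0, scale=1, size=1000)

# H0: population mean equals 0
t_stat, p_value = stats.ttest_1samp(sample, popmean=0)

alpha = 0.05  # assumed significance level
if p_value < alpha:
    print(f"Reject H0 (p = {p_value:.4f})")
else:
    print(f"Fail to reject H0 (p = {p_value:.4f})")
```

Since the sample really is drawn from a zero-mean distribution, we expect to fail to reject H0 most of the time.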

Data Manipulation with Pandas

Pandas is the essential library for data wrangling and manipulation in Python.

Complete Pandas Workflow
import pandas as pd
import numpy as np

# Create DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'age': [25, 30, 35, np.nan, 28],
    'salary': [50000, 60000, 75000, 80000, 55000],
    'department': ['HR', 'IT', 'IT', 'Finance', 'HR']
})

# Explore data
print(df.head())
print(df.info())
print(df.describe())

# Handle missing values
# (assign back instead of inplace=True on a column, which is deprecated in pandas 2.x)
df['age'] = df['age'].fillna(df['age'].mean())

# Filtering
it_employees = df[df['department'] == 'IT']
high_earners = df[df['salary'] > 60000]

# Grouping and aggregation
dept_stats = df.groupby('department').agg({
    'salary': ['mean', 'max', 'min'],
    'age': 'mean'
}).round(2)

print("\nDepartment Statistics:")
print(dept_stats)

# Apply functions
df['salary_category'] = df['salary'].apply(
    lambda x: 'High' if x > 60000 else 'Medium' if x > 50000 else 'Low'
)

# Merge and join
df2 = pd.DataFrame({
    'department': ['HR', 'IT', 'Finance'],
    'budget': [100000, 500000, 300000]
})
merged = pd.merge(df, df2, on='department', how='left')
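Because the merge above uses how='left', every row of the left frame survives even when it has no match in the right frame; a minimal self-contained sketch (with small hypothetical frames) of that behaviour:

```python
import pandas as pd

# Hypothetical frames: 'Legal' exists only on the left side
left = pd.DataFrame({'department': ['HR', 'IT', 'Legal'],
                     'employee': ['Alice', 'Bob', 'Carol']})
right = pd.DataFrame({'department': ['HR', 'IT'],
                      'budget': [100000, 500000]})

# Left join: unmatched left rows are kept, with NaN for right-side columns
merged = pd.merge(left, right, on='department', how='left')
print(merged)
print(merged['budget'].isna().sum())  # prints 1: the unmatched 'Legal' row
```

An inner join (how='inner') would instead drop the 'Legal' row entirely.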

Data Visualization

Visualizing data helps uncover patterns and communicate insights effectively.

Matplotlib
import matplotlib.pyplot as plt
import numpy as np

# Sample data
x = np.linspace(0, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)

# Create figure with subplots
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 6))

# Line plot
ax1.plot(x, y1, 'b-', label='sin(x)', linewidth=2)
ax1.plot(x, y2, 'r--', label='cos(x)', linewidth=2)
ax1.set_title('Trigonometric Functions')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Histogram
data = np.random.normal(0, 1, 1000)
ax2.hist(data, bins=30, alpha=0.7, color='steelblue', edgecolor='black')
ax2.set_title('Histogram of Normal Distribution')
ax2.set_xlabel('Value')
ax2.set_ylabel('Frequency')

plt.tight_layout()
plt.show()
Seaborn
import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset
tips = sns.load_dataset('tips')

# Set style
sns.set_style('darkgrid')

# Create multiple plots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Scatter plot
sns.scatterplot(data=tips, x='total_bill', y='tip', 
                hue='time', size='size', ax=axes[0,0])

# Box plot
sns.boxplot(data=tips, x='day', y='total_bill', 
            hue='sex', ax=axes[0,1])

# Violin plot
sns.violinplot(data=tips, x='day', y='tip', 
               hue='sex', split=True, ax=axes[1,0])

# Heatmap (correlation)
corr = tips.select_dtypes(include=['float64', 'int64']).corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', 
            ax=axes[1,1], fmt='.2f')

plt.tight_layout()
plt.show()

Machine Learning with scikit-learn

Machine learning enables computers to learn from data and make predictions or decisions.

Complete ML Pipeline
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import pandas as pd

# Load data
iris = load_iris()
X, y = iris.data, iris.target
feature_names = iris.feature_names

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)

# Predict
y_pred = model.predict(X_test_scaled)
y_proba = model.predict_proba(X_test_scaled)

# Evaluate
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

# Feature importance
importance = pd.DataFrame({
    'feature': feature_names,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)
print("\nFeature Importance:")
print(importance)

Supervised Learning Algorithms

Linear Regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

# Create and train model
# (assumes X_train/y_train etc. come from a regression dataset with a
#  continuous target, not the iris class labels used above)
lr = LinearRegression()
lr.fit(X_train, y_train)

# Predict
y_pred = lr.predict(X_test)

# Metrics
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print(f"R² Score: {r2:.3f}")
print(f"MSE: {mse:.3f}")
print(f"Coefficients: {lr.coef_}")
Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

# Classification model (max_iter raised to avoid convergence warnings)
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

# Predictions
y_pred = log_reg.predict(X_test)
y_proba = log_reg.predict_proba(X_test)

# Metrics
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")
Random Forest
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=5,
    random_state=42
)
rf.fit(X_train, y_train)

# Feature importance
importances = rf.feature_importances_
print(f"Importances: {importances}")

Unsupervised Learning

K-Means Clustering
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Generate sample data
X, y_true = make_blobs(n_samples=300, centers=4, 
                       cluster_std=0.60, random_state=0)

# Find optimal clusters using elbow method
inertias = []
K_range = range(1, 10)
for k in K_range:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)  # explicit n_init avoids a FutureWarning
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)

# Plot elbow curve
plt.figure(figsize=(8, 4))
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')
plt.grid(True)
plt.show()

# Apply K-means with optimal k
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
y_kmeans = kmeans.fit_predict(X)

# Visualize clusters
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], 
            kmeans.cluster_centers_[:, 1], 
            marker='X', s=200, linewidths=3, 
            color='red', label='Centroids')
plt.title('K-Means Clustering Results')
plt.legend()
plt.show()

Model Evaluation & Tuning

Proper evaluation and hyperparameter tuning are crucial for building robust models.

Cross-Validation
from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

# Reload iris: X and y were overwritten by make_blobs in the clustering example
X, y = load_iris(return_X_y=True)

# 5-fold cross-validation (rf is the RandomForestClassifier fitted earlier)
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(rf, X, y, cv=cv, scoring='accuracy')

print(f"CV Scores: {scores}")
print(f"Mean accuracy: {scores.mean():.3f} (+/- {scores.std()*2:.3f})")
Grid Search
from sklearn.model_selection import GridSearchCV

# Parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10]
}

# Grid search
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")

Deep Learning with TensorFlow/Keras

Neural Network for Classification
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Build model (keras.Input is preferred over passing input_shape to the first layer)
model = keras.Sequential([
    keras.Input(shape=(4,)),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(32, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(3, activation='softmax')
])

# Compile model
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Train model (X_train, y_train: the iris split from the scikit-learn section)
history = model.fit(
    X_train, y_train,
    epochs=50,
    batch_size=16,
    validation_split=0.2,
    verbose=0
)

# Evaluate
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(f"Test accuracy: {test_acc:.3f}")

# Plot training history
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(history.history['loss'], label='Training')
plt.plot(history.history['val_loss'], label='Validation')
plt.title('Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(history.history['accuracy'], label='Training')
plt.plot(history.history['val_accuracy'], label='Validation')
plt.title('Model Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.tight_layout()
plt.show()

Data Science Applications

Fraud Detection
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for real transaction features
transaction_data = np.random.randn(1000, 2)

# Anomaly detection for fraud: -1 marks outliers, 1 marks inliers
iso_forest = IsolationForest(contamination=0.1, random_state=42)
outliers = iso_forest.fit_predict(transaction_data)
Customer Churn
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for real customer data
customer_features = np.random.rand(500, 4)
churn_labels = np.random.randint(0, 2, size=500)

# Predict customer churn
churn_model = GradientBoostingClassifier()
churn_model.fit(customer_features, churn_labels)
Sales Forecasting
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Synthetic stand-in for a real sales time series
sales_data = np.cumsum(np.random.randn(100)) + 50

# Time series forecasting with ARIMA(p=1, d=1, q=1)
model = ARIMA(sales_data, order=(1, 1, 1))
results = model.fit()
print(results.forecast(steps=3))

Data Science Tools & Libraries

  • Python: core programming
  • Pandas: data manipulation
  • Matplotlib: visualization
  • scikit-learn: machine learning
  • NumPy: numerical computing
  • Seaborn: statistical visualization
  • Statsmodels: statistical modeling
  • TensorFlow: deep learning

Why Master Data Science?

  • Among the most in-demand roles in tech (Data Scientist, ML Engineer)
  • Powers modern applications (recommendation, personalization)
  • Foundation for AI and Machine Learning
  • Cross-domain applications (healthcare, finance, e-commerce)
  • Rapidly evolving field with cutting-edge research
  • Combines statistics, programming, and domain expertise