Data Science Tutorial
Master Data Science from the fundamentals of Python programming to advanced machine learning models, with practical implementations in pandas, scikit-learn, and TensorFlow.
- Python: programming basics
- Statistics: probability & inference
- Machine Learning: supervised & unsupervised
- Visualization: Matplotlib, Seaborn
Introduction to Data Science
Data Science is an interdisciplinary field that uses scientific methods, algorithms, and systems to extract insights and knowledge from structured and unstructured data. It combines statistics, computer science, and domain expertise.
Evolution of Data Science
- 1960s: Statistical computing
- 1980s: Data mining & databases
- 1990s: Knowledge discovery
- 2000s: Big data era (Hadoop, MapReduce)
- 2010s: Machine learning boom
- 2015+: Deep learning & AI revolution
- 2020+: MLOps & responsible AI
Why Does Data Science Matter?
- Data-driven decision making
- Predict trends and behaviors
- Automate and optimize processes
- Personalize user experiences
- Discover hidden patterns
- High demand, lucrative careers
Data Science Lifecycle (CRISP-DM)
Business Understanding → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment
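The lifecycle above can be sketched as a minimal end-to-end script; the dataset, model, and acceptance threshold here are illustrative assumptions, not part of CRISP-DM itself:

```python
# Minimal CRISP-DM walk-through on a toy dataset (illustrative only)
from sklearn.datasets import load_iris              # Data Understanding
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

iris = load_iris()                                  # Business goal: classify species
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)                                                   # Data Preparation
model = LogisticRegression(max_iter=200)            # Modeling
model.fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))  # Evaluation
print(f"Accuracy: {acc:.2f}")                       # Deploy only if acceptable
```

In a real project each stage is iterative: a poor evaluation result typically sends you back to data preparation or modeling rather than forward to deployment.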
Python for Data Science
Python is the primary programming language for data science, with rich libraries for data manipulation, analysis, and modeling.
# Data structures commonly used in DS
import numpy as np
import pandas as pd
# Lists, tuples, dictionaries
data_list = [10, 20, 30, 40, 50]
data_tuple = (1, 2, 3, 4, 5)
data_dict = {'name': 'Alice', 'age': 25, 'city': 'New York'}
# List comprehensions
squared = [x**2 for x in data_list]
print(f"Squared: {squared}")
# Functions for data processing
def clean_data(df):
    """Basic data cleaning function"""
    # Remove duplicates
    df = df.drop_duplicates()
    # Fill missing values in numeric columns only (plain df.mean() fails on strings)
    df = df.fillna(df.mean(numeric_only=True))
    return df
# Working with NumPy arrays
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(f"Array shape: {arr.shape}")
print(f"Mean: {arr.mean()}, Std: {arr.std()}")
Statistics & Probability
Statistical concepts form the backbone of data analysis and machine learning.
Descriptive Statistics
import numpy as np
import pandas as pd
from scipy import stats
data = np.array([12, 15, 14, 10, 18, 20, 22, 24, 17, 19])
# Central tendency
mean = np.mean(data)
median = np.median(data)
mode = stats.mode(data, keepdims=True).mode[0]
# Dispersion
variance = np.var(data, ddof=1)
std_dev = np.std(data, ddof=1)
iqr = np.percentile(data, 75) - np.percentile(data, 25)
print(f"Mean: {mean:.2f}")
print(f"Median: {median:.2f}")
print(f"Mode: {mode}")
print(f"Variance: {variance:.2f}")
print(f"Std Dev: {std_dev:.2f}")
print(f"IQR: {iqr:.2f}")
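The IQR computed above is commonly used to flag outliers via the 1.5×IQR rule; a quick sketch on the same data with one injected outlier:

```python
import numpy as np

# Same sample as above, plus 95 as an injected outlier
data = np.array([12, 15, 14, 10, 18, 20, 22, 24, 17, 19, 95])
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
# Points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]
print(f"Bounds: [{lower:.2f}, {upper:.2f}], outliers: {outliers}")
```

The 1.5 multiplier is a convention (Tukey's rule), not a theorem; some analyses use 3×IQR for "extreme" outliers.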
Probability Distributions
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# Normal distribution
normal_data = np.random.normal(loc=0, scale=1, size=1000)
# Binomial distribution
binom_data = np.random.binomial(n=10, p=0.5, size=1000)
# Poisson distribution
poisson_data = np.random.poisson(lam=3, size=1000)
# Calculate probabilities
z_score = 1.96
prob = stats.norm.cdf(z_score) - stats.norm.cdf(-z_score)
print(f"Probability within ±1.96σ: {prob:.3f}")
# Hypothesis testing example
t_stat, p_value = stats.ttest_1samp(normal_data, 0)
print(f"T-test p-value: {p_value:.4f}")
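A two-sample test comparing independent groups follows the same pattern; the groups below are synthetic, with an assumed mean shift of 1:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=10, scale=2, size=50)  # e.g. control
group_b = rng.normal(loc=11, scale=2, size=50)  # e.g. treatment
t_stat, p_value = stats.ttest_ind(group_a, group_b)
# Reject H0 (equal means) at alpha = 0.05 only if p_value < 0.05
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```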
Data Manipulation with Pandas
Pandas is the essential library for data wrangling and manipulation in Python.
import pandas as pd
import numpy as np
# Create DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'age': [25, 30, 35, np.nan, 28],
    'salary': [50000, 60000, 75000, 80000, 55000],
    'department': ['HR', 'IT', 'IT', 'Finance', 'HR']
})
# Explore data
print(df.head())
print(df.info())
print(df.describe())
# Handle missing values
df['age'] = df['age'].fillna(df['age'].mean())  # chained inplace fillna is deprecated
# Filtering
it_employees = df[df['department'] == 'IT']
high_earners = df[df['salary'] > 60000]
# Grouping and aggregation
dept_stats = df.groupby('department').agg({
    'salary': ['mean', 'max', 'min'],
    'age': 'mean'
}).round(2)
print("\nDepartment Statistics:")
print(dept_stats)
# Apply functions
df['salary_category'] = df['salary'].apply(
    lambda x: 'High' if x > 60000 else 'Medium' if x > 50000 else 'Low'
)
# Merge and join
df2 = pd.DataFrame({
    'department': ['HR', 'IT', 'Finance'],
    'budget': [100000, 500000, 300000]
})
merged = pd.merge(df, df2, on='department', how='left')
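Pivot tables offer another view of the same grouped data; a quick sketch using a pared-down copy of the employee DataFrame (re-created here so the snippet stands alone):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'salary': [50000, 60000, 75000, 80000, 55000],
    'department': ['HR', 'IT', 'IT', 'Finance', 'HR']
})
# Mean salary and headcount per department
pivot = pd.pivot_table(df, values='salary', index='department',
                       aggfunc=['mean', 'count'])
print(pivot)
```

`pivot_table` is essentially `groupby` plus reshaping; it becomes more useful than plain `groupby` once you add a `columns=` argument for two-way summaries.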
Data Visualization
Visualizing data helps uncover patterns and communicate insights effectively.
Matplotlib
import matplotlib.pyplot as plt
import numpy as np
# Sample data
x = np.linspace(0, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)
# Create figure with subplots
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 6))
# Line plot
ax1.plot(x, y1, 'b-', label='sin(x)', linewidth=2)
ax1.plot(x, y2, 'r--', label='cos(x)', linewidth=2)
ax1.set_title('Trigonometric Functions')
ax1.legend()
ax1.grid(True, alpha=0.3)
# Histogram
data = np.random.normal(0, 1, 1000)
ax2.hist(data, bins=30, alpha=0.7, color='steelblue', edgecolor='black')
ax2.set_title('Histogram of Normal Distribution')
ax2.set_xlabel('Value')
ax2.set_ylabel('Frequency')
plt.tight_layout()
plt.show()
Seaborn
import seaborn as sns
import matplotlib.pyplot as plt
# Load dataset
tips = sns.load_dataset('tips')
# Set style
sns.set_style('darkgrid')
# Create multiple plots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Scatter plot
sns.scatterplot(data=tips, x='total_bill', y='tip',
                hue='time', size='size', ax=axes[0, 0])
# Box plot
sns.boxplot(data=tips, x='day', y='total_bill',
            hue='sex', ax=axes[0, 1])
# Violin plot
sns.violinplot(data=tips, x='day', y='tip',
               hue='sex', split=True, ax=axes[1, 0])
# Heatmap (correlation)
corr = tips.select_dtypes(include=['float64', 'int64']).corr()
sns.heatmap(corr, annot=True, cmap='coolwarm',
            ax=axes[1, 1], fmt='.2f')
plt.tight_layout()
plt.show()
Machine Learning with scikit-learn
Machine learning enables computers to learn from data and make predictions or decisions.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import pandas as pd
# Load data
iris = load_iris()
X, y = iris.data, iris.target
feature_names = iris.feature_names
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)
# Predict
y_pred = model.predict(X_test_scaled)
y_proba = model.predict_proba(X_test_scaled)
# Evaluate
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
# Feature importance
importance = pd.DataFrame({
    'feature': feature_names,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)
print("\nFeature Importance:")
print(importance)
Supervised Learning Algorithms
Linear Regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
# Create and train model (assumes X_train/y_train hold a regression
# dataset with a continuous target, not the iris classification split above)
lr = LinearRegression()
lr.fit(X_train, y_train)
# Predict
y_pred = lr.predict(X_test)
# Metrics
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print(f"R² Score: {r2:.3f}")
print(f"MSE: {mse:.3f}")
print(f"Coefficients: {lr.coef_}")
Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
# Classification model
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
# Predictions
y_pred = log_reg.predict(X_test)
y_proba = log_reg.predict_proba(X_test)
# Metrics
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")
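The imported `roc_auc_score` applies most directly to binary problems; a self-contained sketch on scikit-learn's breast cancer dataset (model choice and `max_iter` are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
clf = LogisticRegression(max_iter=5000)  # raised so unscaled features converge
clf.fit(X_train, y_train)
# ROC AUC needs scores/probabilities for the positive class, not hard labels
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(f"ROC AUC: {auc:.3f}")
```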
Random Forest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=5,
    random_state=42
)
rf.fit(X_train, y_train)
# Feature importance
importances = rf.feature_importances_
print(f"Importances: {importances}")
Unsupervised Learning
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
# Generate sample data
X, y_true = make_blobs(n_samples=300, centers=4,
                       cluster_std=0.60, random_state=0)
# Find optimal clusters using elbow method
inertias = []
K_range = range(1, 10)
for k in K_range:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)
# Plot elbow curve
plt.figure(figsize=(8, 4))
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')
plt.grid(True)
plt.show()
# Apply K-means with optimal k
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
y_kmeans = kmeans.fit_predict(X)
# Visualize clusters
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0],
            kmeans.cluster_centers_[:, 1],
            marker='X', s=200, linewidths=3,
            color='red', label='Centroids')
plt.title('K-Means Clustering Results')
plt.legend()
plt.show()
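The silhouette score is a complementary way to validate the choice of k alongside the elbow method; a sketch on the same `make_blobs` data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Same synthetic data as the elbow-method example
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
scores = {}
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # in [-1, 1]; higher is better
    print(f"k={k}: silhouette = {scores[k]:.3f}")
```

Unlike inertia, which always decreases with k, the silhouette score peaks at a well-separated clustering, so it needs no "elbow" judgment call.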
Model Evaluation & Tuning
Proper evaluation and hyperparameter tuning are crucial for building robust models.
Cross-Validation
from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier
# 5-fold cross-validation (rf, X, y as defined in the sections above)
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(rf, X, y, cv=cv, scoring='accuracy')
print(f"CV Scores: {scores}")
print(f"Mean accuracy: {scores.mean():.3f} (+/- {scores.std()*2:.3f})")
Grid Search
from sklearn.model_selection import GridSearchCV
# Parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10]
}
# Grid search
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")
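After tuning, the refitted best model should be scored exactly once on the held-out test set; a self-contained sketch on iris, with a smaller illustrative grid than the one above:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {'n_estimators': [50, 100], 'max_depth': [None, 5]},
    cv=5, n_jobs=-1
)
grid_search.fit(X_train, y_train)
# best_estimator_ is already refit on the full training split
test_acc = grid_search.best_estimator_.score(X_test, y_test)
print(f"Test accuracy: {test_acc:.3f}")
```

Keeping the test set out of the search prevents the tuned hyperparameters from leaking information about it, so `test_acc` remains an honest generalization estimate.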
Deep Learning with TensorFlow/Keras
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
# Build model
model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(4,)),
    layers.Dropout(0.2),
    layers.Dense(32, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(3, activation='softmax')
])
# Compile model
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
# Train model
history = model.fit(
    X_train, y_train,
    epochs=50,
    batch_size=16,
    validation_split=0.2,
    verbose=0
)
# Evaluate
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(f"Test accuracy: {test_acc:.3f}")
# Plot training history
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(history.history['loss'], label='Training')
plt.plot(history.history['val_loss'], label='Validation')
plt.title('Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.subplot(1, 2, 2)
plt.plot(history.history['accuracy'], label='Training')
plt.plot(history.history['val_accuracy'], label='Validation')
plt.title('Model Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.tight_layout()
plt.show()
Data Science Applications
Fraud Detection
from sklearn.ensemble import IsolationForest
# Anomaly detection for fraud; transaction_data is a placeholder
# for your numeric transaction-feature matrix
iso_forest = IsolationForest(contamination=0.1, random_state=42)
outliers = iso_forest.fit_predict(transaction_data)  # -1 = anomaly, 1 = normal
Customer Churn
# Predict customer churn; customer_features and churn_labels are placeholders
from sklearn.ensemble import GradientBoostingClassifier
churn_model = GradientBoostingClassifier(random_state=42)
churn_model.fit(customer_features, churn_labels)
Sales Forecasting
# Time series forecasting; sales_data is a placeholder series
from statsmodels.tsa.arima.model import ARIMA
model = ARIMA(sales_data, order=(1, 1, 1))
results = model.fit()
forecast = results.forecast(steps=12)  # forecast the next 12 periods
Data Science Tools & Libraries
- Python: core programming
- Pandas: data manipulation
- Matplotlib: visualization
- scikit-learn: machine learning
- NumPy: numerical computing
- Seaborn: statistical visualization
- Statsmodels: statistical modeling
- TensorFlow: deep learning
Why Master Data Science?
- Highest demand in tech industry (Data Scientist, ML Engineer)
- Power modern applications (recommendation, personalization)
- Foundation for AI and Machine Learning
- Cross-domain applications (healthcare, finance, e-commerce)
- Rapidly evolving field with cutting-edge research
- Combine statistics, programming, and domain expertise