Data Science Lifecycle (CRISP‑DM & Modern Practice)
A successful Data Science project is not only about building a model. It follows a structured lifecycle from understanding the business problem to deploying and monitoring the solution in production.
What is the Data Science Lifecycle?
The Data Science lifecycle describes the standard steps a data project goes through. The best‑known framework is CRISP‑DM (Cross‑Industry Standard Process for Data Mining), which remains widely used and is often extended with modern MLOps concepts.
- Provides a repeatable, reliable process for data projects.
- Helps teams avoid “model in a notebook” that never reaches production.
- Makes communication with business stakeholders more structured.
CRISP‑DM in 6 Phases
1. Business Understanding
Translate vague ideas into clear, measurable objectives.
- Define the problem: “Reduce churn by 10%” not just “use AI”.
- Identify stakeholders and success metrics (KPIs).
- List constraints: data availability, time, budget, regulations.
2. Data Understanding
Collect and explore the available data sources.
- Connect to databases, data lakes, files, APIs.
- Perform initial EDA (distributions, missing values, anomalies).
- Check data quality and potential data leakage issues.
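An initial EDA pass like the one above can be sketched in a few lines of pandas. This is a minimal sketch on a toy table; the column names (`age`, `monthly_spend`, `churned`) are purely illustrative.

```python
import pandas as pd

# Hypothetical customer table; columns are illustrative only.
df = pd.DataFrame({
    "age": [25, 41, None, 33, 58],
    "monthly_spend": [49.9, 120.0, 80.5, None, 15.0],
    "churned": [0, 1, 0, 0, 1],
})

# Missing values per column
missing = df.isna().sum()

# Basic distribution summary (count, mean, std, quartiles, ...)
summary = df.describe()

# Simple anomaly check: values outside 1.5 * IQR
q1, q3 = df["monthly_spend"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["monthly_spend"] < q1 - 1.5 * iqr)
              | (df["monthly_spend"] > q3 + 1.5 * iqr)]
```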
3. Data Preparation
Turn raw data into clean, modeling‑ready datasets.
- Handle missing values, outliers and inconsistent formats.
- Feature engineering (aggregations, encodings, scaling).
- Split data into train / validation / test sets.
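The preparation steps above (imputation, encoding, splitting) can be sketched with pandas and scikit-learn. This is a toy example; the table, the median-imputation strategy, and the 60/20/20 split ratios are assumptions, not prescriptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy customer table; column names are illustrative.
df = pd.DataFrame({
    "age": [25, 41, None, 33, 58, 29, 47, None, 36, 52],
    "plan": ["basic", "pro", "basic", "pro", "basic",
             "pro", "basic", "pro", "basic", "pro"],
    "churned": [0, 1, 0, 0, 1, 0, 1, 0, 0, 1],
})

# Handle missing numeric values with the median (a simple baseline strategy)
df["age"] = df["age"].fillna(df["age"].median())

# One-hot encode the categorical column
df = pd.get_dummies(df, columns=["plan"])

# 60 / 20 / 20 split into train, validation and test sets
train, temp = train_test_split(df, test_size=0.4, random_state=42)
val, test = train_test_split(temp, test_size=0.5, random_state=42)
```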
4. Modeling
Train, validate and select algorithms.
- Try baseline models first (logistic/linear regression, trees).
- Tune hyperparameters (grid search, random search, Bayesian optimization).
- Use cross‑validation to avoid overfitting.
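Combining these three points, a minimal scikit-learn sketch: a logistic-regression baseline, a small hyperparameter grid, and 5-fold cross-validation. The synthetic dataset and the grid values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary-classification data standing in for a real table
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# Baseline model plus a small hyperparameter grid, scored with 5-fold CV
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=5,
    scoring="roc_auc",
)
grid.fit(X, y)

best_model = grid.best_estimator_
```

Cross-validated scores give a more honest estimate of generalization than a single train/test split, which is why `GridSearchCV` selects `C` by CV score rather than training fit.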
5. Evaluation
Check if the model meets business and technical criteria.
- Use correct metrics (ROC‑AUC, F1, RMSE, etc.).
- Perform error analysis and fairness checks.
- Present results with clear visualizations and recommendations.
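Choosing the right metric matters: a sketch with hypothetical held-out labels and predicted probabilities, comparing the threshold-free ROC-AUC against F1 at a fixed 0.5 cutoff. The numbers are made up for illustration.

```python
from sklearn.metrics import roc_auc_score, f1_score

# Hypothetical held-out labels and predicted probabilities
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_prob = [0.10, 0.65, 0.80, 0.70, 0.90, 0.30, 0.60, 0.20]
y_pred = [int(p >= 0.5) for p in y_prob]

auc = roc_auc_score(y_true, y_prob)   # ranking quality, threshold-free
f1 = f1_score(y_true, y_pred)         # balance of precision and recall at 0.5
```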
6. Deployment & Monitoring
Put the solution into real‑world use and monitor it.
- Deploy as batch jobs, APIs, or embedded models.
- Monitor data drift, model performance and business KPIs.
- Plan regular retraining and continuous improvement.
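One common way to monitor data drift is the Population Stability Index (PSI) between a reference sample and live data. A minimal NumPy sketch, assuming a single numeric feature and the usual rule of thumb that PSI below roughly 0.1 means little drift:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference and a live sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_counts, _ = np.histogram(expected, bins=edges)
    a_counts, _ = np.histogram(actual, bins=edges)
    # Clip to avoid log(0) for empty bins
    e_pct = np.clip(e_counts / e_counts.sum(), 1e-6, None)
    a_pct = np.clip(a_counts / a_counts.sum(), 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
reference = rng.normal(0, 1, 5000)   # training-time distribution
same = rng.normal(0, 1, 5000)        # live data, no drift
shifted = rng.normal(1, 1, 5000)     # live data, mean shifted by 1
```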
Modern Data & ML Pipeline (MLOps View)
In production environments, the lifecycle is implemented as an automated pipeline. Tools like Airflow, Kubeflow, MLflow, and cloud platforms help orchestrate the steps.
Raw Data ──► Ingestion (batch/stream) ──► Data Lake / Warehouse
Data Lake / Warehouse ──► Feature Engineering ──► Feature Store
Feature Store ──► Model Training ──► Model Registry
Model Registry ──► Deployment (API, batch, edge)
Deployment ──► Monitoring & Alerts (data drift, model drift, KPIs)
Monitoring & Alerts ──► Model Training (retraining loop)
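The stages above can be sketched as plain Python functions chained together; in production each step would typically be an Airflow or Kubeflow task rather than a direct call. All function names and the toy data are illustrative assumptions.

```python
# Minimal sketch of the pipeline stages as plain functions (illustrative only).

def ingest():
    # Stand-in for reading raw events from a batch or stream source
    return [{"user": "u1", "spend": 50.0}, {"user": "u2", "spend": 120.0}]

def build_features(rows):
    # Feature engineering: simple scaling of spend
    return [{"user": r["user"], "spend_norm": r["spend"] / 100.0} for r in rows]

def train(features):
    # Stand-in for model training: a trivial "model" (a mean threshold)
    return {"threshold": sum(f["spend_norm"] for f in features) / len(features)}

def monitor(model, features):
    # Alert if the live feature mean drifts far from the trained threshold
    live_mean = sum(f["spend_norm"] for f in features) / len(features)
    return abs(live_mean - model["threshold"]) > 0.5

rows = ingest()
features = build_features(rows)
model = train(features)
alert = monitor(model, features)  # no drift when scoring the training data itself
```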
Best Practices Across the Lifecycle
Reproducibility
- Use version control (Git) for code & configs.
- Fix random seeds for experiments.
- Track datasets, models and metrics (MLflow, DVC).
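Fixing random seeds can be wrapped in a small helper; with the same seed, two runs produce identical results. A minimal sketch covering the Python and NumPy generators (frameworks like PyTorch have their own seed calls):

```python
import random
import numpy as np

def set_seed(seed: int = 42) -> None:
    """Fix common sources of randomness so experiment runs are repeatable."""
    random.seed(seed)
    np.random.seed(seed)

set_seed(42)
a = np.random.rand(3)
set_seed(42)
b = np.random.rand(3)  # identical to `a` because the seed was reset
```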
Collaboration
- Document assumptions and decisions.
- Share notebooks and dashboards with the team.
- Align frequently with business stakeholders.
Responsibility
- Check bias and fairness of models.
- Respect privacy and data regulations (GDPR, etc.).
- Monitor models in production and roll back if one misbehaves.