Git Basics for Data Science Projects

Git tracks changes to your code, notebooks, and configuration files, so you can review history, undo mistakes, and work in parallel with teammates on a data science project.

Initialize Repository & First Commit

Conceptually, Git stores the history of your project as a series of snapshots. Each commit records the state of the tracked files at a point in time and points to its parent commit (a merge commit has more than one parent), forming a directed acyclic graph. Branches are simply movable pointers to specific commits in this graph.

A typical Git workflow for a data science project:

Initialize a repository inside your project folder.
Add files and create your first commit.
Connect to a remote like GitHub or GitLab.

# Initialize repository
git init

# Track files
git add .

# First commit
git commit -m "Initial data science project setup"

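# Add remote (example: GitHub)
git remote add origin https://github.com/user/project.git

# Name the current branch "main" (some Git setups default to "master"), then push
git branch -M main
git push -u origin main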
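
To make the snapshot model described earlier concrete, the following sketch builds a throwaway repository and inspects it. The temporary directory, the file name train.py, and the demo identity are placeholders chosen for illustration and are not part of any real project.

```shell
#!/bin/sh
set -e

# Work in a throwaway directory so no real project is touched (assumption).
demo_dir=$(mktemp -d)
cd "$demo_dir"

git init -q
# Identity for this demo repository only (placeholder values).
git config user.name "Demo User"
git config user.email "demo@example.com"

# First snapshot: a root commit has no parent.
echo "print('baseline')" > train.py
git add train.py
git commit -q -m "Add baseline training script"
first_commit=$(git rev-parse HEAD)

# Second snapshot: it records the first commit as its parent.
echo "print('tuned')" >> train.py
git commit -q -am "Tune baseline"
parent_of_second=$(git rev-parse HEAD^)

# A branch is a pointer: the current branch and HEAD resolve to the same commit.
current_branch=$(git symbolic-ref --short HEAD)
branch_commit=$(git rev-parse "$current_branch")
head_commit=$(git rev-parse HEAD)

# Show the history as a graph, newest commit first.
git log --oneline --graph --decorate
```

The script reads the branch name with git symbolic-ref instead of hard-coding it, so it should behave the same whether the local default branch is called main or master.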

Branches for Experiments

Use a separate branch for each experiment, such as trying a new model or a feature engineering idea, so that the code on main keeps working while you iterate.

# Create and switch to a new branch
git checkout -b experiment-new-model

# After changes
git add notebooks/new_model.ipynb
git commit -m "Try gradient boosting model"

# Merge back to main
git checkout main
git merge experiment-new-model
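
The branch workflow above can be rehearsed safely in a throwaway repository. The sketch below is one possible way to do that: it checks that main does not move while the experiment is committed, merges the experiment, and then removes the merged branch with git branch -d. That cleanup step, the file model.py, and the demo identity are assumptions added for illustration.

```shell
#!/bin/sh
set -e

# Throwaway repository (assumption: nothing here belongs to a real project).
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

# Baseline commit on a branch named main.
echo "model = 'linear'" > model.py
git add model.py
git commit -q -m "Add baseline model"
git branch -M main
main_before=$(git rev-parse main)

# Run the experiment on its own branch.
git checkout -q -b experiment-new-model
echo "model = 'gradient_boosting'" > model.py
git commit -q -am "Try gradient boosting model"

# main has not moved while the experiment was committed.
main_during=$(git rev-parse main)

# Merge the experiment, then delete the branch pointer that is no longer needed.
git checkout -q main
git merge -q experiment-new-model
git branch -q -d experiment-new-model
main_after=$(git rev-parse main)

git log --oneline --graph --decorate
```

Deleting with -d (lowercase) is the cautious choice here: Git refuses to delete a branch whose commits have not been merged, so finished work is not dropped by accident.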
