Git Basics for Data Science Projects
Git tracks changes to your code, notebooks, and configuration files, so you can review history, undo mistakes, and collaborate in a data science team without overwriting each other's work.
Initialize Repository & First Commit
Conceptually, Git stores the history of your project as a series of snapshots. Each commit records the state of the tracked files at a point in time and points to its parent commit, forming a directed acyclic graph. Branches are simply movable pointers to specific commits in this graph.
A typical Git workflow for a data science project:
- Initialize a repository inside your project folder.
- Add files and create your first commit.
- Connect to a remote like GitHub or GitLab.
# Initialize repository
git init
# Track files
git add .
# First commit
git commit -m "Initial data science project setup"
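# Name the current branch "main" (depending on version and configuration, git init may create "master" instead)
git branch -M main
# Add remote (example: GitHub)
git remote add origin https://github.com/user/project.git
# Push and set origin/main as the upstream branch
git push -u origin main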
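To make the snapshot model concrete, the following is a minimal sketch that runs in a throwaway directory and does not touch a real project. The file name config.yaml, its contents, and the demo identity are placeholders chosen for illustration. It creates two commits, checks that the second one records the first as its parent, and reads the old version of the file back out of the first snapshot.

```shell
set -e
# Work in a temporary directory so no real project is affected.
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
# Repository-local identity for the demo only (placeholder values).
git config user.name "Demo User"
git config user.email "demo@example.com"

# First snapshot.
echo "learning_rate: 0.1" > config.yaml
git add config.yaml
git commit -q -m "Initial data science project setup"
first_commit="$(git rev-parse HEAD)"

# Second snapshot; -a stages changes to files that are already tracked.
echo "learning_rate: 0.05" > config.yaml
git commit -q -am "Lower learning rate"

# HEAD^ is the parent of the current commit.
parent_of_head="$(git rev-parse HEAD^)"
commit_count="$(git rev-list --count HEAD)"

# The earlier snapshot still holds the old file content.
old_config="$(git show "$first_commit":config.yaml)"

# Print the history as a graph, newest commit first.
git log --oneline --graph
```

Here parent_of_head equals first_commit, which is the parent link that makes up the commit graph.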
Branches for Experiments
Use a separate branch for each experiment, such as trying a new model or a feature engineering idea, so that unfinished work never breaks the code on main.
# Create and switch to a new branch
git checkout -b experiment-new-model
# After changes
git add notebooks/new_model.ipynb
git commit -m "Try gradient boosting model"
# Merge back to main
git checkout main
git merge experiment-new-model
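The sketch below replays this experiment workflow in a throwaway directory to show what the merge does to the branch pointers. The file model.txt and the demo identity are placeholders. It assumes main received no new commits while the experiment was in progress, in which case Git performs a fast-forward by default: main simply moves to the experiment's latest commit. Once merged, the experiment branch can be deleted without losing any history.

```shell
set -e
# Work in a temporary directory so no real project is affected.
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
# Repository-local identity for the demo only (placeholder values).
git config user.name "Demo User"
git config user.email "demo@example.com"

# Baseline commit on main.
echo "baseline" > model.txt
git add model.txt
git commit -q -m "Baseline model"
git branch -M main

# Experiment on its own branch.
git checkout -q -b experiment-new-model
echo "gradient boosting" > model.txt
git commit -q -am "Try gradient boosting model"

# Merge back; main has not moved, so this is a fast-forward.
git checkout -q main
git merge -q experiment-new-model

# Both branch names now resolve to the same commit.
main_tip="$(git rev-parse main)"
experiment_tip="$(git rev-parse experiment-new-model)"

# -d only deletes a branch that is already fully merged.
git branch -q -d experiment-new-model
remaining_branches="$(git for-each-ref --format='%(refname:short)' refs/heads)"
merged_content="$(cat model.txt)"
```

If main had gained its own commits in the meantime, git merge would create a merge commit with two parents instead, and conflicting edits to the same lines would have to be resolved by hand.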