Git Workflow

DerivaML tracks code provenance by linking execution records to Git commits. A disciplined Git workflow keeps those links accurate and makes your results reproducible.

Core Principle

Always commit before running. DerivaML captures your Git commit hash when you run a model or notebook. If you have uncommitted changes, the recorded hash points to code that differs from what actually ran, so the provenance record no longer reflects the code that produced your results.
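
One way to enforce this is to check the working tree before launching a run. The sketch below is a minimal example and not part of DerivaML: the helper names is_clean and working_tree_is_clean are hypothetical, and it assumes git is on your PATH. It relies on git status --porcelain printing nothing when there are no uncommitted or untracked changes.

```python
import subprocess


def is_clean(porcelain_output: str) -> bool:
    """Return True when `git status --porcelain` printed nothing."""
    return porcelain_output.strip() == ""


def working_tree_is_clean(repo_dir: str = ".") -> bool:
    """Ask Git whether the repository has uncommitted or untracked changes."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,  # raise if Git fails, e.g. outside a repository
    )
    return is_clean(result.stdout)
```

A launcher script could call working_tree_is_clean() and refuse to start a real run when it returns False.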

Recommended Workflow

1. Work in Branches

Even for solo projects, use branches:

# Create a feature branch
git checkout -b feature/add-new-model

# Make changes...
# Commit changes...

# Push the branch, then open a PR from it
git push -u origin feature/add-new-model

2. Commit Before Running

# Check status
git status

# Stage and commit
git add .
git commit -m "Add new model with extended training"

# Now run
uv run deriva-ml-run model_config=extended

3. Use Meaningful Commits

# Good: Descriptive commit messages
git commit -m "Add dropout regularization to CNN model"
git commit -m "Increase training epochs from 10 to 50"
git commit -m "Fix data loading for multi-class labels"

# Bad: Vague messages
git commit -m "updates"
git commit -m "fix"
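
The good/bad distinction above can be approximated with a simple heuristic, for example inside a commit-msg hook. This is only a sketch under assumptions: the function name is_meaningful, the list of vague words, and the three-word minimum are all illustrative choices, not rules from DerivaML or Git.

```python
# Hypothetical list of subjects that say nothing about the change.
VAGUE_MESSAGES = {"update", "updates", "fix", "fixes", "wip", "changes", "misc"}


def is_meaningful(message: str, min_words: int = 3) -> bool:
    """Heuristic: reject known vague subjects and very short subject lines."""
    stripped = message.strip()
    # Only the first line (the subject) is checked.
    subject = stripped.splitlines()[0] if stripped else ""
    normalized = subject.lower().rstrip(".!")
    if normalized in VAGUE_MESSAGES:
        return False
    return len(subject.split()) >= min_words
```

A heuristic like this catches the obvious cases only; a message can pass it and still fail to describe the change.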

4. Tag Significant Runs

Before important runs, create a version tag:

# Create a patch version
uv run bump-version patch

# Or for significant changes
uv run bump-version minor

Debugging Workflow

During development and debugging, use dry runs to avoid creating execution records:

# Dry run - downloads data but doesn't create records
uv run deriva-ml-run dry_run=true

# Make changes based on results
# ...

# Once satisfied, commit and do a real run
git add .
git commit -m "Fix model architecture"
uv run deriva-ml-run
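
The same rule can be built into a small launcher: fall back to a dry run whenever the working tree is not clean. The sketch below only assembles the command line from the arguments shown in this guide (dry_run=true, model_config=extended). The function name build_run_command is hypothetical, and how you determine tree_is_clean is left to you.

```python
from typing import Sequence


def build_run_command(tree_is_clean: bool, overrides: Sequence[str] = ()) -> list[str]:
    """Build the deriva-ml-run command, forcing a dry run on a dirty tree."""
    command = ["uv", "run", "deriva-ml-run", *overrides]
    if not tree_is_clean:
        # Uncommitted changes: avoid creating an execution record.
        command.append("dry_run=true")
    return command
```

The returned list can be passed to subprocess.run. Keeping the function free of side effects makes it easy to test without Git or DerivaML installed.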

Branch Strategy

main
 │
 ├── feature/new-model
 │    ├── commit: "Add model skeleton"
 │    ├── commit: "Implement training loop"
 │    └── commit: "Add validation metrics"
 │
 └── experiment/hyperparameter-sweep
      ├── commit: "Set up sweep configs"
      └── commit: "Run sweep experiments"

Pull Request Guidelines

One feature per PR: Keep changes focused
Run tests before merging: Ensure code works
Squash if needed: Clean up messy history
Delete branches after merge: Keep repo clean

Gitignore Best Practices

The template .gitignore excludes:

# Environment
.venv/
__pycache__/

# Outputs
outputs/
*.pyc

# Data (stored in DerivaML, not Git)
data/
*.csv
*.pkl

# Secrets
.env
credentials.json

Emergency: Uncommitted Changes

If you accidentally ran with uncommitted changes:

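The execution record still exists, but its provenance is imperfect: the recorded commit hash does not include your uncommitted changes
Commit your changes immediately, before editing anything else
Note the execution RID and the new commit hash
Add a comment to the execution record linking the two, if needed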
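
The note-taking step can be scripted so the RID and hash are recorded consistently. The sketch below is an assumption-laden example: current_commit and provenance_note are hypothetical helpers, the RID is whatever your execution record shows, and attaching the text to the record is left out because it depends on how you edit records in your catalog.

```python
import subprocess


def current_commit(repo_dir: str = ".") -> str:
    """Return the full hash of the commit currently checked out."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def provenance_note(execution_rid: str, commit_hash: str) -> str:
    """Format a comment linking an execution to the commit made after it."""
    return (
        f"Execution {execution_rid} ran with uncommitted changes; "
        f"the code was committed afterwards as {commit_hash}."
    )
```

Run current_commit() right after committing, then pass its result and the execution RID to provenance_note().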

Working with Large Files

Don't commit large files to Git:

Model weights: Upload to DerivaML as assets
Datasets: Store in DerivaML catalogs
Large outputs: Upload via execution outputs

Use DerivaML to track these instead:

import torch

# `execution` and `model` come from your run setup
# Register large output for upload
model_path = execution.asset_file_path("Model", "weights.pt")
torch.save(model.state_dict(), model_path)
# Upload the registered outputs
execution.upload_execution_outputs()
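
Before staging, it can also help to scan the working tree for large files that were not caught by .gitignore. The following is a minimal sketch, not a DerivaML feature: find_large_files is a hypothetical helper, the 50 MB default is an arbitrary threshold, and the skipped directories are assumptions about a typical project layout.

```python
from pathlib import Path


def find_large_files(root: str, limit_bytes: int = 50 * 1024 * 1024) -> list[Path]:
    """Return files under root larger than limit_bytes, skipping .git and .venv."""
    skip = {".git", ".venv"}
    large = []
    for path in Path(root).rglob("*"):
        # Ignore anything inside a skipped directory.
        if skip.intersection(path.relative_to(root).parts):
            continue
        if path.is_file() and path.stat().st_size > limit_bytes:
            large.append(path)
    return sorted(large)
```

Any file this reports is a candidate for uploading through DerivaML rather than committing.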