Coding Guidelines

This document captures operational practices and coding standards for DerivaML projects. These guidelines ensure reproducibility and maintainability.

Configuration

  • Each model should live in its own repository following this template
  • Use uv to manage all dependencies
  • The generated uv.lock must be committed to the repository
  • It should be possible to rebuild the environment from scratch using uv.lock

Git Workflow

  • Always work in a Git branch and create pull requests, even for solo projects
  • Rebase your branch regularly to keep it up to date with main
  • You MUST commit changes before running models - this maximizes DerivaML's provenance tracking
  • No change is too small to properly track
  • During debugging, use dry_run=true to skip catalog writes
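
The commit-before-run rule can be checked automatically before a model starts. The following is a minimal sketch of one way to do that, not part of DerivaML: the helper names `parse_porcelain` and `uncommitted_paths` are hypothetical, and the code assumes `git` is available on the PATH.

```python
import subprocess


def parse_porcelain(output: str) -> list[str]:
    """Return the paths listed in `git status --porcelain` output."""
    # Each non-empty line is "XY <path>": two status characters, a space, then the path.
    return [line[3:] for line in output.splitlines() if line.strip()]


def uncommitted_paths(repo_dir: str = ".") -> list[str]:
    """Ask Git which files in repo_dir are modified, staged, or untracked.

    Hypothetical helper for illustration; raises CalledProcessError if
    repo_dir is not inside a Git repository.
    """
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_porcelain(result.stdout)
```

A training script could call `uncommitted_paths()` at startup and refuse to continue if the returned list is not empty.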

Coding Standards

Documentation

Document functions with Google-style docstrings:

def my_function(param1: str, param2: int) -> bool:
    """Short description of the function.

    Longer description if needed.

    Args:
        param1: Description of param1.
        param2: Description of param2.

    Returns:
        Description of return value.

    Raises:
        ValueError: When something is wrong.
    """

Type Hints

Use type hints wherever possible:

from typing import Optional

from deriva_ml import DerivaML

# Built-in generics (list[...], dict[...]) are used consistently for containers.
def process_data(
    items: list[str],
    threshold: float = 0.5,
    ml_instance: Optional[DerivaML] = None,
) -> dict[str, float]:
    ...

Linting and Formatting

Run before committing:

# Check for issues
uv run ruff check src/

# Auto-fix issues
uv run ruff check src/ --fix

# Format code
uv run ruff format src/

Signature Verification

Always check function/class signatures before modifying calls:

import inspect
from deriva_ml import DerivaML

# Check before using
print(inspect.signature(DerivaML.__init__))

Or check the source code directly.
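
Beyond printing a signature, the standard library can also test whether a planned call fits it. The sketch below uses `inspect.Signature.bind`, which raises `TypeError` when the arguments do not match. The helper `call_matches` and the local `process_data` function are hypothetical names used only for illustration.

```python
import inspect


def call_matches(func, *args, **kwargs) -> bool:
    """Return True if func could be called with these arguments."""
    try:
        # bind() checks names and counts only; it does not call func.
        inspect.signature(func).bind(*args, **kwargs)
    except TypeError:
        return False
    return True


def process_data(items: list[str], threshold: float = 0.5) -> dict[str, float]:
    """Stand-in function used to demonstrate the check."""
    return {item: threshold for item in items}
```

For example, `call_matches(process_data, ["a"], threshold=0.9)` returns `True`, while `call_matches(process_data, ["a"], cutoff=0.9)` returns `False` because `cutoff` is not a parameter.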

Versioning and Releases

  • Use bump-version to create version tags before significant runs
  • DerivaML uses semantic versioning:
      • Major: Breaking changes
      • Minor: New features
      • Patch: Bug fixes

Pass exactly one of major, minor, or patch:

uv run bump-version major|minor|patch
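
To make the three levels concrete, the sketch below shows how each one changes a `MAJOR.MINOR.PATCH` version string under semantic versioning. It is an illustration of the numbering rule only, not the implementation of bump-version; the function name `bump` is hypothetical.

```python
def bump(version: str, part: str) -> str:
    """Return the next semantic version after bumping the given part."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        # Breaking change: reset minor and patch.
        return f"{major + 1}.0.0"
    if part == "minor":
        # New feature: reset patch.
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        # Bug fix: only the last number changes.
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"part must be 'major', 'minor', or 'patch', got {part!r}")
```

Starting from `1.4.2`, a major bump gives `2.0.0`, a minor bump gives `1.5.0`, and a patch bump gives `1.4.3`.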

Check current version:

uv run python -m setuptools_scm

Notebooks

  • Never commit notebooks with output cells
  • Install and enable nbstripout: uv run nbstripout --install
  • Notebooks should focus on a single task (analysis/visualization)
  • Prefer scripts for training models
  • Notebooks must run start-to-finish without intervention
  • Use deriva-ml-run-notebook for reproducible execution
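
The no-output rule can also be verified directly, since a notebook file is plain JSON. The sketch below lists code cells that still carry outputs or an execution count. The function name `cells_with_output` is hypothetical, and treating a non-null `execution_count` as "not stripped" is an assumption about what a cleaned notebook looks like.

```python
import json


def cells_with_output(notebook_json: str) -> list[int]:
    """Return indexes of code cells that still hold outputs or execution counts."""
    notebook = json.loads(notebook_json)
    return [
        index
        for index, cell in enumerate(notebook.get("cells", []))
        if cell.get("cell_type") == "code"
        and (cell.get("outputs") or cell.get("execution_count") is not None)
    ]
```

Reading a notebook with `Path("analysis.ipynb").read_text()` and passing the text to this function returns an empty list when the notebook is safe to commit; the file name here is only an example.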

Executions and Experiments

  • Always run code from hydra-zen configuration files
  • Commit code before running
  • Use dry_run=true during debugging
  • Remove dry_run, tag a version, and run to completion when ready
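
The effect of a dry-run flag can be pictured with a small stand-alone sketch. Everything in it is hypothetical: `RunConfig`, `FakeCatalog`, and `run_experiment` are illustrative names, the catalog is an in-memory list, and none of this is the DerivaML or hydra-zen API. It only shows the pattern of computing results as usual while skipping the write when `dry_run` is set.

```python
from dataclasses import dataclass, field


@dataclass
class RunConfig:
    """Hypothetical run settings, standing in for a real configuration."""

    epochs: int = 1
    dry_run: bool = False


@dataclass
class FakeCatalog:
    """In-memory stand-in for a catalog, used only for this illustration."""

    records: list[dict] = field(default_factory=list)


def run_experiment(config: RunConfig, catalog: FakeCatalog) -> dict:
    """Compute a placeholder result and record it unless dry_run is set."""
    result = {"epochs": config.epochs, "loss": 1.0 / config.epochs}
    if not config.dry_run:
        # Only a real run leaves a record behind.
        catalog.records.append(result)
    return result
```

With `dry_run=True` the function still returns a result for inspection, but `catalog.records` stays empty.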

Data Management

  • Do not commit data files to Git
  • Store all data in DerivaML catalogs
  • Use datasets and assets for input data
  • Use execution outputs for results

Extensibility

DerivaML is designed to be extended via inheritance:

from deriva_ml import DerivaML

class MyDomainML(DerivaML):
    """Domain-specific extensions for my project."""

    def my_custom_method(self):
        """Add domain-specific functionality."""
        ...

Instantiate your custom class in scripts and notebooks for domain-specific functionality.

File Organization

src/
├─ configs/           # All configuration files
│  ├─ __init__.py
│  ├─ base.py
│  ├─ deriva.py
│  ├─ datasets.py
│  ├─ assets.py
│  ├─ workflow.py
│  └─ cifar10_cnn.py
├─ models/            # Model implementations
│  └─ cifar10_cnn.py
└─ scripts/           # Data loading scripts
   └─ load_cifar10.py

Naming Conventions

Item                  Convention      Example
Configuration files   snake_case.py   cifar10_cnn.py
Configuration names   snake_case      cifar10_extended
Model functions       snake_case      train_model
Classes               PascalCase      MyModelConfig
Constants             UPPER_CASE      DEFAULT_EPOCHS
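
The three conventions can be told apart mechanically. The sketch below classifies a bare name with regular expressions. The patterns and the function name `naming_style` are assumptions made for illustration, not rules enforced by DerivaML or ruff.

```python
import re

# Lowercase words joined by single underscores, e.g. train_model.
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")
# Uppercase words joined by single underscores, e.g. DEFAULT_EPOCHS.
UPPER_CASE = re.compile(r"[A-Z][A-Z0-9]*(_[A-Z0-9]+)*")
# Starts with a capital and contains no underscores, e.g. MyModelConfig.
PASCAL_CASE = re.compile(r"[A-Z][A-Za-z0-9]*")


def naming_style(name: str) -> str:
    """Return 'snake_case', 'UPPER_CASE', 'PascalCase', or 'unknown'."""
    if SNAKE_CASE.fullmatch(name):
        return "snake_case"
    # Checked before PascalCase so that all-caps names are not misclassified.
    if UPPER_CASE.fullmatch(name):
        return "UPPER_CASE"
    if PASCAL_CASE.fullmatch(name):
        return "PascalCase"
    return "unknown"
```

For file names, strip the `.py` suffix before checking, for example `naming_style("cifar10_cnn")`.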