DerivaML

DerivaML is a Python library for reproducible machine learning workflows backed by a Deriva catalog. It captures code provenance, input data versions, configuration, and outputs so experiments can be reproduced, cited, and shared.

Upgrading from a previous release? If you are upgrading an existing project, see Migrating from previous versions for the list of breaking changes and how to update your code.

What deriva-ml does

Four core concepts organize the library:

  • Catalog: the schema and data store. An ERMrest-backed Deriva catalog with domain tables (Subject, Image, Observation, etc.) and an ML schema (Dataset, Execution, Workflow, Feature_Name).
  • Dataset: a versioned, named collection of RIDs. Each version is backed by a catalog snapshot, so a named version always resolves to the same rows.
  • Execution: a tracked run of a Workflow. Captures inputs, outputs, environment, and status with full provenance.
  • Feature: a structured, provenance-linked annotation on an existing row. The unit of record for labels, predictions, and derived metadata.
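
The four concepts can be pictured with a toy in-memory model. The sketch below is not the deriva-ml API: every class and field name (ToyCatalog, ToyDataset, ToyExecution, ToyFeature) and the RID strings are hypothetical. It is meant only to show, under those assumptions, how pinning a dataset version to a catalog snapshot lets that version resolve to the same rows after the live catalog changes, and how a feature value can point back to the execution that produced it.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Iterable, List, Tuple


@dataclass
class ToyCatalog:
    """Hypothetical stand-in for a catalog: live rows keyed by RID, plus snapshots."""
    rows: Dict[str, dict] = field(default_factory=dict)
    snapshots: List[Dict[str, dict]] = field(default_factory=list)

    def snapshot(self) -> int:
        # Copy the current rows so later edits cannot change this snapshot.
        self.snapshots.append({rid: dict(row) for rid, row in self.rows.items()})
        return len(self.snapshots) - 1


@dataclass
class ToyDataset:
    """A named collection of RIDs; each version pins (member RIDs, snapshot id)."""
    name: str
    versions: Dict[str, Tuple[FrozenSet[str], int]] = field(default_factory=dict)

    def release(self, version: str, rids: Iterable[str], catalog: ToyCatalog) -> None:
        # Record the members together with a fresh snapshot of the catalog.
        self.versions[version] = (frozenset(rids), catalog.snapshot())

    def resolve(self, version: str, catalog: ToyCatalog) -> Dict[str, dict]:
        # Read member rows from the pinned snapshot, not from the live rows.
        rids, snapshot_id = self.versions[version]
        pinned = catalog.snapshots[snapshot_id]
        return {rid: pinned[rid] for rid in sorted(rids)}


@dataclass
class ToyExecution:
    """A tracked run: which workflow ran, on which dataset versions, with what result."""
    workflow: str
    inputs: Dict[str, str]  # dataset name -> dataset version
    outputs: List[str] = field(default_factory=list)
    status: str = "pending"


@dataclass
class ToyFeature:
    """An annotation on an existing row, linked to the execution that produced it."""
    target_rid: str
    name: str
    value: str
    execution: ToyExecution


catalog = ToyCatalog()
catalog.rows["1-A"] = {"label": "cat"}  # made-up RIDs and values
catalog.rows["1-B"] = {"label": "dog"}

ds = ToyDataset("training")
ds.release("1.0.0", ["1-A", "1-B"], catalog)

catalog.rows["1-A"]["label"] = "tiger"  # later edit to the live catalog
ds.release("1.1.0", ["1-A", "1-B"], catalog)

v1 = ds.resolve("1.0.0", catalog)
v2 = ds.resolve("1.1.0", catalog)

run = ToyExecution(workflow="train-classifier", inputs={"training": "1.0.0"})
run.outputs.append("model.pt")
run.status = "completed"

prediction = ToyFeature("1-A", "Predicted_Label", v1["1-A"]["label"], run)
```

In this sketch, version 1.0.0 still resolves row 1-A to its original label after the live row is edited, while version 1.1.0 sees the edit. The feature value carries a reference to the execution, and through it to the exact dataset version used as input.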

[Figure: ERD (entity-relationship diagram)]

When to use deriva-ml

Strong fit:

  • Research labs where data governance, audit trails, and multi-annotator ground truth are first-class requirements.
  • Multi-site collaborations that need citable dataset identifiers and reproducible execution records.
  • Biomedical imaging, clinical records, or similar domains with structured schemas and vocabulary-controlled annotations.

Weaker fit:

  • Quick single-dataset experiments where a folder of files and git is enough. deriva-ml has a non-trivial setup cost; don't pay it without the governance need.
  • Online feature-serving for low-latency inference. deriva-ml is research-oriented; a feature store such as Feast or Tecton is a better fit for that.

See also: Deriva-ML: A Continuous FAIRness Approach to Reproducible Machine Learning Models (Li et al., 2024, IEEE e-Science).

Starting a new project

To start a new deriva-ml project, use the deriva-ml-model-template repository. It provides:

  • Hydra-zen configuration scaffolding
  • CLI entry points (deriva-ml-run, deriva-ml-run-notebook)
  • GitHub Actions for versioning and documentation deployment
  • An example model (CIFAR-10) with config variants

These docs cover the deriva-ml library itself, for developers who already have a project and want to understand the library's concepts and APIs. Start with the User Guide for a task-oriented walkthrough, or jump to the API Reference for per-method documentation.

Further reading

The underlying FAIR-data principles are described in:

Dempsey, William, Ian Foster, Scott Fraser, and Carl Kesselman. "Sharing begins at home: how continuous and ubiquitous FAIRness can enhance research productivity and data reuse." Harvard Data Science Review 4, no. 3 (2022).

The deriva-ml architecture and design decisions are described in:

Li, Zhiwei, Carl Kesselman, Mike D'Arcy, Michael Pazzani, and Benjamin Yizing Xu. "Deriva-ML: A Continuous FAIRness Approach to Reproducible Machine Learning Models." In 2024 IEEE 20th International Conference on e-Science (e-Science), pp. 1-10. IEEE, 2024.