DerivaML
DerivaML is a Python library for reproducible machine learning workflows backed by a Deriva catalog. It captures code provenance, input data versions, configuration, and outputs so experiments can be reproduced, cited, and shared.
!!! info "Upgrading from a previous release?" If you are upgrading an existing project, see Migrating from previous versions for the list of breaking changes and how to update your code.
What deriva-ml does
Four core concepts organize the library:
- Catalog — the schema + data store. An ERMrest-backed Deriva catalog with domain tables (Subject, Image, Observation, etc.) and an ML schema (Dataset, Execution, Workflow, Feature_Name).
- Dataset — a versioned, named collection of RIDs. Backs onto catalog snapshots so a named version always resolves the same rows.
- Execution — a tracked run of a Workflow. Captures inputs, outputs, environment, and status with full provenance.
- Feature — structured, provenance-linked annotations on existing rows. The unit of record for labels, predictions, and derived metadata.

When to use deriva-ml
Strong fit:
- Research labs where data governance, audit trails, and multi-annotator ground truth are first-class requirements.
- Multi-site collaborations that need citable dataset identifiers and reproducible execution records.
- Biomedical imaging, clinical records, or similar domains with structured schemas and vocabulary-controlled annotations.
Weaker fit:
- Quick single-dataset experiments where a folder of files and git is enough. deriva-ml has a non-trivial setup cost; don't pay it without the governance need.
- Online feature-serving for low-latency inference. deriva-ml is research-oriented; Feast / Tecton are better for that.
See also: Deriva-ML: A Continuous FAIRness Approach to Reproducible Machine Learning Models (Li et al., 2024, IEEE e-Science).
Starting a new project
To start a new deriva-ml project, use the deriva-ml-model-template repository. It provides:
- Hydra-zen configuration scaffolding
- CLI entry points (
deriva-ml-run,deriva-ml-run-notebook) - GitHub Actions for versioning and documentation deployment
- An example model (CIFAR-10) with config variants
These docs cover the deriva-ml library itself, for developers who already have a project and want to understand the library's concepts and APIs. Start with the User Guide for a task-oriented walkthrough, or jump to the API Reference for per-method documentation.
Further reading
The underlying FAIR-data principles are described in:
Dempsey, William, Ian Foster, Scott Fraser, and Carl Kesselman. "Sharing begins at home: how continuous and ubiquitous FAIRness can enhance research productivity and data reuse." Harvard Data Science Review 4, no. 3 (2022). PDF
The deriva-ml architecture and design decisions are described in:
Li, Zhiwei, Carl Kesselman, Mike D'Arcy, Michael Pazzani, and Benjamin Yizing Xu. "Deriva-ML: A Continuous FAIRness Approach to Reproducible Machine Learning Models." In 2024 IEEE 20th International Conference on e-Science (e-Science), pp. 1-10. IEEE, 2024. PDF