Architecture Overview
DerivaML is a Python library for running end-to-end machine learning experiments against data and metadata stored on the Deriva platform. It covers the full workflow: managing data catalogs, preprocessing, executing models, and documenting results.
Key Components
1. Data Catalog
The catalog includes both a domain schema and a standard ML schema.

- Domain schema — Contains the data collected or generated by domain-specific experiments or systems (e.g., images, clinical records).
- ML schema (deriva-ml): Captures details of the ML development process in the tables below.

  | Table | Purpose |
  | --- | --- |
  | Dataset | A data collection identified for training, validation, or testing |
  | Workflow | A specific sequence of computational steps or human interactions |
  | Execution | An instance of a workflow run at a specific time |
  | Execution Asset | An output file produced by an execution |
  | Execution Metadata | Environment metadata files referencing an execution |
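
To make the relationships among the ML schema tables concrete, here is a minimal sketch that models them as plain Python dataclasses. It is an illustration only: the class and field names (`purpose`, `steps`, `started_at`, `filename`) are assumptions chosen for readability, not the actual column names in the deriva-ml schema, and the sketch leaves out how datasets are linked to executions because the overview does not describe that.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Dataset:
    name: str
    purpose: str  # e.g. "training", "validation", or "testing"


@dataclass
class Workflow:
    name: str
    steps: list[str]  # ordered computational or human steps


@dataclass
class Execution:
    workflow: Workflow    # the workflow this run is an instance of
    started_at: datetime  # when this particular run happened


@dataclass
class ExecutionAsset:
    execution: Execution  # the run that produced this output file
    filename: str


@dataclass
class ExecutionMetadata:
    execution: Execution  # the run this environment file describes
    filename: str


# Example: one workflow, one run, one output, one environment record.
workflow = Workflow("train-classifier", ["preprocess", "train", "evaluate"])
run = Execution(workflow, datetime(2024, 1, 1, tzinfo=timezone.utc))
model_file = ExecutionAsset(run, "model.pt")
env_file = ExecutionMetadata(run, "environment.json")

# Provenance can be followed from an output file back to its workflow.
print(model_file.execution.workflow.name)
```

The point of the sketch is the direction of the references: assets and metadata point at an execution, and an execution points at its workflow, so any output can be traced back to the steps that produced it.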
2. DerivaML Library
The core library provides:
- Execution initialization and lifecycle management
- ML execution context manager with automatic status tracking
- Output upload to the catalog with provenance linking
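
The idea behind a context manager with automatic status tracking can be sketched in a few lines of plain Python. This is not the DerivaML API: `TrackedExecution` and its `status` and `error` attributes are hypothetical names, and the status is kept in memory here rather than written to a catalog.

```python
from typing import Optional


class TrackedExecution:
    """Hypothetical stand-in for an execution record (not the DerivaML API)."""

    def __init__(self, description: str) -> None:
        self.description = description
        self.status = "created"
        self.error: Optional[str] = None

    def __enter__(self) -> "TrackedExecution":
        # Entering the block marks the run as started.
        self.status = "running"
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        if exc_type is None:
            self.status = "completed"
        else:
            # Record the failure, then let the exception propagate.
            self.status = "failed"
            self.error = str(exc)
        return False


# A run that finishes normally ends up "completed".
with TrackedExecution("train model") as ok_run:
    pass

# A run that raises ends up "failed", and the error is still visible
# to the caller because __exit__ returns False.
bad_run = TrackedExecution("train model with bad input")
try:
    with bad_run:
        raise ValueError("missing input file")
except ValueError:
    pass

print(ok_run.status, bad_run.status, bad_run.error)
```

Because the status is updated in `__enter__` and `__exit__`, the model code inside the `with` block does not need any bookkeeping of its own.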
3. Domain-Specific Libraries
Projects typically create a derived class from DerivaML that adds
domain-specific functionality (e.g., EyeAI for retinal imaging). This
pattern keeps catalog-specific code separate from model implementations.
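
A rough sketch of this subclassing pattern follows. It uses a placeholder base class so that it runs on its own. `CatalogClient`, `RetinaCatalog`, `rows`, `gradable_images`, and the `Image` table with its `quality` column are all hypothetical and stand in for DerivaML and a domain library such as EyeAI; none of them is taken from the real libraries.

```python
class CatalogClient:
    """Placeholder for the DerivaML base class; holds tables in memory."""

    def __init__(self, tables: dict) -> None:
        self._tables = tables

    def rows(self, table: str) -> list:
        # Return a copy so callers cannot mutate the stored table.
        return list(self._tables.get(table, []))


class RetinaCatalog(CatalogClient):
    """Hypothetical domain library: knows the catalog's table layout."""

    def gradable_images(self, min_quality: float) -> list:
        return [r for r in self.rows("Image") if r["quality"] >= min_quality]


def mean_quality(images: list) -> float:
    """Model-side code: takes plain records and knows nothing of the catalog."""
    return sum(r["quality"] for r in images) / len(images)


catalog = RetinaCatalog(
    {
        "Image": [
            {"id": 1, "quality": 0.5},
            {"id": 2, "quality": 1.0},
            {"id": 3, "quality": 0.0},
        ]
    }
)
selected = catalog.gradable_images(min_quality=0.5)
print([r["id"] for r in selected], mean_quality(selected))
```

Only `RetinaCatalog` refers to table and column names, so a change in the catalog layout would touch the domain library and leave `mean_quality` unchanged.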
Next Steps
- Quick Start — Get running in 5 minutes
- Execution Lifecycle — How executions work
- Configuration — Hydra-zen configuration system
- Datasets — Working with datasets