Architecture Overview

DerivaML is a Python library for running end-to-end machine learning experiments against data and metadata stored on the Deriva platform. It covers the full workflow: managing data catalogs, preprocessing, executing models, and documenting results.

Key Components

1. Data Catalog

The catalog includes both a domain schema and a standard ML schema.

Figure: entity-relationship diagram (ERD) of the catalog.

Domain schema — Contains the data collected or generated by domain-specific experiments or systems (e.g., images, clinical records).

ML schema (deriva-ml) — Captures details of the ML development process:

Table	Purpose
Dataset	A data collection identified for training, validation, or testing
Workflow	A specific sequence of computational steps or human interactions
Execution	An instance of a workflow run at a specific time
Execution Asset	An output file produced by an execution
Execution Metadata	A file recording environment metadata for an execution
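
To make the relationships among these tables easier to picture, here is a minimal sketch that models them as plain Python dataclasses. It is not the catalog's actual schema: every field name (`role`, `steps`, `started_at`, `filename`, and so on) is an assumption chosen for illustration, as is the idea that an execution holds direct lists of its datasets, assets, and metadata files.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Dataset:
    name: str
    role: str  # assumed values: "training", "validation", or "testing"


@dataclass
class Workflow:
    name: str
    steps: list[str]  # ordered computational or human steps


@dataclass
class ExecutionAsset:
    filename: str  # an output file produced by an execution


@dataclass
class ExecutionMetadata:
    filename: str  # a file describing the execution environment


@dataclass
class Execution:
    """One run of a workflow at a specific time (hypothetical layout)."""
    workflow: Workflow
    started_at: datetime
    datasets: list[Dataset] = field(default_factory=list)
    assets: list[ExecutionAsset] = field(default_factory=list)
    metadata: list[ExecutionMetadata] = field(default_factory=list)


# Illustrative values only; none of these names come from a real catalog.
workflow = Workflow(name="train-classifier", steps=["preprocess", "train", "evaluate"])
run = Execution(workflow=workflow, started_at=datetime.now(timezone.utc))
run.datasets.append(Dataset(name="fundus-images-v1", role="training"))
run.assets.append(ExecutionAsset(filename="model.pt"))
run.metadata.append(ExecutionMetadata(filename="environment.json"))
```

The point of the sketch is the direction of the links: assets and metadata files hang off an execution, and an execution points back to the workflow it ran.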

2. DerivaML Library

The core library provides:

Execution initialization and lifecycle management
ML execution context manager with automatic status tracking
Output upload to the catalog with provenance linking
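
As a rough illustration of what "automatic status tracking" in a context manager can look like, the sketch below uses only the standard library. `ExecutionRecord`, `tracked_execution`, and the status strings are hypothetical stand-ins, not DerivaML's API; the real library may name and structure these differently.

```python
from contextlib import contextmanager


class ExecutionRecord:
    """Hypothetical stand-in for an Execution row; not the DerivaML class."""

    def __init__(self, workflow_name):
        self.workflow_name = workflow_name
        self.status = "Pending"
        self.history = ["Pending"]

    def set_status(self, status):
        self.status = status
        self.history.append(status)


@contextmanager
def tracked_execution(record):
    """Mark the record Running on entry, then Completed or Failed on exit."""
    record.set_status("Running")
    try:
        yield record
    except Exception:
        record.set_status("Failed")
        raise  # re-raise so the caller still sees the original error
    else:
        record.set_status("Completed")


# Successful run: the status advances without any explicit bookkeeping.
ok = ExecutionRecord("train-classifier")
with tracked_execution(ok):
    pass  # model training would go here

# Failing run: the error propagates, but the status is recorded first.
failed = ExecutionRecord("train-classifier")
try:
    with tracked_execution(failed):
        raise RuntimeError("simulated training error")
except RuntimeError:
    pass
```

Putting the status updates in the context manager means model code cannot forget to record a failure, which is the main benefit of the pattern.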

3. Domain-Specific Libraries

Projects typically subclass DerivaML to add domain-specific functionality (e.g., EyeAI for retinal imaging). This pattern keeps catalog-specific code separate from model implementations.
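
The derived-class pattern can be sketched as follows. To keep the example self-contained, `CatalogClient` stands in for the DerivaML base class, and its constructor arguments are assumptions rather than the library's real signature. `RetinaCatalog`, `image_key`, and `select_training_keys` are likewise hypothetical names invented for this illustration.

```python
class CatalogClient:
    """Hypothetical stand-in for the DerivaML base class."""

    def __init__(self, hostname, catalog_id):
        self.hostname = hostname
        self.catalog_id = catalog_id

    def describe(self):
        return f"catalog {self.catalog_id} on {self.hostname}"


class RetinaCatalog(CatalogClient):
    """Hypothetical domain library: adds retinal-imaging helpers."""

    LATERALITIES = ("left", "right")

    def image_key(self, subject_id, laterality):
        # Domain rule lives here, next to the catalog, not in model code.
        if laterality not in self.LATERALITIES:
            raise ValueError(f"unknown laterality: {laterality!r}")
        return f"{subject_id}/{laterality}"


def select_training_keys(catalog, subject_ids):
    """Model-side code: uses only the domain methods, not catalog details."""
    return [
        catalog.image_key(subject, side)
        for subject in subject_ids
        for side in catalog.LATERALITIES
    ]


# Placeholder host and catalog id, for illustration only.
catalog = RetinaCatalog("example.org", "1")
keys = select_training_keys(catalog, ["S001", "S002"])
```

Here the model-side function never touches hostnames or catalog ids, so it could be reused against a different domain class that offers the same methods.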

Next Steps

Quick Start — Get running in 5 minutes
Execution Lifecycle — How executions work
Configuration — Hydra-zen configuration system
Datasets — Working with datasets