Architecture Overview

DerivaML is a Python library for running end-to-end machine learning experiments against data and metadata stored on the Deriva platform. It supports one continuous workflow: managing data catalogs, preprocessing, executing models, and documenting results.

Key Components

1. Data Catalog

The catalog includes both a domain schema and a standard ML schema.

[Figure: entity-relationship diagram (ERD) of the catalog]

  • Domain schema — Contains the data collected or generated by domain-specific experiments or systems (e.g., images, clinical records).
  • ML schema (deriva-ml) — Captures details of the ML development process:

    Table               Purpose
    Dataset             A data collection identified for training, validation, or testing
    Workflow            A specific sequence of computational steps or human interactions
    Execution           An instance of a workflow run at a specific time
    Execution Asset     An output file produced by an execution
    Execution Metadata  Environment metadata files referencing an execution
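
To make the relationships among these tables concrete, the sketch below models them as plain Python dataclasses. It is an illustration only, not the deriva-ml schema definition: the field names (role, inputs, started_at, assets, metadata) and the sample values are assumptions chosen for readability.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Dataset:
    name: str
    role: str  # assumed values: "training", "validation", or "testing"


@dataclass
class Workflow:
    name: str
    description: str  # the computational steps or human interactions


@dataclass
class ExecutionAsset:
    filename: str  # an output file produced by an execution


@dataclass
class ExecutionMetadata:
    filename: str  # an environment metadata file for an execution


@dataclass
class Execution:
    """One run of a workflow at a specific time."""

    workflow: Workflow
    inputs: list[Dataset]
    started_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    assets: list[ExecutionAsset] = field(default_factory=list)
    metadata: list[ExecutionMetadata] = field(default_factory=list)


# Hypothetical example values, not taken from a real catalog.
workflow = Workflow("train-classifier", "Fit a classifier on labelled images")
run = Execution(workflow, inputs=[Dataset("fundus-train", "training")])
run.assets.append(ExecutionAsset("model.pt"))
run.metadata.append(ExecutionMetadata("environment.json"))
```

Each Execution points at exactly one Workflow, while assets and metadata files hang off the Execution. That link is what allows an output to be traced back to the run that produced it.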

2. DerivaML Library

The core library provides:

  • Execution initialization and lifecycle management
  • ML execution context manager with automatic status tracking
  • Output upload to the catalog with provenance linking
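
The idea behind a context manager with automatic status tracking can be sketched in a few lines of standard-library Python. This is a minimal sketch, not the DerivaML API: ExecutionRecord, tracked_execution, and the status strings "Created", "Running", "Completed", and "Failed" are all hypothetical names, and nothing here contacts a catalog.

```python
from contextlib import contextmanager


class ExecutionRecord:
    """Hypothetical in-memory stand-in for an execution row in the catalog."""

    def __init__(self, workflow_name):
        self.workflow_name = workflow_name
        self.status = "Created"
        self.error = None
        self.outputs = []


@contextmanager
def tracked_execution(record):
    """Mark the record Running on entry, then Completed or Failed on exit."""
    record.status = "Running"
    try:
        yield record
    except Exception as exc:
        # Record the failure, then re-raise so the caller still sees it.
        record.status = "Failed"
        record.error = repr(exc)
        raise
    else:
        record.status = "Completed"


# A run that succeeds.
ok = ExecutionRecord("train")
with tracked_execution(ok) as run:
    run.outputs.append("model.pt")

# A run that fails part-way through.
failed = ExecutionRecord("train")
try:
    with tracked_execution(failed):
        raise ValueError("bad input")
except ValueError:
    pass
```

Because the status update lives in the context manager, model code cannot forget to mark a run as finished or failed. A real implementation would presumably write the status back to the catalog where this sketch only sets an attribute.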

3. Domain-Specific Libraries

Projects typically subclass DerivaML to add domain-specific functionality (e.g., EyeAI for retinal imaging). This pattern keeps catalog-specific code separate from model implementations.
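
The subclassing pattern can be sketched without depending on the real library. In the example below, CatalogClient is a hypothetical stand-in for the DerivaML base class, and RetinaCatalog, fetch_rows, images_with_label, and the "Image" table are assumptions made for illustration; none of them are taken from DerivaML or EyeAI.

```python
class CatalogClient:
    """Hypothetical stand-in for the DerivaML base class (in-memory only)."""

    def __init__(self, rows):
        # rows maps a table name to a list of row dictionaries.
        self._rows = rows

    def fetch_rows(self, table):
        return list(self._rows.get(table, []))


class RetinaCatalog(CatalogClient):
    """Domain-specific subclass: knows about image tables and labels."""

    def images_with_label(self, label):
        return [
            row["filename"]
            for row in self.fetch_rows("Image")
            if row["label"] == label
        ]


def count_examples(filenames):
    """Model-side code: receives plain Python data, never the catalog."""
    return len(filenames)


# Hypothetical sample data.
catalog = RetinaCatalog(
    {
        "Image": [
            {"filename": "a.png", "label": "healthy"},
            {"filename": "b.png", "label": "glaucoma"},
            {"filename": "c.png", "label": "healthy"},
        ]
    }
)
healthy = catalog.images_with_label("healthy")
n_healthy = count_examples(healthy)
```

The model-side function only ever sees a list of filenames, so it can be tested or reused without a catalog connection. All knowledge of table and column names stays in the subclass.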

Next Steps