Configuring and Running Executions
Executions are how DerivaML tracks ML workflow runs with full provenance. Every execution records:
- Inputs: Which datasets and assets were used
- Outputs: Which files and datasets were produced
- Timing: When the workflow started and stopped
- Status: Progress updates and completion state
Workflows
A Workflow represents a reusable computational process or analysis pipeline. Workflows are a key part of DerivaML's provenance model — every execution is linked to exactly one workflow, which records what code was run. Understanding workflows is essential before creating executions.
Workflow, Execution, and Workflow Type
These three concepts form a hierarchy:
- Workflow Type — A controlled vocabulary term that categorizes workflows (e.g., "Training", "Inference"). Managed in the Workflow_Type vocabulary table.
- Workflow — A reusable definition of a computational process. It records the source code location (URL), Git checksum, version, and type. A single workflow can be used by many executions.
- Execution — A specific run of a workflow at a particular time, with particular inputs and outputs. Each execution references exactly one workflow.
Workflow_Type (vocabulary)
└── Workflow (reusable definition)
    └── Execution (one specific run)
    └── Execution (another run)
    └── ...
Creating a Workflow
Use ml.create_workflow() to create a new workflow. This validates the workflow type
against the catalog vocabulary and returns a Workflow object:
workflow = ml.create_workflow(
name="ResNet50 Training",
workflow_type="Training",
description="Fine-tune ResNet50 on medical images"
)
You can call create_workflow() from any Python script, notebook, or interactive
session. Because DerivaML automatically detects the calling source code (see
Automatic Source Code Detection below),
the same create_workflow() call in the same committed script always produces
a Workflow with the same URL and checksum.
The returned Workflow object is not yet registered in the catalog. Registration
happens automatically when you pass it to ml.create_execution(), or you can
register it explicitly with ml.add_workflow(workflow). In either case, if a
workflow with the same URL or checksum already exists in the catalog, the existing
record is reused rather than creating a duplicate. This means you can place
create_workflow() at the top of every script or notebook without worrying
about duplicate records — running the same committed code repeatedly reuses
the same workflow.
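For example, a minimal sketch of explicit registration and reuse (assuming, as described under Reusing Existing Workflows below, that ml.add_workflow() returns the workflow's RID):
workflow = ml.create_workflow(
    name="ResNet50 Training",
    workflow_type="Training",
    description="Fine-tune ResNet50 on medical images"
)
rid_first = ml.add_workflow(workflow)   # registers the workflow in the catalog
rid_again = ml.add_workflow(workflow)   # same URL/checksum, so the existing record is reused
assert rid_first == rid_again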
Implicit Workflow Creation
When using the hydra-zen configuration system (via deriva-ml-run or
deriva-ml-run-notebook), you typically do not call create_workflow() at all.
Instead, you define a workflow configuration and the framework handles the rest.
See Running Models for the complete guide to setting up
and running with these tools.
# In your configs/workflow.py
from hydra_zen import builds, store
from deriva_ml.execution import Workflow
workflow_store = store(group="workflow")
workflow_store(
builds(
Workflow,
name="ResNet50 Training",
workflow_type="Training",
description="Fine-tune ResNet50 on medical images",
populate_full_signature=True,
),
name="resnet_training",
)
At runtime, hydra-zen instantiates the Workflow object from this configuration,
and run_model() passes it to create_execution() automatically. The source code
detection, catalog registration, and deduplication all happen behind the scenes.
From the command line, you select the workflow by name:
# Use a named workflow configuration
uv run deriva-ml-run workflow=resnet_training
# Or as part of an experiment preset that bundles workflow + model + data
uv run deriva-ml-run +experiment=resnet_imagenet
Automatic Source Code Detection
When a Workflow object is created (either via ml.create_workflow() or the Workflow
constructor), DerivaML automatically detects the source code that is creating the workflow
and records it for provenance. The detection works differently depending on the execution
environment.
Python Scripts
When running from a Python script (e.g., python train.py or uv run deriva-ml-run),
DerivaML identifies the script file, constructs a GitHub blob URL that includes the
current commit hash, and computes a Git object hash of the file content:
URL: https://github.com/org/repo/blob/a1b2c3d/src/models/train.py
Checksum: e5f6a7b8c9d0... (git hash-object of file content)
Version: 0.3.1 (from setuptools-scm or pyproject.toml)
If the script has uncommitted changes, DerivaML issues a warning. The URL still points to the last committed version, so the checksum may not match the code that actually ran. Committing before running ensures reproducibility.
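As a quick sanity check, you can inspect what was detected. A sketch, assuming the Workflow object exposes the url and checksum fields shown under Overriding Detection below:
workflow = ml.create_workflow(
    name="ResNet50 Training",
    workflow_type="Training",
    description="Fine-tune ResNet50 on medical images"
)
print(workflow.url)       # GitHub blob URL pinned to the current commit
print(workflow.checksum)  # git hash-object of the script content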
Jupyter Notebooks
When running inside a Jupyter notebook, DerivaML identifies the notebook file by
querying the running Jupyter server for the current kernel's notebook path. The
checksum is computed after stripping cell outputs with nbstripout, so re-running
a notebook without code changes produces the same checksum regardless of output
differences.
For notebooks launched via deriva-ml-run-notebook, the workflow URL and checksum are passed to the executing kernel through environment variables (DERIVA_ML_WORKFLOW_URL,
DERIVA_ML_WORKFLOW_CHECKSUM), so detection works even when the notebook is executed by a separate process.
To ensure clean checksums, install nbstripout in your repository:
pip install nbstripout
nbstripout --install
Docker Containers
When running inside a Docker container (with DERIVA_MCP_IN_DOCKER=true), there is
no local Git repository. Instead, DerivaML reads provenance from environment variables
set at image build time:
| Variable | Purpose |
|---|---|
| DERIVA_MCP_IMAGE_NAME | Docker image name (e.g., ghcr.io/org/repo) |
| DERIVA_MCP_IMAGE_DIGEST | Image digest (sha256:...) used as checksum |
| DERIVA_MCP_GIT_COMMIT | Git commit hash at build time (fallback checksum) |
| DERIVA_MCP_VERSION | Semantic version of the image |
Overriding Detection
You can bypass automatic detection by setting the url and checksum fields
explicitly when constructing a Workflow:
workflow = Workflow(
name="Custom Pipeline",
workflow_type="Training",
url="https://github.com/org/repo/blob/main/pipeline.py",
checksum="abc123def456",
)
Or by setting environment variables before the workflow is created:
export DERIVA_ML_WORKFLOW_URL="https://github.com/org/repo/blob/main/pipeline.py"
export DERIVA_ML_WORKFLOW_CHECKSUM="abc123def456"
Reusing Existing Workflows
If a workflow with the same URL or checksum already exists in the catalog,
ml.add_workflow() returns the existing workflow's RID rather than creating a duplicate.
This means running the same committed script multiple times reuses the same workflow record.
You can also look up existing workflows directly:
# Look up by RID
workflow = ml.lookup_workflow("2-ABC1")
# Look up by URL or Git checksum
workflow = ml.lookup_workflow_by_url("https://github.com/org/repo/blob/abc123/train.py")
# List all workflows in the catalog
all_workflows = ml.find_workflows()
for w in all_workflows:
print(f"{w.name} ({w.workflow_type}): {w.url}")
Updating Workflow Properties
Workflows retrieved from the catalog (via lookup_workflow, lookup_workflow_by_url, or
find_workflows) are bound to the catalog. You can update their description and
workflow_type properties, and the changes are written to the catalog immediately:
workflow = ml.lookup_workflow("2-ABC1")
workflow.description = "Updated: now includes data augmentation"
workflow.workflow_type = "Training" # Must be a valid Workflow_Type term
Workflow Types
Workflow types are controlled vocabulary terms that categorize workflows. Common types include:
| Type | Description |
|---|---|
| Training | Model training workflows |
| Inference | Running predictions on new data |
| Preprocessing | Data cleaning and transformation |
| Evaluation | Model evaluation and metrics |
| Annotation | Adding labels or features |
These are not fixed — add custom types for your project:
ml.add_term(
table="Workflow_Type",
term_name="Data_Augmentation",
description="Workflows that augment training data"
)
The workflow type must exist in the catalog before creating a workflow that uses it.
ml.create_workflow() validates this and raises a DerivaMLException if the type is
not found.
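If you prefer to handle a missing term at runtime, a hedged sketch (assuming DerivaMLException is importable from the deriva_ml package; the workflow name is illustrative):
from deriva_ml import DerivaMLException

try:
    workflow = ml.create_workflow(
        name="Augmentation Pipeline",
        workflow_type="Data_Augmentation",
        description="Generate augmented training images"
    )
except DerivaMLException:
    # Term not in the vocabulary yet: add it once, then retry
    ml.add_term(
        table="Workflow_Type",
        term_name="Data_Augmentation",
        description="Workflows that augment training data"
    )
    workflow = ml.create_workflow(
        name="Augmentation Pipeline",
        workflow_type="Data_Augmentation",
        description="Generate augmented training images"
    )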
Providing the Workflow to an Execution
A workflow must be provided when creating an execution. There are two ways:
# Option 1: In the ExecutionConfiguration
config = ExecutionConfiguration(
workflow=workflow,
description="Training run",
datasets=[DatasetSpec(rid="1-ABC")],
)
exe = ml.create_execution(config)
# Option 2: As a separate argument to create_execution
config = ExecutionConfiguration(
description="Training run",
datasets=[DatasetSpec(rid="1-ABC")],
)
exe = ml.create_execution(config, workflow=workflow)
If no workflow is provided in either place, a DerivaMLException is raised.
Execution Lifecycle
An execution proceeds through these steps:
┌─────────────────────────────────────────────────────────────────┐
│ 1. Create Execution Configuration │
│ - Specify workflow │
│ - Declare input datasets and assets │
├─────────────────────────────────────────────────────────────────┤
│ 2. Create and Start Execution │
│ - Context manager handles timing automatically │
│ - Input datasets/assets are recorded │
├─────────────────────────────────────────────────────────────────┤
│ 3. Run ML Workflow │
│ - Download datasets as needed │
│ - Process data, train models, run inference │
│ - Register output files with asset_file_path() │
├─────────────────────────────────────────────────────────────────┤
│ 4. Upload Outputs │
│ - Call upload_execution_outputs() after context exits │
│ - Files are uploaded to Hatrac object store │
│ - Provenance links are created │
└─────────────────────────────────────────────────────────────────┘
Execution Status Lifecycle
Each execution progresses through a series of status states. The transitions are managed automatically by the context manager, but you can also update status manually to report progress.
created ──► initializing ──► pending ──► running ──► completed
                                            │
                                            └──────► failed
| Status | Value | When Set | Description |
|---|---|---|---|
| created | "Created" | __init__ | Initial state after the Execution record is inserted into the catalog. |
| initializing | "Initializing" | _initialize_execution | Downloading input datasets and assets, saving configuration metadata. |
| pending | "Pending" | End of _initialize_execution | All inputs are downloaded and ready; waiting for the model to start. |
| running | "Running" | __enter__ / execution_start | The model function is executing. Also used for manual progress updates. |
| completed | "Completed" | __exit__ (no exception) | The execution finished successfully. Duration is recorded. |
| failed | "Failed" | __exit__ (exception raised) | An exception occurred during execution. The exception details are stored in Status_Detail. |
When using the context manager (with ml.create_execution(config) as exe:), the
transitions from running through completed or failed happen automatically.
You do not need to set completed or failed yourself.
To report progress within a long-running execution, call update_status with
Status.running and a descriptive message:
exe.update_status(Status.running, "Training epoch 5/10")
exe.update_status(Status.running, "Evaluating on validation set...")
These updates are written to the Status and Status_Detail columns of the
Execution table, making progress visible to anyone monitoring the catalog.
Execution Metadata
When an execution is created, DerivaML automatically generates several catalog records and metadata files for provenance tracking. Understanding what gets created helps when debugging or auditing workflow runs.
Catalog Records
The following records are created or updated during the execution lifecycle:
| Table | Record | Description |
|---|---|---|
| Execution | One row per execution | Contains RID, Description, Workflow (FK), Status, Status_Detail, and Duration. |
| Dataset_Execution | One row per input dataset | Links each input dataset to the execution, establishing the input provenance chain. |
| Execution_Asset_Execution | One row per input/output asset | Links assets to the execution with a role of "Input" or "Output". Created when assets are downloaded (inputs) or uploaded (outputs). |
Automatic Metadata Files
During initialization, DerivaML uploads several metadata files as Execution_Metadata
assets. Each file is tagged with an asset type from the ExecMetadataType vocabulary:
| Asset Type | File | Contents |
|---|---|---|
| Deriva_Config | configuration.json | The fully resolved ExecutionConfiguration, serialized as JSON. Includes datasets, assets, workflow reference, description, and Hydra config choices. |
| Hydra_Config | hydra-<timestamp>-*.yaml | Hydra YAML configuration files (config.yaml, overrides.yaml, etc.) captured from the Hydra runtime output directory. Only present when running via deriva-ml-run. |
| Execution_Config | uv.lock | Environment lock file from the project root, recording exact dependency versions for reproducibility. |
| Runtime_Env | environment_snapshot_<timestamp>.txt | A snapshot of the runtime environment: Python version, installed packages, OS details, GPU availability, and environment variables. |
These metadata files are uploaded immediately during initialization (before the model runs), so they are available even if the execution fails.
Configuration Choices
When running via Hydra (deriva-ml-run), the config_choices field in
ExecutionConfiguration captures which named config was selected for each
Hydra config group. For example:
{
"model_config": "cifar10_quick",
"datasets": "cifar10_small_labeled_split",
"workflow": "cifar10_training",
"deriva_ml": "default_deriva"
}
This makes it possible to reproduce a run by selecting the same config choices, rather than reconstructing individual parameter values.
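For example, a run whose config_choices matched the JSON above could be reproduced by re-selecting the same named configs on the command line (the group and config names here are the illustrative ones from the example):
# Reproduce the run by re-selecting the same config choices
uv run deriva-ml-run model_config=cifar10_quick datasets=cifar10_small_labeled_split workflow=cifar10_training deriva_ml=default_deriva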
Creating an Execution Configuration
The ExecutionConfiguration specifies what inputs your workflow will use:
from deriva_ml import DerivaML
from deriva_ml.execution import ExecutionConfiguration
from deriva_ml.dataset.aux_classes import DatasetSpec
ml = DerivaML(hostname, catalog_id)
# Create a workflow definition
workflow = ml.create_workflow(
name="ResNet50 Training",
workflow_type="Training",
description="Train ResNet50 on image classification task"
)
# Configure the execution
config = ExecutionConfiguration(
workflow=workflow,
description="Training run with augmented data",
datasets=[
DatasetSpec(rid="1-ABC"), # Use current version
DatasetSpec(rid="1-DEF", version="1.2.0"), # Use specific version
],
assets=["2-GHI", "2-JKL"], # Additional input asset RIDs
)
DatasetSpec Options
| Parameter | Type | Default | Description |
|---|---|---|---|
| rid | str | required | Dataset RID |
| version | str | None | Specific version (default: current) |
| materialize | bool | True | Download asset files (False = metadata only) |
# Download with all files
DatasetSpec(rid="1-ABC", materialize=True)
# Download metadata only (faster for large datasets)
DatasetSpec(rid="1-ABC", materialize=False)
# Use specific version
DatasetSpec(rid="1-ABC", version="2.1.0")
AssetSpec and Asset Caching
Execution assets (model weights, pretrained checkpoints, etc.) are specified as a
list in ExecutionConfiguration.assets. Each entry can be a bare RID string or an
AssetSpec object:
from deriva_ml.asset.aux_classes import AssetSpec
config = ExecutionConfiguration(
workflow=workflow,
description="Inference with cached weights",
datasets=[DatasetSpec(rid="1-ABC")],
assets=[
AssetSpec(rid="2-GHI", cache=True), # Cached by MD5
"2-JKL", # Not cached (bare RID)
],
)
When cache=True, the downloaded asset is stored in the DerivaML cache directory
at cache_dir/assets/{rid}_{md5}/. On subsequent executions, the cached copy is
reused if the MD5 checksum in the catalog record still matches. The cached file
is symlinked into the execution's working directory, avoiding redundant downloads.
| Behavior | cache=False (default) | cache=True |
|---|---|---|
| First download | Downloads to execution directory | Downloads to cache, symlinks to execution directory |
| Subsequent runs | Downloads again | Reuses cached copy if MD5 matches |
| Stale detection | N/A | Compares MD5 against catalog record |
Use cache=True for large, immutable assets like pretrained model weights. Avoid
caching assets that change frequently, as stale cache entries are only detected by
MD5 mismatch (not by timestamp).
For hydra-zen configurations, use AssetSpecConfig:
from hydra_zen import store
from deriva_ml.asset.aux_classes import AssetSpecConfig

asset_store = store(group="assets")  # group name is illustrative; match your project's config layout
asset_store(
    [AssetSpecConfig(rid="6-EPNR", cache=True)],
    name="pretrained_weights",
)
Running an Execution
Use the context manager pattern for automatic timing:
# Create execution with context manager
with ml.create_execution(config) as exe:
print(f"Execution RID: {exe.execution_rid}")
print(f"Working directory: {exe.working_dir}")
# Download input datasets
bag = exe.download_dataset_bag(DatasetSpec(rid="1-ABC"))
# Access dataset elements
images = bag.list_dataset_members()["Image"]
for img in images:
# img["Filename"] contains local path to the file
process_image(img["Filename"])
# Train your model
model = train_model(images)
# Register output files
model_path = exe.asset_file_path("Model", "best_model.pt")
torch.save(model.state_dict(), model_path)
metrics_path = exe.asset_file_path("Execution_Metadata", "metrics.json")
with open(metrics_path, 'w') as f:
json.dump({"accuracy": 0.95}, f)
# IMPORTANT: Upload after context exits
exe.upload_execution_outputs()
What the Context Manager Does
- On entry: Records start time, sets status to "running"
- On exit: Records stop time, calculates duration
- Exception handling: If an exception occurs, status is set to "failed" (see the sketch below)
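A minimal sketch of the failure path, assuming the context manager records the failure and then re-raises, as standard Python context managers do:
try:
    with ml.create_execution(config) as exe:
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass  # the Execution record is now marked "Failed"; details land in Status_Detail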
Why Upload is Separate
upload_execution_outputs() is called outside the context manager because:
- Upload can be done asynchronously for large files
- You can inspect outputs before uploading (as shown below)
- Partial uploads can be retried if they fail
- Even failed executions should upload partial results
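For example, you can review the staged files before anything leaves the machine. A sketch, assuming working_dir is a pathlib.Path (run_workflow is a hypothetical stand-in for your model code):
with ml.create_execution(config) as exe:
    run_workflow(exe)  # registers outputs via exe.asset_file_path(...)

# Inspect staged outputs before committing them to the catalog
for path in sorted(exe.working_dir.rglob("*")):
    if path.is_file():
        print(path.relative_to(exe.working_dir))

exe.upload_execution_outputs()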
Tuning Uploads for Large Files
When uploading large files (e.g., model checkpoints > 1 GB), the default timeouts may
not be sufficient. upload_execution_outputs() accepts parameters to control upload
behavior:
# Default behavior (50 MB chunks, 10 min timeout per chunk, 3 retries)
exe.upload_execution_outputs()
# Increase timeout for large files on slow connections (30 min per chunk)
exe.upload_execution_outputs(timeout=(1800, 1800))
# Use smaller chunks if timeouts persist (25 MB chunks)
exe.upload_execution_outputs(chunk_size=25 * 1024 * 1024)
# More retries with longer initial delay
exe.upload_execution_outputs(max_retries=5, retry_delay=10.0)
# Combined: large file on slow connection
exe.upload_execution_outputs(
timeout=(1800, 1800), # 30 min per chunk (both write and read)
chunk_size=25 * 1024 * 1024, # 25 MB chunks (smaller = faster per chunk)
max_retries=5, # 5 retry attempts
retry_delay=10.0, # 10s initial delay (doubles each retry)
)
!!! note
The timeout tuple is (connect_timeout, read_timeout). However, urllib3 uses
connect_timeout as the socket timeout when writing the request body (uploading
chunk data). Both values should be set large enough for a full chunk to be
transferred over your network.
| Parameter | Default | Description |
|---|---|---|
| timeout | (600, 600) | (connect_timeout, read_timeout) in seconds per chunk |
| chunk_size | 50 MB | Chunk size in bytes for hatrac uploads |
| max_retries | 3 | Maximum retry attempts for failed uploads |
| retry_delay | 5.0 | Initial delay between retries (doubles each attempt) |
Registering Output Files
Use asset_file_path() to register files for upload. Each call records the file in a persistent manifest for crash safety:
with ml.create_execution(config) as exe:
# Method 1: Get a path for a new file
output_path = exe.asset_file_path(
asset_name="Model", # Target asset table
file_name="model.pt" # Filename to create
)
torch.save(model, output_path) # Write to the returned path
# Method 2: Stage an existing file
exe.asset_file_path(
asset_name="Image",
file_name="/path/to/existing/file.png", # Existing file
copy_file=True # Copy (default: symlink)
)
# Method 3: Rename during upload
exe.asset_file_path(
asset_name="Image",
file_name="/path/to/temp.png",
rename_file="processed_image.png"
)
# Method 4: Apply asset types
exe.asset_file_path(
asset_name="Image",
file_name="mask.png",
asset_types=["Segmentation_Mask", "Derived"]
)
# Method 5: Provide metadata column values
path = exe.asset_file_path(
asset_name="Image",
file_name="scan.jpg",
metadata={"Subject": subject_rid, "Acquisition_Date": "2026-01-15"}
)
# Metadata can also be updated after registration:
path.set_metadata("Acquisition_Time", "14:30:00")
Updating Status
Report progress during long-running workflows:
from deriva_ml.core.definitions import Status
with ml.create_execution(config) as exe:
exe.update_status(Status.running, "Loading data...")
data = load_data()
exe.update_status(Status.running, "Training model...")
for epoch in range(100):
train_epoch(model, data)
exe.update_status(Status.running, f"Epoch {epoch+1}/100 complete")
exe.update_status(Status.running, "Saving model...")
Creating Output Datasets
If your workflow produces a new curated dataset:
with ml.create_execution(config) as exe:
# Process data and generate outputs
processed_rids = process_data(input_data)
# Create a new dataset linked to this execution
output_dataset = exe.create_dataset(
description="Augmented training images",
dataset_types=["Training", "Augmented"]
)
# Add processed items to the output dataset
output_dataset.add_dataset_members(processed_rids)
exe.upload_execution_outputs()
Restoring Executions
Resume working with a previous execution:
# Restore by RID
exe = ml.restore_execution("1-XYZ")
# Continue working
exe.asset_file_path("Model", "continued_model.pt")
exe.upload_execution_outputs()
Nested Executions
Executions can be organized in a parent-child hierarchy. This is useful for grouping related runs such as parameter sweeps, cross-validation folds, or multi-stage pipelines.
How Nesting Works
A parent execution acts as a container that links to one or more child executions
through the Execution_Execution association table. Each child can have an optional
Sequence number to record ordering.
Parent Execution (sweep description, no input datasets)
├── Child 0: learning_rate=0.001 (sequence=0)
├── Child 1: learning_rate=0.01 (sequence=1)
└── Child 2: learning_rate=0.1 (sequence=2)
The parent typically has a descriptive summary (often markdown-formatted) and no input datasets of its own. The children carry the actual inputs, outputs, and execution logic.
Automatic Nesting with Multirun
When using deriva-ml-run with Hydra's multirun mode, nesting is handled
automatically. When the first job starts, the runner creates a parent execution and links
every job, including the first, as a child with an incrementing sequence number:
# In configs/multiruns.py
from deriva_ml.execution import multirun_config
multirun_config(
"lr_sweep",
overrides=[
"+experiment=cifar10_default",
"model_config.learning_rate=0.001,0.01,0.1",
],
description="""## Learning Rate Sweep
Comparing three learning rates on the default CIFAR-10 configuration.
| Learning Rate | Expected Behavior |
|--------------|-------------------|
| 0.001 | Standard baseline |
| 0.01 | Fast convergence |
| 0.1 | Potentially unstable |
""",
)
Run from the command line:
uv run deriva-ml-run +multirun=lr_sweep
This creates 4 executions: 1 parent + 3 children (one per learning rate value). The parent stores the markdown description; each child records its own inputs, outputs, configuration choices, and timing.
Manual Nesting
You can also create parent-child relationships manually:
# Create parent execution
parent_config = ExecutionConfiguration(
workflow=workflow,
description="Cross-validation experiment (5 folds)",
)
parent = ml.create_execution(parent_config)
parent.execution_start()
# Create and link child executions
for fold in range(5):
child_config = ExecutionConfiguration(
workflow=workflow,
description=f"Fold {fold}",
datasets=[DatasetSpec(rid=fold_datasets[fold])],
)
with ml.create_execution(child_config) as child:
parent.add_nested_execution(child, sequence=fold)
# ... run fold ...
parent.execution_stop()
parent.upload_execution_outputs()
Querying Nested Executions
Navigate the hierarchy using list_nested_executions:
# List direct children
children = parent.list_nested_executions()
for child in children:
print(f" [{child.status}] {child.description}")
# List all descendants (children, grandchildren, etc.)
all_descendants = parent.list_nested_executions(recurse=True)
Each item in the returned list is an ExecutionRecord with properties like
execution_rid, status, and description. To get a full Execution object
with lifecycle management, use ml.restore_execution(child.execution_rid).
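For example, a sketch that reopens a child execution to add outputs to it (the file name is hypothetical):
children = parent.list_nested_executions()
child = ml.restore_execution(children[0].execution_rid)
child.asset_file_path("Model", "fold0_model.pt")
child.upload_execution_outputs()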
Complete Example
from deriva_ml import DerivaML
from deriva_ml.execution import ExecutionConfiguration
from deriva_ml.dataset.aux_classes import DatasetSpec
from deriva_ml.core.definitions import Status
import torch
import json
# Connect to catalog
ml = DerivaML("your-server.org", "1")
# Define workflow
workflow = ml.create_workflow(
name="Image Classifier Training v3",
workflow_type="Training",
description="Train CNN classifier on medical images"
)
# Configure execution
config = ExecutionConfiguration(
workflow=workflow,
description="Training with learning rate 0.001",
datasets=[DatasetSpec(rid="1-ABC")],
)
# Run execution
with ml.create_execution(config) as exe:
# Download training data
bag = exe.download_dataset_bag(DatasetSpec(rid="1-ABC"))
train_loader = create_dataloader(bag)
# Train model
model = ResNet50()
for epoch in range(50):
loss = train_epoch(model, train_loader)
exe.update_status(Status.running, f"Epoch {epoch}: loss={loss:.4f}")
# Save model
model_path = exe.asset_file_path("Model", "classifier.pt")
torch.save(model.state_dict(), model_path)
# Save metrics
metrics_path = exe.asset_file_path("Execution_Metadata", "training_metrics.json")
with open(metrics_path, 'w') as f:
json.dump({"final_loss": loss, "epochs": 50}, f)
# Upload all outputs
exe.upload_execution_outputs()
print(f"Execution complete: {exe.execution_rid}")