Hydra-zen Configuration
DerivaML integrates with hydra-zen for configuration management, enabling reproducible ML workflows with structured, composable configurations.
Overview
Hydra-zen provides a Pythonic way to configure applications using dataclasses and structured configs. DerivaML leverages this for:
- Environment Configuration: Different settings for dev/staging/production
- Dataset Collections: Named groups of datasets for experiments
- Execution Parameters: Reproducible execution configurations
- Working Directory Management: Automatic Hydra output organization
Quick Start
from hydra_zen import builds, instantiate, store
from deriva_ml import DerivaML, DerivaMLConfig
# Create a structured config for DerivaML
DerivaMLConf = builds(DerivaMLConfig, populate_full_signature=True)
# Configure for your environment
conf = DerivaMLConf(
hostname='deriva.example.org',
catalog_id='42',
domain_schemas='my_domain',
)
# Instantiate to get a DerivaMLConfig, then create DerivaML
config = instantiate(conf)
ml = DerivaML.instantiate(config)
Configuration Classes
DerivaMLConfig
The main configuration class for DerivaML instances:
from deriva_ml import DerivaMLConfig
from hydra_zen import builds
DerivaMLConf = builds(DerivaMLConfig, populate_full_signature=True)
conf = DerivaMLConf(
hostname='example.org', # Deriva server hostname
catalog_id='1', # Catalog ID or name
domain_schemas='my_project', # Domain schema name
working_dir='/shared/workspace', # Base working directory
use_minid=True, # Use MINID for dataset bags
check_auth=True, # Verify authentication
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| hostname | str | required | Deriva server hostname |
| catalog_id | str/int | 1 | Catalog identifier |
| domain_schemas | str \| set[str] | None | Domain schema(s) (auto-detected if None) |
| working_dir | str/Path | None | Base directory for outputs |
| ml_schema | str | "deriva-ml" | ML schema name |
| use_minid | bool | True | Use MINID service for datasets |
| check_auth | bool | True | Verify authentication on connect |
DatasetSpecConfig
For configuring dataset specifications in execution configurations:
from deriva_ml.dataset import DatasetSpecConfig
# Create dataset specs for an experiment
training_data = DatasetSpecConfig(
rid="1ABC",
version="1.0.0",
materialize=True, # Download asset files
description="Training images"
)
metadata_only = DatasetSpecConfig(
rid="2DEF",
version="2.0.0",
materialize=False, # Only download table data
)
# Override download timeout for large datasets (connect, read) in seconds
large_data = DatasetSpecConfig(
rid="3GHI",
version="1.0.0",
timeout=[10, 1800], # 30-minute read timeout
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| rid | str | required | Dataset RID |
| version | str | required | Semantic version (e.g., "1.2.0") |
| materialize | bool | True | Download asset files |
| description | str | "" | Description for logging |
| timeout | list[int] | None | Download timeout as [connect, read] in seconds. Default: [10, 610] |
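The fallback behavior of the timeout parameter can be sketched in plain Python. This is a minimal illustration, assuming that a value of None resolves to the documented default of [10, 610]; `resolve_timeout` and `DEFAULT_TIMEOUT` are hypothetical names and not part of DerivaML's API.

```python
from typing import Optional, Sequence, Tuple

# Documented default: 10 s to connect, 610 s to read.
# Assumption: this pair is what applies when timeout is None.
DEFAULT_TIMEOUT: Tuple[int, int] = (10, 610)


def resolve_timeout(timeout: Optional[Sequence[int]]) -> Tuple[int, int]:
    """Return a (connect, read) pair in seconds, falling back to the default."""
    if timeout is None:
        return DEFAULT_TIMEOUT
    if len(timeout) != 2:
        raise ValueError("timeout must be [connect, read]")
    connect, read = timeout
    return int(connect), int(read)


print(resolve_timeout(None))        # (10, 610)
print(resolve_timeout([10, 1800]))  # (10, 1800)
```

A two-element pair keeps the connect phase short, so an unreachable server fails quickly, while giving large downloads a generous read window.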
AssetRIDConfig
For configuring input assets (model weights, config files, etc.):
from deriva_ml.execution import AssetRIDConfig
# Define input assets
model_weights = AssetRIDConfig(rid="WXYZ", description="Pretrained model")
config_file = AssetRIDConfig(rid="ABCD", description="Hyperparameters")
assets = [model_weights, config_file]
Working Directory Configuration
DerivaML automatically configures Hydra's output directory based on your working_dir setting:
conf = DerivaMLConf(
hostname='deriva.example.org',
working_dir='/shared/ml_workspace', # Custom working directory
)
The output structure is:
{working_dir}/{username}/deriva-ml/hydra/{timestamp}/
For example:
/shared/ml_workspace/jsmith/deriva-ml/hydra/2024-01-15_10-30-45/
This ensures that:
- Each user has an isolated workspace
- Outputs are organized by timestamp
- Hydra config files are preserved for reproducibility
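The directory layout above can be reproduced with a few lines of standard-library Python. This is only a sketch of the path pattern: `hydra_output_dir` is a hypothetical helper, and the timestamp format is inferred from the example path rather than taken from DerivaML's implementation.

```python
from datetime import datetime
from pathlib import PurePosixPath


def hydra_output_dir(working_dir: str, username: str, now: datetime) -> PurePosixPath:
    """Compose {working_dir}/{username}/deriva-ml/hydra/{timestamp}."""
    # Assumed format, chosen to match the example path shown in this section
    timestamp = now.strftime("%Y-%m-%d_%H-%M-%S")
    return PurePosixPath(working_dir) / username / "deriva-ml" / "hydra" / timestamp


run_dir = hydra_output_dir(
    "/shared/ml_workspace", "jsmith", datetime(2024, 1, 15, 10, 30, 45)
)
print(run_dir)  # /shared/ml_workspace/jsmith/deriva-ml/hydra/2024-01-15_10-30-45
```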
Using the Hydra Store
The hydra-zen store allows you to register named configurations:
Environment Configurations
from hydra_zen import builds, store
from deriva_ml import DerivaMLConfig
DerivaMLConf = builds(DerivaMLConfig, populate_full_signature=True)
# Register different environments
deriva_store = store(group="deriva_ml")
deriva_store(DerivaMLConf(
hostname='dev.example.org',
catalog_id='1',
use_minid=False,
), name='dev')
deriva_store(DerivaMLConf(
hostname='prod.example.org',
catalog_id='100',
use_minid=True,
), name='prod')
Dataset Collections
from hydra_zen import store
from deriva_ml.dataset import DatasetSpecConfig
# Define dataset collections
training_v1 = [
DatasetSpecConfig(rid="TRNA", version="1.0.0"),
DatasetSpecConfig(rid="TRNB", version="1.0.0"),
]
training_v2 = [
DatasetSpecConfig(rid="TRNA", version="2.0.0"),
DatasetSpecConfig(rid="TRNB", version="2.0.0"),
DatasetSpecConfig(rid="TRNC", version="1.0.0"),
]
# Store them
datasets_store = store(group="datasets")
datasets_store(training_v1, name="training_v1")
datasets_store(training_v2, name="training_v2")
Asset Collections
from hydra_zen import store
from deriva_ml.execution import AssetRIDConfig
# Define asset collections
resnet_weights = [
AssetRIDConfig(rid="RSN1", description="ResNet50 pretrained"),
]
vit_weights = [
AssetRIDConfig(rid="VIT1", description="ViT-B/16 pretrained"),
AssetRIDConfig(rid="VIT2", description="ViT fine-tuned"),
]
# Store them
assets_store = store(group="assets")
assets_store(resnet_weights, name="resnet")
assets_store(vit_weights, name="vit")
Complete Execution Configuration
Combine all components for a full execution configuration:
from hydra_zen import builds, instantiate, make_config, store
from deriva_ml import DerivaML, DerivaMLConfig
from deriva_ml.execution import ExecutionConfiguration, Workflow
from deriva_ml.dataset import DatasetSpecConfig
# Build configs
DerivaMLConf = builds(DerivaMLConfig, populate_full_signature=True)
ExecConf = builds(ExecutionConfiguration, populate_full_signature=True)
WorkflowConf = builds(
Workflow,
name="Image Classification",
workflow_type="Training",
description="Train CNN classifier",
populate_full_signature=True
)
# Create combined application config
AppConfig = make_config(
deriva_ml=DerivaMLConf(hostname="example.org", catalog_id="1"),
execution=ExecConf(
description="Training run",
datasets=[
DatasetSpecConfig(rid="DATA", version="1.0.0"),
],
assets=["WGTS"],
),
workflow=WorkflowConf,
)
# Use in your application
def train(cfg: AppConfig):
    # Instantiate configs
    ml_config = instantiate(cfg.deriva_ml)
    exec_config = instantiate(cfg.execution)
    # Create DerivaML instance
    ml = DerivaML.instantiate(ml_config)
    # Run execution
    with ml.create_execution(exec_config) as exe:
        # ... training code ...
        pass
Using with Hydra CLI
DerivaML provides deriva-ml-run which handles
Hydra configuration composition, config module loading, and execution tracking
automatically. See Running Models and Notebooks
for the recommended workflow.
The examples below show the underlying Hydra CLI patterns for reference:
# Use default config
python train.py
# Override hostname
python train.py deriva_ml.hostname=staging.example.org
# Use different dataset collection
python train.py +datasets=training_v2
# Multi-run with different configs
python train.py --multirun +datasets=training_v1,training_v2
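To make the dotted override syntax concrete, the toy function below applies a `key.subkey=value` string to a nested dictionary. It is a deliberately simplified sketch: `apply_override` is a hypothetical helper, and Hydra's real override grammar also covers typed values, config-group selection, the `+` prefix for adding keys, and sweeps.

```python
from typing import Any, Dict


def apply_override(config: Dict[str, Any], override: str) -> None:
    """Set a nested value in place from a 'dotted.key=value' string."""
    dotted_key, _, value = override.partition("=")
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        node = node[name]  # raises KeyError if the path does not exist
    if leaf not in node:
        # Like Hydra, refuse to create new keys silently
        raise KeyError(f"unknown key: {dotted_key}")
    node[leaf] = value


cfg = {"deriva_ml": {"hostname": "deriva.example.org", "catalog_id": "1"}}
apply_override(cfg, "deriva_ml.hostname=staging.example.org")
print(cfg["deriva_ml"]["hostname"])  # staging.example.org
```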
Example: Complete ML Script
!!! note
In practice, create_model_config() and deriva-ml-run handle the
boilerplate shown below automatically. See
Running Models and Notebooks for the
recommended approach.
"""train.py - Example training script with hydra-zen configuration."""
from hydra_zen import builds, instantiate, make_config, store, zen
from deriva_ml import DerivaML, DerivaMLConfig
from deriva_ml.execution import ExecutionConfiguration
from deriva_ml.dataset import DatasetSpecConfig
# Define configs
DerivaMLConf = builds(DerivaMLConfig, populate_full_signature=True)
ExecConf = builds(ExecutionConfiguration, populate_full_signature=True)
# Store environment configs
store(DerivaMLConf(hostname="localhost", catalog_id=1), group="deriva_ml", name="local")
store(DerivaMLConf(hostname="prod.org", catalog_id=100), group="deriva_ml", name="prod")
# Store dataset configs
store([DatasetSpecConfig(rid="DATA", version="1.0.0")], group="datasets", name="default")
# Main config combining all groups
Config = make_config(
defaults=[
"_self_",
{"deriva_ml": "local"},
{"datasets": "default"},
],
deriva_ml=DerivaMLConf,
datasets=list,
description="Training run",
)
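@zen
def main(zen_cfg):
    # zen wraps a function, not a config; the reserved zen_cfg parameter
    # receives the full composed config
    # Instantiate DerivaML
    ml_config = instantiate(zen_cfg.deriva_ml)
    ml = DerivaML.instantiate(ml_config)
    # Create execution config
    exec_config = ExecutionConfiguration(
        description=zen_cfg.description,
        datasets=[instantiate(d) for d in zen_cfg.datasets],
    )
    # Run
    with ml.create_execution(exec_config) as exe:
        for ds in exe.datasets:
            bag = exe.download_dataset_bag(ds)
            # Process data...
    exe.upload_execution_outputs()
if __name__ == "__main__":
    # Register the top-level config under a name Hydra can look up
    store(Config, name="config")
    store.add_to_hydra_store()
    main.hydra_main(config_name="config", version_base="1.3", config_path=None)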
Configuring ML Models with DerivaML
A powerful pattern is to use zen_partial=True to create partially configured model functions that receive the DerivaML Execution object at runtime. This allows you to:
- Configure model hyperparameters via Hydra
- Access datasets and assets through the execution object
- Keep model code separate from configuration
Model Protocol
Define a protocol for models that integrate with DerivaML:
# models/model_protocol.py
from typing import Protocol, Any, runtime_checkable
from deriva_ml.execution import Execution
from deriva_ml import DerivaML
@runtime_checkable
class DerivaMLModel(Protocol):
    def __call__(
        self,
        *args: Any,
        ml_instance: DerivaML,
        execution: Execution,
        **kwargs: Any,
    ) -> None:
        """Interface for functions that integrate DerivaML with ML frameworks."""
        ...
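One caveat when relying on a runtime-checkable protocol: `isinstance()` only verifies that the protocol's members exist, not that their signatures match. The standalone sketch below demonstrates this with a stand-in protocol (`ModelLike` and both functions are hypothetical names used for illustration), so static type checking remains the place where signature mismatches are caught.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class ModelLike(Protocol):
    """Stand-in for a callable-model protocol (hypothetical name)."""

    def __call__(self, *args: Any, **kwargs: Any) -> None: ...


def good_model(*, ml_instance: Any, execution: Any) -> None:
    """Accepts the runtime arguments a model is expected to receive."""


def wrong_signature(x: int) -> int:
    """A callable that does not accept ml_instance or execution."""
    return x


# isinstance() only checks that __call__ exists, not its parameters
print(isinstance(good_model, ModelLike))       # True
print(isinstance(wrong_signature, ModelLike))  # True
print(isinstance(42, ModelLike))               # False
```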
Model Implementation
Create model functions that follow the protocol:
# models/my_model.py
from deriva_ml.execution import Execution
from deriva_ml import DerivaML, MLAsset, ExecAssetType
def train_classifier(
    learning_rate: float,
    epochs: int,
    batch_size: int,
    ml_instance: DerivaML,
    execution: Execution | None = None,
) -> None:
    """Train a classifier using DerivaML execution context.

    Args:
        learning_rate: Learning rate for optimizer
        epochs: Number of training epochs
        batch_size: Training batch size
        ml_instance: DerivaML instance for catalog access
        execution: Execution object with datasets, assets, and working directory
    """
    # Access input assets (e.g., pretrained weights)
    for table, assets in execution.asset_paths.items():
        print(f"Loading assets from {table}:")
        for asset in assets:
            print(f" {asset}")
    # Access datasets
    for dataset in execution.datasets:
        bag = execution.download_dataset_bag(dataset)
        # Process dataset...
    # Your training code here
    print(f"Training with lr={learning_rate}, epochs={epochs}, batch={batch_size}")
    # Register output files for upload
    model_path = execution.asset_file_path(
        MLAsset.execution_asset,
        "trained_model.pt",
        ExecAssetType.output_file,
    )
    # Save model to model_path...
Model Configuration with zen_partial
Use zen_partial=True to create a partially applied function:
# configs/model.py
from hydra_zen import builds, store
from models.my_model import train_classifier
# Build the base configuration with zen_partial=True
# This creates a callable that waits for ml_instance and execution
ModelConfig = builds(
train_classifier,
learning_rate=1e-3,
epochs=10,
batch_size=32,
populate_full_signature=True,
zen_partial=True, # Key: creates partial function
)
# Register configurations
model_store = store(group="model_config")
model_store(ModelConfig, name="default")
# Create variants by overriding specific parameters
model_store(ModelConfig, name="fast_training", epochs=5, learning_rate=1e-2)
model_store(ModelConfig, name="long_training", epochs=100, learning_rate=1e-4)
model_store(ModelConfig, name="large_batch", batch_size=128, epochs=50)
Model Runner
Create a runner that instantiates the partial config with execution context:
# model_runner.py
import logging
from typing import Any
from deriva_ml import DerivaML, DerivaMLConfig, RID
from deriva_ml.dataset import DatasetSpec
from deriva_ml.execution import ExecutionConfiguration, Workflow
def run_model(
    deriva_ml: DerivaMLConfig,
    datasets: list[DatasetSpec],
    assets: list[RID],
    description: str,
    workflow: Workflow,
    model_config: Any,  # Partially configured model callable
    dry_run: bool = False,
) -> None:
    """Execute a configured model with DerivaML.

    Args:
        deriva_ml: DerivaML connection configuration
        datasets: List of dataset specifications
        assets: List of input asset RIDs
        description: Execution description
        workflow: Workflow definition
        model_config: Partially configured model (from zen_partial)
        dry_run: If True, don't record execution in catalog
    """
    # Connect to catalog
    ml_instance = DerivaML.instantiate(deriva_ml)
    # Create execution configuration
    execution_config = ExecutionConfiguration(
        datasets=datasets,
        assets=assets,
        description=description,
    )
    execution = ml_instance.create_execution(
        execution_config,
        workflow=workflow,
        dry_run=dry_run,
    )
    with execution.execute() as exe:
        # Complete the partial function with runtime arguments
        model_config(ml_instance=ml_instance, execution=exe)
    # Upload outputs after execution completes
    execution.upload_execution_outputs()
Main Script
Tie everything together with a Hydra entry point:
# train.py
from hydra_zen import store, zen, builds
from model_runner import run_model
# Build main application config with defaults
app_config = builds(
run_model,
description="Model training run",
populate_full_signature=True,
hydra_defaults=[
"_self_",
{"deriva_ml": "default"},
{"datasets": "training"},
{"assets": "weights"},
{"workflow": "training_workflow"},
{"model_config": "default"},
],
)
store(app_config, name="train_app")
# Import config modules to register them
import configs.deriva # noqa: F401
import configs.datasets # noqa: F401
import configs.assets # noqa: F401
import configs.workflow # noqa: F401
import configs.model # noqa: F401
if __name__ == "__main__":
    store.add_to_hydra_store()
    zen(run_model).hydra_main(
        config_name="train_app",
        version_base="1.3",
        config_path=None,
    )
Running the Model
# Run with defaults
python train.py
# Override model config
python train.py model_config=long_training
# Override multiple parameters
python train.py model_config=fast_training datasets=validation
# Override individual hyperparameters
python train.py model_config.epochs=25 model_config.learning_rate=0.001
# Multi-run experiments
python train.py --multirun model_config=default,fast_training,long_training
Key Benefits of zen_partial
- Separation of concerns: Model hyperparameters are configured separately from runtime context
- Deferred execution: The model function isn't called until ml_instance and execution are available
- Config variants: Easy to create model variants by overriding specific parameters
- CLI flexibility: All hyperparameters are exposed to Hydra's CLI
- Reproducibility: Full configuration is logged by Hydra
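The deferred-execution idea can be seen without Hydra at all, since instantiating a config built with zen_partial=True yields a `functools.partial` object. The sketch below uses only the standard library; the stand-in `train_classifier` returns a summary string instead of training, and the string placeholders for ml_instance and execution are purely illustrative.

```python
from functools import partial
from typing import Any


def train_classifier(learning_rate: float, epochs: int, batch_size: int,
                     ml_instance: Any, execution: Any) -> str:
    """Stand-in model function; returns a summary instead of training."""
    return f"lr={learning_rate}, epochs={epochs}, batch={batch_size}, exec={execution}"


# Configuration time: bind hyperparameters only
configured = partial(train_classifier, learning_rate=1e-3, epochs=10, batch_size=32)

# Variant: override selected hyperparameters, keep the rest
fast = partial(configured, epochs=5, learning_rate=1e-2)

# Run time: supply the context objects (plain strings here as placeholders)
print(configured(ml_instance="ml", execution="exe-1"))
print(fast(ml_instance="ml", execution="exe-2"))
```

Nothing runs until the final two calls, which is what lets the runner create the execution first and hand it to the model afterwards.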
Configuration Descriptions
Adding descriptions to configurations helps users and AI assistants understand and discover the right configurations. DerivaML provides two mechanisms depending on the configuration type:
For List-Based Configs (Assets, Datasets)
Use with_description() to wrap lists of RIDs or DatasetSpecConfig objects:
from hydra_zen import store
from deriva_ml.dataset import DatasetSpecConfig
from deriva_ml.execution import with_description
# Datasets with descriptions
datasets_store = store(group="datasets")
datasets_store(
with_description(
[DatasetSpecConfig(rid="28D4", version="0.22.0")],
"Split dataset with 10,000 images (5,000 train + 5,000 test). "
"Testing images are unlabeled. Use for standard train/test workflows."
),
name="cifar10_split",
)
# Assets with descriptions
asset_store = store(group="assets")
asset_store(
with_description(
["3WMG", "3XPA"],
"Model weights from quick (3WMG) and extended (3XPA) training runs. "
"Use for comparison experiments."
),
name="comparison_weights",
)
# Empty default
asset_store(
with_description([], "No assets - empty default configuration"),
name="default_asset",
)
After instantiation, config.datasets and config.assets behave like regular lists but have a .description attribute:
# Normal list operations work
for dataset in config.datasets:
    print(dataset.rid)
# Access description
print(config.assets.description) # "Model weights from quick..."
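A list that carries a description can be modeled as a small `list` subclass. The class below is a hypothetical sketch of that behavior, written to show why ordinary list operations keep working; it is not DerivaML's actual implementation of with_description().

```python
from typing import Any, Iterable


class DescribedList(list):
    """A list that also carries a human-readable description (hypothetical)."""

    def __init__(self, items: Iterable[Any] = (), description: str = "") -> None:
        super().__init__(items)
        self.description = description


assets = DescribedList(["3WMG", "3XPA"], "Weights from quick and extended runs")

# Indexing, len(), and iteration are inherited from list
print(len(assets), assets[0])    # 2 3WMG
print(isinstance(assets, list))  # True
print(assets.description)        # Weights from quick and extended runs
```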
For Model Configs (builds())
Use the zen_meta parameter when storing builds() configs:
from hydra_zen import builds, store
from models.my_model import train_classifier
model_store = store(group="model_config")
ModelConfig = builds(
train_classifier,
learning_rate=1e-3,
epochs=10,
populate_full_signature=True,
zen_partial=True,
)
# Add description via zen_meta
model_store(
ModelConfig,
name="default_model",
zen_meta={
"description": (
"Default training config: 10 epochs, lr=1e-3. "
"Balanced for standard training runs."
)
},
)
# Variant with description
model_store(
ModelConfig,
name="quick",
epochs=3,
batch_size=128,
zen_meta={
"description": (
"Quick validation: 3 epochs, batch 128. "
"Use for rapid iteration and debugging."
)
},
)
Summary: When to Use Each Mechanism
| Config Type | Storage Pattern | Description Mechanism |
|---|---|---|
| Assets (RID lists) | store(["RID1", "RID2"], ...) | with_description(items, desc) |
| Datasets (DatasetSpecConfig lists) | store([DatasetSpecConfig(...)], ...) | with_description(items, desc) |
| Model configs | store(builds(...), ...) | zen_meta={"description": desc} |
| Workflow configs | store(builds(Workflow, ...), ...) | zen_meta={"description": desc} |
Writing Good Descriptions
Include:
- What it contains: Size, types, key parameters
- Where it came from: Source execution, version
- When to use it: Training, testing, debugging, production
Examples:
# ✓ Good dataset description
"Training dataset with 5,000 labeled CIFAR-10 images (32x32 RGB). "
"All images have ground truth classifications."
# ✓ Good asset description
"Model weights (model.pt) from extended training: 50 epochs, "
"64→128 channels, dropout 0.25. Use for inference or fine-tuning."
# ✓ Good model config description
"Quick training: 3 epochs, batch 128. Use for rapid iteration "
"and verifying the training pipeline works."
# ✗ Bad (too vague)
"Training dataset"
"Model weights"
"Quick config"
Best Practices
- Use builds() with populate_full_signature=True to expose all parameters
- Use zen_partial=True for model functions that need runtime context
- Store related configs in the same group for easy composition
- Use descriptive names for stored configurations
- Set working_dir for reproducible output locations
- Use DatasetSpecConfig instead of building DatasetSpec directly for cleaner configs
- Use AssetRIDConfig for consistent asset specification
- Define a model protocol for consistent model interfaces across your project
- Always add descriptions using with_description() for lists or zen_meta for builds
Related Documentation
- Running Models and Notebooks — Practical guide for project setup and CLI usage
- CLI Reference — All DerivaML command-line tools
- Execution Configuration — Execution lifecycle and workflows
- Git Workflow and Versioning — Reproducibility guidelines
- Datasets
- Hydra-zen Documentation
- Hydra Documentation