CLI Reference

DerivaML provides several command-line tools for running ML workflows, managing versions, and administering catalogs. All commands are installed as console scripts when you install the deriva-ml package.

When using a project managed with uv, prefix commands with uv run:

uv run deriva-ml-run +experiment=my_experiment

Command Overview

| Command | Description |
| --- | --- |
| deriva-ml-run | Execute ML models with Hydra configuration and execution tracking |
| deriva-ml-run-notebook | Execute Jupyter notebooks with parameter injection and tracking |
| bump-version | Bump semantic version tags and push to remote |
| deriva-ml-install-kernel | Install a Jupyter kernel for the current virtual environment |
| deriva-ml-split-dataset | Split a dataset into training and testing subsets |
| deriva-ml-create-schema | Create the DerivaML schema in a catalog |
| deriva-ml-check-catalog-schema | Validate a catalog's schema against DerivaML requirements |
| deriva-ml-table-comments-utils | Update table and column comments from documentation files |
| create-demo-catalog | Create a demo catalog with sample data for testing |
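
To check which of these console scripts a given environment actually exposes, the installed distribution's entry points can be inspected. This is a minimal sketch, assuming the distribution is named deriva-ml as stated above; the helper name `console_scripts` is hypothetical and not part of DerivaML.

```python
from importlib.metadata import PackageNotFoundError, distribution


def console_scripts(dist_name: str) -> list[str]:
    """Return the console-script names registered by an installed distribution."""
    try:
        dist = distribution(dist_name)
    except PackageNotFoundError:
        # The package is not installed in this environment.
        return []
    return sorted(ep.name for ep in dist.entry_points if ep.group == "console_scripts")


if __name__ == "__main__":
    for name in console_scripts("deriva-ml"):
        print(name)
```

An empty result means the package is not installed in the interpreter that ran the check, which is a common cause of "command not found" errors when the wrong virtual environment is active.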

Model and Notebook Execution

deriva-ml-run

Execute ML models with Hydra-zen configuration composition and automatic execution tracking in a Deriva catalog.

Synopsis:

deriva-ml-run [--host HOST] [--catalog CATALOG] [--config-dir DIR]
              [--config-name NAME] [--info] [--multirun|-m] [OVERRIDES...]

Arguments:

| Argument | Default | Description |
| --- | --- | --- |
| --host HOST | (from config) | Deriva server hostname |
| --catalog CATALOG | (from config) | Catalog ID or identifier |
| --config-dir DIR, -c | src/configs | Path to the configs directory |
| --config-name NAME | deriva_model | Name of the main Hydra-zen config |
| --info | | Display available configuration groups and options |
| --multirun, -m | | Enable Hydra multirun mode for parameter sweeps |
| OVERRIDES | | Hydra-zen configuration overrides (positional) |

Examples:

# Run with default configuration
uv run deriva-ml-run

# Override a config group
uv run deriva-ml-run model_config=my_model datasets=full_training

# Override individual parameters
uv run deriva-ml-run model_config.epochs=50 model_config.learning_rate=0.001

# Use an experiment preset
uv run deriva-ml-run +experiment=cifar10_quick

# Dry run (download inputs, skip catalog writes)
uv run deriva-ml-run dry_run=true

# Show all available configs
uv run deriva-ml-run --info

# Override host and catalog from command line
uv run deriva-ml-run --host prod.example.org --catalog 100

# Multirun with comma-separated values
uv run deriva-ml-run --multirun model_config.learning_rate=0.0001,0.001,0.01

# Named multirun configuration
uv run deriva-ml-run +multirun=lr_sweep

# Named multirun with additional overrides
uv run deriva-ml-run +multirun=lr_sweep model_config.epochs=5
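
Hydra's multirun mode launches one job per combination of the comma-separated values. The sketch below is neither DerivaML nor Hydra code; it only illustrates that expansion for simple `key=a,b,c` overrides (quoting and sweep functions such as `range()` are not handled), and `expand_sweep` is a hypothetical helper.

```python
from itertools import product


def expand_sweep(overrides: list[str]) -> list[list[str]]:
    """Expand comma-separated override values into one override list per job."""
    keys, choices = [], []
    for override in overrides:
        key, _, value = override.partition("=")
        keys.append(key)
        choices.append(value.split(","))
    # Cartesian product: every combination of values becomes a separate job.
    return [[f"{k}={v}" for k, v in zip(keys, combo)] for combo in product(*choices)]


jobs = expand_sweep(
    ["model_config.learning_rate=0.0001,0.001,0.01", "model_config.epochs=5,10"]
)
for number, job in enumerate(jobs):
    print(number, " ".join(job))
```

Three learning rates combined with two epoch settings yield six jobs, so sweep sizes grow multiplicatively with each swept parameter.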

See also: Running Models


deriva-ml-run-notebook

Execute Jupyter notebooks with parameter injection, automatic kernel detection, and execution tracking. The executed notebook and a Markdown conversion are uploaded to the catalog as execution assets.

Synopsis:

deriva-ml-run-notebook NOTEBOOK [--host HOST] [--catalog CATALOG]
                       [--file FILE] [--parameter KEY VALUE]
                       [--kernel KERNEL] [--inspect] [--info]
                       [--log-output] [OVERRIDES...]

Arguments:

| Argument | Default | Description |
| --- | --- | --- |
| NOTEBOOK | (required) | Path to the .ipynb notebook file |
| --host HOST | (from config) | Deriva server hostname |
| --catalog CATALOG | (from config) | Catalog ID or identifier |
| --file FILE, -f | | JSON or YAML file with parameter values |
| --parameter KEY VALUE, -p | | Parameter name and value to inject (repeatable) |
| --kernel KERNEL, -k | (auto-detected) | Jupyter kernel name |
| --inspect | | Display notebook parameters and exit |
| --info | | Display available Hydra configuration groups |
| --log-output | | Stream cell outputs during execution |
| OVERRIDES | | Hydra-zen configuration overrides (positional) |
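
When notebook runs are launched from a script, the command line can be assembled from a parameter dictionary. The sketch below uses only the flags listed above; `build_notebook_command` and `write_parameter_file` are hypothetical helpers, and the notebook path and parameter names are placeholders.

```python
import json
from pathlib import Path


def build_notebook_command(notebook, parameters=None, parameter_file=None, overrides=()):
    """Assemble an argument list for deriva-ml-run-notebook (documented flags only)."""
    command = ["uv", "run", "deriva-ml-run-notebook", str(notebook)]
    if parameter_file is not None:
        command += ["--file", str(parameter_file)]
    for key, value in (parameters or {}).items():
        # Each -p takes two separate arguments: the name and the value.
        command += ["-p", str(key), str(value)]
    command += list(overrides)
    return command


def write_parameter_file(path, parameters):
    """Write parameters as JSON, one of the two formats --file accepts."""
    Path(path).write_text(json.dumps(parameters, indent=2))
    return Path(path)


if __name__ == "__main__":
    print(build_notebook_command("notebooks/train.ipynb", {"learning_rate": 0.001, "epochs": 50}))
```

The returned list can be passed directly to `subprocess.run(command, check=True)`, which avoids shell-quoting problems with parameter values.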

Environment Variables Set During Execution:

| Variable | Purpose |
| --- | --- |
| DERIVA_ML_WORKFLOW_URL | Git URL or local path to the notebook source |
| DERIVA_ML_WORKFLOW_CHECKSUM | MD5 checksum of the notebook file |
| DERIVA_ML_NOTEBOOK_PATH | Absolute filesystem path to the notebook |
| DERIVA_ML_SAVE_EXECUTION_RID | Path where the notebook saves execution metadata |
| DERIVA_ML_HYDRA_OVERRIDES | JSON-encoded list of Hydra overrides |
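
Inside the notebook, these variables can be read with the standard library. This is a minimal sketch: the variable names come from the table above, while the `read_execution_context` helper and the fallback to an empty override list are assumptions made for illustration.

```python
import json
import os


def read_execution_context(environ=os.environ):
    """Collect the variables deriva-ml-run-notebook sets, if present."""
    raw_overrides = environ.get("DERIVA_ML_HYDRA_OVERRIDES")
    return {
        "workflow_url": environ.get("DERIVA_ML_WORKFLOW_URL"),
        "workflow_checksum": environ.get("DERIVA_ML_WORKFLOW_CHECKSUM"),
        "notebook_path": environ.get("DERIVA_ML_NOTEBOOK_PATH"),
        "save_execution_rid": environ.get("DERIVA_ML_SAVE_EXECUTION_RID"),
        # The overrides arrive as a JSON-encoded list; assume empty when unset.
        "hydra_overrides": json.loads(raw_overrides) if raw_overrides else [],
    }


if __name__ == "__main__":
    print(read_execution_context())
```

Values come back as `None` when the notebook is opened interactively rather than through the command, so the same notebook can detect which mode it is running in.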

Examples:

# Run a notebook with default configuration
uv run deriva-ml-run-notebook notebooks/analyze_results.ipynb

# Override Hydra config groups (positional overrides)
uv run deriva-ml-run-notebook notebooks/analysis.ipynb \
    assets=my_assets deriva_ml=production

# Inject parameters into the notebook's parameter cell
uv run deriva-ml-run-notebook notebooks/train.ipynb \
    -p learning_rate 0.001 -p epochs 50

# Load parameters from a YAML file
uv run deriva-ml-run-notebook notebooks/train.ipynb --file params.yaml

# Inspect available notebook parameters without running
uv run deriva-ml-run-notebook notebooks/train.ipynb --inspect

# Show available Hydra config groups
uv run deriva-ml-run-notebook notebooks/analysis.ipynb --info

# Stream notebook output to terminal
uv run deriva-ml-run-notebook notebooks/train.ipynb --log-output

# Override host and catalog
uv run deriva-ml-run-notebook notebooks/analysis.ipynb \
    --host prod.example.org --catalog 100

See also: Running Models, Notebook Configuration


Development Tools

bump-version

Manage semantic version tags for your project. Creates an initial tag if none exists, or bumps the existing version using bump-my-version.

This tool works with setuptools_scm, which derives the version dynamically from git tags, so there is no hardcoded version string in the source code.

Synopsis:

bump-version [patch|minor|major]

Arguments:

| Argument | Default | Description |
| --- | --- | --- |
| patch\|minor\|major | patch | Which semantic version component to increment |

Environment Variables:

| Variable | Default | Description |
| --- | --- | --- |
| START | 0.1.0 | Initial version if no tag exists |
| PREFIX | v | Tag prefix (e.g., v for tags like v1.2.3) |
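
The bump itself follows standard semantic-versioning arithmetic. The sketch below is illustrative only and does not reproduce what bump-my-version does internally; `next_tag` is a hypothetical helper whose `start` and `prefix` arguments mirror the START and PREFIX variables above.

```python
def next_tag(current_tag, part="patch", start="0.1.0", prefix="v"):
    """Return the tag that a patch, minor, or major bump would produce."""
    if current_tag is None:
        # No tag yet: the initial tag is PREFIX + START.
        return f"{prefix}{start}"
    version = current_tag[len(prefix):] if current_tag.startswith(prefix) else current_tag
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        major, minor, patch = major + 1, 0, 0
    elif part == "minor":
        minor, patch = minor + 1, 0
    elif part == "patch":
        patch += 1
    else:
        raise ValueError(f"unknown version part: {part!r}")
    return f"{prefix}{major}.{minor}.{patch}"


print(next_tag(None))               # v0.1.0
print(next_tag("v1.2.3"))           # v1.2.4
print(next_tag("v1.2.3", "minor"))  # v1.3.0
print(next_tag("v1.2.3", "major"))  # v2.0.0
```

Bumping a component resets every component to its right to zero, which is why a minor bump of 1.2.3 gives 1.3.0 rather than 1.3.3.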

How Versioning Works:

  • At a tag: Version is clean, e.g., 1.2.3
  • After a tag: Includes distance and commit hash, e.g., 1.2.3.post2+g1234abc
  • Dirty working tree: Adds .dirty suffix

Examples:

# Bump patch version (1.2.3 -> 1.2.4)
uv run bump-version

# Bump minor version (1.2.3 -> 1.3.0)
uv run bump-version minor

# Bump major version (1.2.3 -> 2.0.0)
uv run bump-version major

# Check current version
uv run python -m setuptools_scm

Requirements: git, uv, and bump-my-version configured in pyproject.toml.


deriva-ml-install-kernel

Install a Jupyter kernel for the current virtual environment. This allows Jupyter notebooks to use the DerivaML environment with all its dependencies.

Synopsis:

deriva-ml-install-kernel [--install-local]

Arguments:

| Argument | Description |
| --- | --- |
| --install-local | Install kernel to the venv's prefix directory instead of the user's Jupyter data directory |

The kernel name and display name are derived from the virtual environment's prompt setting in pyvenv.cfg.
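
pyvenv.cfg is a plain `key = value` file, so the prompt can be read without special tooling. This is a minimal sketch: `read_venv_prompt` is a hypothetical helper, and stripping surrounding quotes is an assumption to cover tools that write the value quoted.

```python
from pathlib import Path


def read_venv_prompt(venv_dir):
    """Return the prompt recorded in <venv_dir>/pyvenv.cfg, or None if absent."""
    config_file = Path(venv_dir) / "pyvenv.cfg"
    if not config_file.is_file():
        return None
    for line in config_file.read_text().splitlines():
        key, separator, value = line.partition("=")
        if separator and key.strip() == "prompt":
            # Some tools write the value quoted, e.g. prompt = 'my-ml-project'.
            return value.strip().strip("'\"")
    return None


if __name__ == "__main__":
    print(read_venv_prompt(".venv"))
```

Running this against a project's .venv directory before installing the kernel shows the name the kernel is expected to receive.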

Example Workflow:

# Create virtual environment with a name
uv venv --prompt my-ml-project

# Activate it
source .venv/bin/activate

# Install the Jupyter kernel
uv run deriva-ml-install-kernel
# Output: Installed Jupyter kernel 'my-ml-project' with display name 'Python (my-ml-project)'

# The kernel now appears in Jupyter's kernel selector
jupyter lab

Kernel location (default, without --install-local): ~/.local/share/jupyter/kernels/ (Linux), ~/Library/Jupyter/kernels/ (macOS), or %APPDATA%\jupyter\kernels\ (Windows).


Data Operations

deriva-ml-split-dataset

Split a DerivaML dataset into training and testing subsets. Follows scikit-learn conventions for split parameters and supports stratified splitting.

Synopsis:

deriva-ml-split-dataset --hostname HOST --catalog-id ID --dataset-rid RID
                        [--test-size SIZE] [--train-size SIZE] [--seed SEED]
                        [--stratify-by-column COL] [--element-table TABLE]
                        [--include-tables TABLES] [--training-types TYPES]
                        [--testing-types TYPES] [--description DESC]
                        [--workflow-type TYPE] [--dry-run] [--show-urls]
                        [--no-shuffle] [--domain-schema SCHEMA]

Arguments:

| Argument | Default | Description |
| --- | --- | --- |
| --hostname | (required) | Deriva server hostname |
| --catalog-id | (required) | Catalog ID to connect to |
| --dataset-rid | (required) | RID of the source dataset to split |
| --domain-schema | (auto-detected) | Domain schema name |
| --test-size | 0.2 | Test set size as fraction (0-1) or absolute count |
| --train-size | (complement) | Train set size as fraction (0-1) or absolute count |
| --seed | 42 | Random seed for reproducibility |
| --no-shuffle | | Do not shuffle before splitting |
| --stratify-by-column | | Column name for stratified splitting (requires --include-tables) |
| --element-table | (auto-detected) | Element table to split (e.g., Image) |
| --include-tables | | Comma-separated tables for denormalization (required for stratified splitting) |
| --training-types | Labeled | Comma-separated dataset types for the training set |
| --testing-types | Labeled | Comma-separated dataset types for the testing set |
| --description | | Description for the parent split dataset |
| --workflow-type | Dataset_Split | Workflow type vocabulary term |
| --dry-run | | Print the split plan without modifying the catalog |
| --show-urls | | Show Chaise web interface URLs for created datasets |

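
The fraction-or-count convention for --test-size and --train-size can be made concrete with a small calculation. The sketch below follows scikit-learn's rounding (ceiling for a fractional test size, floor for a fractional train size); whether the command rounds identically is an assumption, and `resolve_split_sizes` is a hypothetical helper.

```python
import math


def resolve_split_sizes(n_samples, test_size=0.2, train_size=None):
    """Turn fraction-or-count sizes into absolute (n_train, n_test) counts."""

    def to_count(size, rounding):
        # Floats are fractions of the dataset; ints are absolute counts.
        return rounding(size * n_samples) if isinstance(size, float) else int(size)

    n_test = to_count(test_size, math.ceil)
    # With no explicit train size, training takes the complement of the test set.
    n_train = n_samples - n_test if train_size is None else to_count(train_size, math.floor)
    if n_train <= 0 or n_test <= 0 or n_train + n_test > n_samples:
        raise ValueError(f"invalid split for {n_samples} samples: train={n_train}, test={n_test}")
    return n_train, n_test


print(resolve_split_sizes(500))              # (400, 100)
print(resolve_split_sizes(500, 100, 400))    # (400, 100)
print(resolve_split_sizes(1000, 0.25, 0.5))  # (500, 250)
```

When both sizes are given and sum to less than the dataset, the remaining records belong to neither subset, as in the last call above.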
Examples:

# Simple random 80/20 split
uv run deriva-ml-split-dataset --hostname localhost --catalog-id 9 \
    --dataset-rid 28D0

# Stratified split by class label
uv run deriva-ml-split-dataset --hostname localhost --catalog-id 9 \
    --dataset-rid 28D0 \
    --stratify-by-column Image_Classification_Image_Class \
    --include-tables Image,Image_Classification

# Fixed-count split
uv run deriva-ml-split-dataset --hostname localhost --catalog-id 9 \
    --dataset-rid 28D0 --train-size 400 --test-size 100

# Dry run (show plan without modifying catalog)
uv run deriva-ml-split-dataset --hostname localhost --catalog-id 9 \
    --dataset-rid 28D0 --dry-run
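
Stratified splitting keeps each class's share the same in both subsets. The sketch below shows the idea in plain Python on in-memory (id, label) pairs; it is not the command's implementation, the record IDs and labels are made up, and `stratified_split` is a hypothetical helper.

```python
import random
from collections import defaultdict


def stratified_split(records, test_fraction=0.2, seed=42):
    """Split (record_id, label) pairs so each label keeps its proportion."""
    by_label = defaultdict(list)
    for record_id, label in records:
        by_label[label].append(record_id)

    rng = random.Random(seed)  # Fixed seed makes the split reproducible.
    train, test = [], []
    for label in sorted(by_label):
        members = by_label[label]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Made-up data: 80 "cat" records and 20 "dog" records.
records = [(f"cat-{i}", "cat") for i in range(80)] + [(f"dog-{i}", "dog") for i in range(20)]
train, test = stratified_split(records)
print(len(train), len(test))
```

With a 0.2 test fraction, the test set receives 16 of the 80 "cat" records and 4 of the 20 "dog" records, so the 4:1 class ratio is preserved in both subsets.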

create-demo-catalog

Create a demonstration catalog with sample data for testing and development.

Synopsis:

create-demo-catalog --host HOST [--domain-schema SCHEMA]

Arguments:

| Argument | Default | Description |
| --- | --- | --- |
| --host | (required) | Deriva server hostname |
| --domain-schema | demo-schema | Name for the domain schema |

This command is primarily used for development and testing of DerivaML itself.


Catalog Administration

deriva-ml-create-schema

Create the DerivaML schema in a Deriva catalog. This is typically run once when setting up a new catalog for ML workflows.

Synopsis:

deriva-ml-create-schema HOSTNAME PROJECT_NAME SCHEMA_NAME --curie_prefix PREFIX

Arguments:

| Argument | Default | Description |
| --- | --- | --- |
| HOSTNAME | (required) | Deriva server hostname |
| PROJECT_NAME | (required) | Project name for the catalog |
| SCHEMA_NAME | deriva-ml | Schema name |
| --curie_prefix | (required) | CURIE prefix for identifiers |

deriva-ml-check-catalog-schema

Validate a catalog's schema against the DerivaML reference schema. Reports any missing tables, columns, or configuration issues.

Synopsis:

deriva-ml-check-catalog-schema --host HOST [--catalog CATALOG] [--dump]

Arguments:

| Argument | Default | Description |
| --- | --- | --- |
| --host | (required) | Deriva server hostname |
| --catalog | 1 | Catalog number |
| --dump | | Dump schema details |
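
Conceptually, such a check is a set difference between a reference schema and what the catalog contains. The sketch below is a toy illustration on plain dictionaries, not the command's implementation; the table and column names are invented, and `missing_schema_elements` is a hypothetical helper.

```python
def missing_schema_elements(reference, actual):
    """Compare {table: set_of_columns} mappings and list what `actual` lacks."""
    missing_tables = sorted(set(reference) - set(actual))
    missing_columns = {
        table: sorted(reference[table] - actual[table])
        for table in reference
        if table in actual and reference[table] - actual[table]
    }
    return missing_tables, missing_columns


# Invented example schemas for illustration only.
reference = {"Dataset": {"RID", "Description"}, "Workflow": {"RID", "URL"}}
actual = {"Dataset": {"RID"}}
print(missing_schema_elements(reference, actual))
```

Columns of a table that is missing entirely are not listed separately, since the missing-table entry already covers them.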

deriva-ml-table-comments-utils

Update table and column comments in a catalog from file-based documentation. This is an administrative utility for maintaining schema documentation.

Synopsis:

deriva-ml-table-comments-utils --host HOST [--catalog CATALOG]

This command uses Deriva's BaseCLI for standard host/catalog arguments.