DerivaML Class
The DerivaML class provides a range of methods to interact with a Deriva catalog.
These methods assume that the catalog contains both a deriva-ml schema and a domain schema.
Data Catalog: The catalog must include both the domain schema and a standard ML schema for effective data management.

- Domain schema: The domain schema includes the data collected or generated by domain-specific experiments or systems.
- ML schema: Each entity in the ML schema is designed to capture details of the ML development process. It includes the following tables:
- A Dataset represents a data collection, such as an aggregation of records identified for training, validation, and testing purposes.
- A Workflow represents a specific sequence of computational steps or human interactions.
- An Execution is an instance of a workflow that a user instantiates at a specific time.
- An Execution Asset is an output file that results from the execution of a workflow.
- An Execution Metadata is an asset entity for saving metadata files referencing a given execution.
Core module for DerivaML.
This module provides the primary public interface to DerivaML functionality. It exports the main DerivaML class along with configuration, definitions, and exceptions needed for interacting with Deriva-based ML catalogs.
Key exports
- DerivaML: Main class for catalog operations and ML workflow management.
- DerivaMLConfig: Configuration class for DerivaML instances.
- Exceptions: DerivaMLException and specialized exception types.
- Definitions: Type definitions, enums, and constants used throughout the package.
Example
```python
from deriva_ml.core import DerivaML, DerivaMLConfig

ml = DerivaML('deriva.example.org', 'my_catalog')
datasets = ml.find_datasets()
```
BuiltinTypes
module-attribute
BuiltinTypes = BuiltinType
Alias for BuiltinType from deriva.core.typed.
This maintains backwards compatibility with existing DerivaML code that uses the plural form 'BuiltinTypes'. New code should use BuiltinType directly.
ColumnDefinition
module-attribute
ColumnDefinition = ColumnDef
Alias for ColumnDef from deriva.core.typed.
This maintains backwards compatibility with existing DerivaML code. New code should use ColumnDef directly.
TableDefinition
module-attribute
TableDefinition = TableDef
Alias for TableDef from deriva.core.typed.
This maintains backwards compatibility with existing DerivaML code. New code should use TableDef directly.
DerivaML
Bases: PathBuilderMixin, RidResolutionMixin, VocabularyMixin, WorkflowMixin, FeatureMixin, DatasetMixin, AssetMixin, ExecutionMixin, FileMixin, AnnotationMixin, DerivaMLCatalog
Core class for machine learning operations on a Deriva catalog.
This class provides core functionality for managing ML workflows, features, and datasets in a Deriva catalog. It handles data versioning, feature management, vocabulary control, and execution tracking.
Attributes:
| Name | Type | Description |
|---|---|---|
| host_name | str | Hostname of the Deriva server (e.g., 'deriva.example.org'). |
| catalog_id | Union[str, int] | Catalog identifier or name. |
| domain_schema | str | Schema name for domain-specific tables and relationships. |
| model | DerivaModel | ERMRest model for the catalog. |
| working_dir | Path | Directory for storing computation data and results. |
| cache_dir | Path | Directory for caching downloaded datasets. |
| ml_schema | str | Schema name for ML-specific tables (default: 'deriva_ml'). |
| configuration | ExecutionConfiguration | Current execution configuration. |
| project_name | str | Name of the current project. |
| start_time | datetime | Timestamp when this instance was created. |
Example
```python
ml = DerivaML('deriva.example.org', 'my_catalog')
ml.create_feature('my_table', 'new_feature')
ml.add_term('vocabulary_table', 'new_term', description='Description of term')
```
Source code in src/deriva_ml/core/base.py
catalog_provenance
property
```python
catalog_provenance: CatalogProvenance | None
```
Get the provenance information for this catalog.
Returns provenance information if the catalog has it set. This includes information about how the catalog was created (clone, create, schema), who created it, when, and any workflow information.
For cloned catalogs, additional details about the clone operation are
available in the clone_details attribute.
Returns:
| Type | Description |
|---|---|
| CatalogProvenance \| None | CatalogProvenance if available, None otherwise. |
Example
```python
ml = DerivaML('localhost', '45')
prov = ml.catalog_provenance
if prov:
    print(f"Created: {prov.created_at} by {prov.created_by}")
    print(f"Method: {prov.creation_method.value}")
    if prov.is_clone:
        print(f"Cloned from: {prov.clone_details.source_hostname}")
```
mode
property
mode: ConnectionMode
Current connection mode.
Returns:
| Type | Description |
|---|---|
| ConnectionMode | The ConnectionMode this DerivaML instance was constructed with. Drives whether writes go live to the catalog (online) or stage in SQLite for later upload (offline). See spec §2.1. |
Example
```python
>>> ml.mode is ConnectionMode.online
True
```
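The online/offline behavior described above can be sketched with a stand-in enum. This is only an illustration of the two documented modes; the real ConnectionMode class lives in deriva_ml and is not redefined by user code:

```python
from enum import Enum


class ConnectionMode(Enum):
    # Stand-in for deriva_ml's ConnectionMode; only the two
    # documented modes are modeled here.
    online = "online"    # writes go live to the catalog
    offline = "offline"  # writes are staged in SQLite for later upload


def write_path(mode: ConnectionMode) -> str:
    """Describe where writes go for a given connection mode."""
    if mode is ConnectionMode.online:
        return "live catalog"
    return "local SQLite staging"


print(write_path(ConnectionMode.online))   # live catalog
print(write_path(ConnectionMode.offline))  # local SQLite staging
```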
working_data
property
working_data: Path
Return the working data directory path.
.. deprecated::
working_data is deprecated and will be removed in the next
major version. Use working_dir instead.
``working_dir`` is the canonical attribute; it is set during
execution initialization and contains all output assets, metadata,
and intermediate files for the current execution.
Returns:
| Type | Description |
|---|---|
| Path | Path to the working data directory (same as working_dir). |
Raises:
| Type | Description |
|---|---|
| DeprecationWarning | Always emitted at access time. |
Example
```python
exe.working_dir  # use this instead  # doctest: +SKIP
```
workspace
property
workspace: 'Workspace'
Per-catalog Workspace for local caching, denormalization, and asset manifests.
Backed by Workspace under {working_dir}/catalogs/{host}__{cat}/
working.db. Shared across invocations of scripts that use the same
working directory.
Example::

```python
# Cache a full table
df = ml.cache_table("Subject")
# Check what's cached
ml.workspace.list_cached_results()
```
__del__
__del__() -> None
Cleanup method to handle incomplete executions.
Best-effort abort on DerivaML shutdown. The previous implementation
used the legacy Status enum; the new ExecutionStatus lifecycle
separates Stopped/Uploaded/Aborted/Failed. Here we only attempt to
abort if the execution hasn't already reached a terminal state —
InvalidTransitionError from the state machine covers the rest.
Source code in src/deriva_ml/core/base.py
__init__
```python
__init__(
    hostname: str,
    catalog_id: str | int,
    domain_schemas: str | set[str] | None = None,
    default_schema: str | None = None,
    project_name: str | None = None,
    cache_dir: str | Path | None = None,
    working_dir: str | Path | None = None,
    hydra_runtime_output_dir: str | Path | None = None,
    ml_schema: str = ML_SCHEMA,
    logging_level: int = logging.WARNING,
    deriva_logging_level: int = logging.WARNING,
    credential: dict | None = None,
    s3_bucket: str | None = None,
    use_minid: bool | None = None,
    check_auth: bool = True,
    clean_execution_dir: bool = True,
    mode: ConnectionMode | str = ConnectionMode.online,
) -> None
```
Initializes a DerivaML instance.
This method will connect to a catalog and initialize local configuration for the ML execution. This class is intended to be used as a base class on which domain-specific interfaces are built.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| hostname | str | Hostname of the Deriva server. | required |
| catalog_id | str \| int | Catalog ID. Either an identifier or a catalog name. | required |
| domain_schemas | str \| set[str] \| None | Optional set of domain schema names. If None, auto-detects all non-system schemas. Use this when working with catalogs that have multiple user-defined schemas. | None |
| default_schema | str \| None | The default schema for table creation operations. If None and there is exactly one domain schema, that schema is used. If there are multiple domain schemas, this must be specified for table creation to work without explicit schema parameters. | None |
| ml_schema | str | Schema name for the ML schema. Used if you have a non-standard configuration of deriva-ml. | ML_SCHEMA |
| project_name | str \| None | Project name. Defaults to the name of default_schema. | None |
| cache_dir | str \| Path \| None | Directory path for caching data downloaded from the Deriva server as bdbag. If not provided, will default to working_dir. | None |
| working_dir | str \| Path \| None | Directory path for storing data used by or generated by any computations. If no value is provided, will default to ${HOME}/deriva_ml. | None |
| s3_bucket | str \| None | S3 bucket URL for dataset bag storage (e.g., 's3://my-bucket'). If provided, enables MINID creation and S3 upload for dataset exports. If None, MINID functionality is disabled regardless of the use_minid setting. | None |
| use_minid | bool \| None | Use the MINID service when downloading dataset bags. Only effective when s3_bucket is configured. If None (default), automatically set to True when s3_bucket is provided, False otherwise. | None |
| check_auth | bool | Check if the user has access to the catalog. | True |
| clean_execution_dir | bool | Whether to automatically clean up execution working directories after successful upload. Defaults to True. Set to False to retain local copies. | True |
| mode | ConnectionMode \| str | Connection mode for this instance. | online |
Source code in src/deriva_ml/core/base.py
add_dataset_element_type
```python
add_dataset_element_type(element: str | Table) -> Table
```
Makes it possible to add objects from the element table to a dataset.
Creates a new association table linking Dataset to the given table, then updates catalog annotations so the new type is included in bag-export specs. If the workspace ORM was already built, it is rebuilt to pick up the new association table — the ORM is eagerly constructed at init time and does not see DDL changes applied after that point.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| element | str \| Table | Name of the table (str) or Table object to register as a valid dataset element type. | required |
Returns:
| Type | Description |
|---|---|
| Table | The Table object that was registered. |
Raises:
| Type | Description |
|---|---|
| DerivaMLException | If … |
| DerivaMLTableTypeError | If the table is a system or ML table and cannot be a dataset element type. |
Example
```python
ml.add_dataset_element_type("Image")  # doctest: +SKIP
```
Source code in src/deriva_ml/core/mixins/dataset.py
add_features
add_features(*args, **kwargs) -> int
Retired — use exe.add_features(records) inside an execution context.
DerivaML.add_features has been removed. Feature writes must go
through the execution context so that provenance is tracked and values
are staged for atomic upload.
Replacement::

```python
with ml.create_execution(config).execute() as exe:
    exe.add_features(records)
```
Raises:
| Type | Description |
|---|---|
| DerivaMLException | Always. Points at the replacement API. |
Source code in src/deriva_ml/core/mixins/feature.py
add_files
```python
add_files(
    files: Iterable[FileSpec],
    execution_rid: RID,
    dataset_types: str | list[str] | None = None,
    description: str = "",
) -> "Dataset"
```
Adds files to the catalog with their metadata.
Registers files in the catalog along with their metadata (MD5, length, URL) and associates them with specified file types. Links files to the specified execution record for provenance tracking.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| files | Iterable[FileSpec] | File specifications containing MD5 checksum, length, and URL. | required |
| execution_rid | RID | Execution RID to associate files with (required for provenance). | required |
| dataset_types | str \| list[str] \| None | One or more dataset type terms from the File_Type vocabulary. | None |
| description | str | Description of the files. | '' |
Returns:
| Name | Type | Description |
|---|---|---|
| Dataset | Dataset | Dataset that represents the newly added files. |
Raises:
| Type | Description |
|---|---|
| DerivaMLException | If file_types are invalid or execution_rid is not an execution record. |
Examples:
Add files via an execution:
```python
>>> with ml.create_execution(config) as exe:  # doctest: +SKIP
...     files = [FileSpec(url="path/to/file.txt", md5="abc123", length=1000)]
...     dataset = exe.add_files(files, dataset_types="text")
```
Source code in src/deriva_ml/core/mixins/file.py
add_term
```python
add_term(
    table: str | Table,
    term_name: str,
    description: str,
    synonyms: list[str] | None = None,
    exists_ok: bool = True,
) -> VocabularyTermHandle
```
Adds a term to a vocabulary table.
Creates a new standardized term with description and optional synonyms in a vocabulary table. Can either create a new term or return an existing one if it already exists.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| table | str \| Table | Vocabulary table to add the term to (name or Table object). | required |
| term_name | str | Primary name of the term (must be unique within the vocabulary). | required |
| description | str | Explanation of the term's meaning and usage. | required |
| synonyms | list[str] \| None | Alternative names for the term. | None |
| exists_ok | bool | If True, return the existing term if found. If False, raise an error. | True |
Returns:
| Name | Type | Description |
|---|---|---|
| VocabularyTermHandle | VocabularyTermHandle | Object representing the created or existing term, with methods to modify it in the catalog. |
Raises:
| Type | Description |
|---|---|
| DerivaMLException | If the term exists and exists_ok=False, or if the table is not a vocabulary table. |
Examples:
Add a new tissue type:
```python
>>> term = ml.add_term(  # doctest: +SKIP
...     table="tissue_types",
...     term_name="epithelial",
...     description="Epithelial tissue type",
...     synonyms=["epithelium"]
... )
>>> # Modify the term
>>> term.description = "Updated description"
>>> term.synonyms = ("epithelium", "epithelial_tissue")
```
Attempt to add an existing term:
```python
>>> term = ml.add_term("tissue_types", "epithelial", "...", exists_ok=True)  # doctest: +SKIP
```
Source code in src/deriva_ml/core/mixins/vocabulary.py
add_visible_column
```python
add_visible_column(
    table: str | Table,
    context: str,
    column: str | list[str] | dict[str, Any],
    position: int | None = None,
) -> list[Any]
```
Add a column to the visible-columns list for a specific context.
Convenience method for adding columns without replacing the entire
visible-columns annotation. Changes are staged until
apply_annotations() is called.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| table | str \| Table | Table name (str) or Table object. | required |
| context | str | The context to modify (e.g., compact, detailed). | required |
| column | str \| list[str] \| dict[str, Any] | Column to add: a column name (str, e.g., "Description"), a foreign-key path (list), or a pseudo-column definition (dict). | required |
| position | int \| None | Position to insert at (0-indexed). If None, the column is appended to the end. | None |
Returns:
| Type | Description |
|---|---|
| list[Any] | The updated column list for the context. |
Raises:
| Type | Description |
|---|---|
| DerivaMLTableTypeError | If … |
| DerivaMLException | If … |
Example
```python
ml.add_visible_column("Image", "compact", "Description")  # doctest: +SKIP
ml.add_visible_column("Image", "detailed", ["domain", "Image_Subject_fkey"], 1)  # doctest: +SKIP
ml.apply_annotations()  # doctest: +SKIP
```
Source code in src/deriva_ml/core/mixins/annotation.py
add_visible_foreign_key
```python
add_visible_foreign_key(
    table: str | Table,
    context: str,
    foreign_key: list[str] | dict[str, Any],
    position: int | None = None,
) -> list[Any]
```
Add a foreign key to the visible-foreign-keys list for a specific context.
Convenience method for adding related tables without replacing the
entire visible-foreign-keys annotation. Changes are staged until
apply_annotations() is called.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| table | str \| Table | Table name (str) or Table object. | required |
| context | str | The context to modify (e.g., compact, detailed). | required |
| foreign_key | list[str] \| dict[str, Any] | Foreign key to add: an inbound FK reference as a [schema, constraint_name] list, or a pseudo-column definition (dict). | required |
| position | int \| None | Position to insert at (0-indexed). If None, the foreign key is appended to the end. | None |
Returns:
| Type | Description |
|---|---|
| list[Any] | The updated foreign key list for the context. |
Raises:
| Type | Description |
|---|---|
| DerivaMLTableTypeError | If … |
| DerivaMLException | If … |
Example
```python
ml.add_visible_foreign_key("Subject", "detailed", ["domain", "Image_Subject_fkey"])  # doctest: +SKIP
ml.apply_annotations()  # doctest: +SKIP
```
Source code in src/deriva_ml/core/mixins/annotation.py
apply_annotations
apply_annotations() -> None
Apply all staged annotation changes to the catalog.
Pushes any in-memory annotation changes to the live catalog. Must
be called after any sequence of set_* or add_*/remove_*
annotation calls to make changes visible in Chaise.
Raises:
| Type | Description |
|---|---|
| DerivaMLException | If the catalog is read-only or the apply call fails. |
Example
```python
ml.set_display_annotation("Image", {"name": "Scan"})  # doctest: +SKIP
ml.apply_annotations()  # doctest: +SKIP
```
Source code in src/deriva_ml/core/mixins/annotation.py
apply_catalog_annotations
```python
apply_catalog_annotations(
    navbar_brand_text: str = "ML Data Browser",
    head_title: str = "Catalog ML",
) -> None
```
Apply catalog-level annotations including the navigation bar and display settings.
This method configures the Chaise web interface for the catalog. Chaise is Deriva's web-based data browser that provides a user-friendly interface for exploring and managing catalog data. This method sets up annotations that control how Chaise displays and organizes the catalog.
Navigation Bar Structure: The method creates a navigation bar with the following menus:
- User Info: Links to Users, Groups, and RID Lease tables
- Deriva-ML: Core ML tables (Workflow, Execution, Dataset, Dataset_Version, etc.)
- WWW: Web content tables (Page, File)
- {Domain Schema}: All domain-specific tables (excludes vocabularies and associations)
- Vocabulary: All controlled vocabulary tables from both ML and domain schemas
- Assets: All asset tables from both ML and domain schemas
- Features: All feature tables with entries named "TableName:FeatureName"
- Catalog Registry: Link to the ermrest registry
- Documentation: Links to ML notebook instructions and Deriva-ML docs

Display Settings:
- Underscores in table/column names displayed as spaces
- System columns (RID) shown in compact and entry views
- Default table set to Dataset
- Faceting and record deletion enabled
- Export configurations available to all users
Bulk Upload Configuration: Configures upload patterns for asset tables, enabling drag-and-drop file uploads through the Chaise interface.
Call this after creating the domain schema and all tables to initialize the catalog's web interface. The navigation menus are dynamically built based on the current schema structure, automatically organizing tables into appropriate categories.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| navbar_brand_text | str | Text displayed in the navigation bar brand area. | 'ML Data Browser' |
| head_title | str | Title displayed in the browser tab. | 'Catalog ML' |
Example
```python
ml = DerivaML('deriva.example.org', 'my_catalog')
# After creating domain schema and tables...
ml.apply_catalog_annotations()
# Or with custom branding:
ml.apply_catalog_annotations("My Project Browser", "My ML Project")
```
Source code in src/deriva_ml/core/base.py
asset_record_class
```python
asset_record_class(asset_table_name: str) -> type
```
Create a dynamically generated Pydantic model for an asset table's metadata.
The returned class is a subclass of AssetRecord with fields derived from the asset table's metadata columns (non-system, non-standard-asset columns). Fields are typed according to their database column type, and nullable columns are Optional.
Follows the same pattern as Feature.feature_record_class().
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| asset_table_name | str | Name of the asset table (e.g., "Image", "Model"). | required |
Returns:
| Type | Description |
|---|---|
| type | An AssetRecord subclass with validated fields matching the table's metadata. |
Example
```python
ImageAsset = ml.asset_record_class("Image")  # doctest: +SKIP
record = ImageAsset(Subject="2-DEF", Acquisition_Date="2026-01-15")  # doctest: +SKIP
path = exe.asset_file_path("Image", "scan.jpg", metadata=record)  # doctest: +SKIP
```
Source code in src/deriva_ml/core/mixins/asset.py
bag_info
bag_info(
dataset: "DatasetSpec",
) -> dict[str, Any]
Get comprehensive info about a dataset bag: size, contents, and cache status.
Combines the size estimate with local cache status. Use this to decide whether to prefetch a bag before running an experiment.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dataset | DatasetSpec | Specification of the dataset, including version and optional exclude_tables. | required |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Dict with keys: tables (mapping table name to {row_count, is_asset, asset_bytes}); total_rows; total_asset_bytes; total_asset_size; cache_status (one of "not_cached", "cached_metadata_only", "cached_materialized", "cached_incomplete"); cache_path (local path to the cached bag if cached, else None). |
Source code in src/deriva_ml/core/mixins/dataset.py
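The cache_status values above can drive a simple prefetch decision. A minimal sketch using an illustrative result dict — the keys follow the documented return shape, but the values are made up, not from a real catalog:

```python
# Hypothetical bag_info() result; with a connected DerivaML instance this
# dict would come from ml.bag_info(dataset) instead.
info = {
    "total_rows": 12_000,
    "total_asset_bytes": 3_500_000_000,
    "cache_status": "cached_metadata_only",
    "cache_path": "/tmp/deriva-cache/1-ABC",
}

# Prefetch whenever asset files are not fully materialized locally.
needs_prefetch = info["cache_status"] in (
    "not_cached",
    "cached_metadata_only",
    "cached_incomplete",
)
print(needs_prefetch)  # True for this example
```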
cache_dataset
cache_dataset(
dataset: "DatasetSpec",
materialize: bool = True,
) -> dict[str, Any]
Download a dataset bag into the local cache without creating an execution.
Use this to warm the cache before running experiments. No execution or provenance records are created.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dataset | DatasetSpec | Specification of the dataset, including version and optional exclude_tables. | required |
| materialize | bool | If True (default), download all asset files. If False, download only table metadata. | True |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Dict with bag_info results after caching. |
Source code in src/deriva_ml/core/mixins/dataset.py
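One way to combine bag_info and cache_dataset is to skip asset materialization for very large bags. A sketch under assumed values — the info dict and the 10 GB threshold are illustrative, and the final call requires a connected DerivaML instance:

```python
# Illustrative bag_info()-style result (values made up, not from a catalog).
info = {"total_asset_bytes": 50_000_000_000, "cache_status": "not_cached"}

# Assumed policy: fully materialize only bags under 10 GB of assets.
THRESHOLD_BYTES = 10_000_000_000
materialize = info["total_asset_bytes"] < THRESHOLD_BYTES
print(materialize)  # False: metadata-only caching for this large bag

# With a real connection and a DatasetSpec `spec`, one would then call:
# ml.cache_dataset(spec, materialize=materialize)
```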
cache_table
cache_table(
table_name: str, force: bool = False
) -> "pd.DataFrame"
Fetch a table from the catalog and cache locally as SQLite.
On first call, fetches all rows from the catalog and stores in the
working data cache. Subsequent calls return the cached data without
contacting the catalog. Use force=True to re-fetch.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| table_name | str | Name of the table to fetch (e.g., "Subject", "Image"). | required |
| force | bool | If True, re-fetch even if already cached. | False |
Returns:
| Type | Description |
|---|---|
| pd.DataFrame | DataFrame with the table contents. |
Example::
subjects = ml.cache_table("Subject")
print(f"{len(subjects)} subjects")
# Second call returns cached data instantly
subjects = ml.cache_table("Subject")
Source code in src/deriva_ml/core/base.py
catalog_snapshot
catalog_snapshot(
version_snapshot: str,
) -> Self
Return a new DerivaML instance connected to a specific catalog snapshot.
Catalog snapshots provide a read-only, point-in-time view of the catalog. The snapshot identifier is typically obtained from a dataset version record.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| version_snapshot | str | Snapshot identifier string (e.g., …). | required |
Returns:
| Type | Description |
|---|---|
| Self | A new DerivaML instance connected to the specified catalog snapshot. |
Source code in src/deriva_ml/core/base.py
chaise_url
chaise_url(
table: RID | Table | str,
) -> str
Generates a Chaise web interface URL.
Chaise is Deriva's web interface for data exploration. This method creates a URL that directly links to the specified table or record.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| table | RID \| Table \| str | Table to generate a URL for (name, Table object, or RID). | required |
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | URL in format: https://{host}/chaise/recordset/#{catalog}/{schema}:{table} |
Raises:
| Type | Description |
|---|---|
| DerivaMLException | If the table or RID cannot be found. |
Examples:
Using a table name:
>>> ml.chaise_url("experiment_table")
'https://deriva.org/chaise/recordset/#1/schema:experiment_table'
Using a RID:
>>> ml.chaise_url("1-abc123")
Source code in src/deriva_ml/core/base.py
cite
cite(
entity: Dict[str, Any] | str,
current: bool = False,
) -> str
Generates citation URL for an entity.
Creates a URL that can be used to reference a specific entity in the catalog. By default, includes the catalog snapshot time to ensure version stability (permanent citation). With current=True, returns a URL to the current state.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| entity | Dict[str, Any] \| str | Either a RID string or a dictionary containing entity data with a 'RID' key. | required |
| current | bool | If True, return a URL to the current catalog state (no snapshot). If False (default), return a permanent citation URL with snapshot time. | False |
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | Citation URL. Format depends on the current flag. |
Raises:
| Type | Description |
|---|---|
| DerivaMLException | If the entity doesn't exist or lacks a RID. |
Examples:
Permanent citation (default):
>>> url = ml.cite("1-abc123")
>>> print(url)
'https://deriva.org/id/1/1-abc123@2024-01-01T12:00:00'
Current catalog URL:
>>> url = ml.cite("1-abc123", current=True)
>>> print(url)
'https://deriva.org/id/1/1-abc123'
Using a dictionary:
>>> url = ml.cite({"RID": "1-abc123"})
Source code in src/deriva_ml/core/base.py
clean_execution_dirs
clean_execution_dirs(
    older_than_days: int | None = None,
    exclude_rids: list[str] | None = None,
) -> dict[str, int]
Clean up execution working directories.
Removes execution output directories from the local working directory. Use this to free up disk space from completed or orphaned executions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| older_than_days | int \| None | If provided, only remove directories older than this many days. If None, removes all execution directories (except excluded). | None |
| exclude_rids | list[str] \| None | List of execution RIDs to preserve (never remove). | None |
Returns:
| Type | Description |
|---|---|
| dict[str, int] | Dict with keys: 'dirs_removed' (number of directories removed); 'bytes_freed' (total bytes freed); 'errors' (number of removal errors). |
Example
>>> ml = DerivaML('deriva.example.org', 'my_catalog')
>>> # Clean all execution dirs older than 30 days
>>> result = ml.clean_execution_dirs(older_than_days=30)
>>> print(f"Freed {result['bytes_freed'] / 1e9:.2f} GB")
>>> # Clean all except specific executions
>>> result = ml.clean_execution_dirs(exclude_rids=['1-ABC', '1-DEF'])
Source code in src/deriva_ml/core/base.py
clear_cache
clear_cache(
older_than_days: int | None = None,
) -> dict[str, int]
Clear the dataset cache directory.
Removes cached dataset bags from the cache directory. Can optionally filter by age to only remove old cache entries.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| older_than_days | int \| None | If provided, only remove cache entries older than this many days. If None, removes all cache entries. | None |
Returns:
| Type | Description |
|---|---|
| dict[str, int] | Dict with keys: 'files_removed' (number of files removed); 'dirs_removed' (number of directories removed); 'bytes_freed' (total bytes freed); 'errors' (number of removal errors). |
Example
>>> ml = DerivaML('deriva.example.org', 'my_catalog')
>>> # Clear the entire cache
>>> result = ml.clear_cache()
>>> print(f"Freed {result['bytes_freed'] / 1e6:.1f} MB")
>>> # Clear cache entries older than 7 days
>>> result = ml.clear_cache(older_than_days=7)
Source code in src/deriva_ml/core/base.py
clear_vocabulary_cache
clear_vocabulary_cache(
table: str | Table | None = None,
) -> None
Clear the vocabulary term cache.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| table | str \| Table \| None | If provided, only clear the cache for this specific vocabulary table. If None, clear the entire cache. | None |
Source code in src/deriva_ml/core/mixins/vocabulary.py
create_asset
create_asset(
    asset_name: str,
    column_defs: Iterable[ColumnDefinition] | None = None,
    fkey_defs: Iterable[ColumnDefinition] | None = None,
    referenced_tables: Iterable[Table] | None = None,
    comment: str = "",
    schema: str | None = None,
    update_navbar: bool = True,
) -> Table
Create a new asset table in the catalog.
Defines a Chaise-compatible asset table (Filename, URL, Length, MD5, Description, plus system columns) with optional additional metadata columns and foreign-key references. Registers the asset type in the Asset_Type vocabulary and optionally updates the Chaise navigation bar.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| asset_name | str | Name for the new asset table (e.g., "Image"). | required |
| column_defs | Iterable[ColumnDefinition] \| None | Extra metadata columns beyond the standard asset columns. Each is a ColumnDefinition. | None |
| fkey_defs | Iterable[ColumnDefinition] \| None | Foreign-key definitions from the asset table to other tables (e.g., linking images to a subject table). | None |
| referenced_tables | Iterable[Table] \| None | Tables that the new asset table should reference via FKs. Convenience alternative to fkey_defs. | None |
| comment | str | Human-readable description of the asset table, stored as the table comment in the catalog. | '' |
| schema | str \| None | Schema in which to create the table. Defaults to the domain schema. | None |
| update_navbar | bool | If True (default), updates the Chaise navigation bar to include the new asset table. | True |
Returns:
| Type | Description |
|---|---|
| Table | The newly created asset Table object. |
Raises:
| Type | Description |
|---|---|
| DerivaMLException | If a table named asset_name already exists. |
| DerivaMLSchemaError | If the target schema is invalid. |
Example
>>> from deriva.core.typed import Column, builtin_types
>>> ml.create_asset(
...     "ScanImage",
...     comment="MRI scan images",
... )
Source code in src/deriva_ml/core/mixins/asset.py
create_execution
create_execution(
configuration: "ExecutionConfiguration | None" = None,
*,
datasets: "list[DatasetSpec | str] | None" = None,
assets: "list[AssetSpec | str] | None" = None,
workflow: "Workflow | RID | str | None" = None,
description: "str | None" = None,
dry_run: bool = False,
) -> "Execution"
Create an execution environment.
Initializes a local compute environment for executing an ML or analytic routine. Accepts either a pre-built ExecutionConfiguration (the config-object form) or individual keyword arguments that the method assembles into an ExecutionConfiguration (the kwargs form). Mixing the two forms raises a TypeError; pick one.
Creating executions requires online mode because the Execution RID is server-assigned.
Side effects:
- Downloads datasets specified in the configuration to the cache directory. If no version is specified, creates a new minor version for the dataset.
- Downloads any execution assets to the working directory.
- Creates an execution record in the catalog (unless dry_run=True).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| configuration | ExecutionConfiguration \| None | A pre-built ExecutionConfiguration. If this is provided, all of the kwargs below (except dry_run) must be omitted. | None |
| datasets | list[DatasetSpec \| str] \| None | Kwargs form only. List of DatasetSpec objects or dataset identifier strings to download for the execution. | None |
| assets | list[AssetSpec \| str] \| None | Kwargs form only. List of AssetSpec objects or asset RID strings to download for the execution. | None |
| workflow | Workflow \| RID \| str \| None | Kwargs form only. A Workflow object or the RID of an existing workflow record. | None |
| description | str \| None | Kwargs form only. Human-readable description of the execution. | None |
| dry_run | bool | If True, skip creating catalog records and uploading results. | False |
Returns:
| Name | Type | Description |
|---|---|---|
| Execution | Execution | An Execution object for managing the execution lifecycle. |
Raises:
| Type | Description |
|---|---|
| TypeError | If both a configuration object and individual kwargs are provided. |
| DerivaMLOfflineError | If the current connection mode is offline; creating executions requires online mode. |
Example
Config-object form::
>>> config = ExecutionConfiguration( # doctest: +SKIP
... workflow=workflow,
... description="Process samples",
... datasets=[DatasetSpec(rid="4HM", version="1.0.0")],
... )
>>> with ml.create_execution(config) as execution:
... # Run analysis
... pass
>>> execution.upload_execution_outputs()
Kwargs form (equivalent)::
>>> with ml.create_execution( # doctest: +SKIP
... datasets=["4HM@1.0.0"],
... workflow=workflow,
... description="Process samples",
... ) as execution:
... # Run analysis
... pass
Source code in src/deriva_ml/core/mixins/execution.py
create_feature
create_feature(
    target_table: Table | str,
    feature_name: str,
    terms: list[Table | str] | None = None,
    assets: list[Table | str] | None = None,
    metadata: list[ColumnDefinition | Table | Key | str] | None = None,
    optional: list[str] | None = None,
    comment: str = "",
    update_navbar: bool = True,
) -> type[FeatureRecord]
Creates a new feature definition.
A feature represents a measurable property or characteristic that can be associated with records in the target table. Features can include vocabulary terms, asset references, and additional metadata.
Side Effects: This method dynamically creates:
1. A new association table in the domain schema to store feature values
2. A Pydantic model class (subclass of FeatureRecord) for creating validated feature instances
The returned Pydantic model class provides type-safe construction of feature records with automatic validation of values against the feature's definition (vocabulary terms, asset references, etc.). Use this class to create feature instances that can be inserted into the catalog.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| target_table | Table \| str | Table to associate the feature with (name or Table object). | required |
| feature_name | str | Unique name for the feature within the target table. | required |
| terms | list[Table \| str] \| None | Optional vocabulary tables/names whose terms can be used as feature values. | None |
| assets | list[Table \| str] \| None | Optional asset tables/names that can be referenced by this feature. | None |
| metadata | list[ColumnDefinition \| Table \| Key \| str] \| None | Optional columns, tables, or keys to include in the feature definition. | None |
| optional | list[str] \| None | Column names that are not required when creating feature instances. | None |
| comment | str | Description of the feature's purpose and usage. | '' |
| update_navbar | bool | If True (default), automatically updates the navigation bar to include the new feature table. Set to False during batch feature creation to avoid redundant updates, then call apply_catalog_annotations() once at the end. | True |
Returns:
| Type | Description |
|---|---|
| type[FeatureRecord] | A dynamically generated Pydantic model class for creating validated feature instances. The class has fields corresponding to the feature's terms, assets, and metadata columns. |
Raises:
| Type | Description |
|---|---|
| DerivaMLException | If the feature definition is invalid or conflicts with existing features. |
Examples:
Create a feature with a confidence score:
>>> DiagnosisFeature = ml.create_feature(
...     target_table="Image",
...     feature_name="Diagnosis",
...     terms=["Diagnosis_Type"],
...     metadata=[ColumnDefinition(name="confidence", type=BuiltinTypes.float4)],
...     comment="Clinical diagnosis label"
... )
>>> # Use the returned class to create validated feature instances
>>> record = DiagnosisFeature(
...     Image="1-ABC",  # Target record RID
...     Diagnosis_Type="Normal",  # Vocabulary term
...     confidence=0.95,
...     Execution="2-XYZ"  # Execution that produced this value
... )
Source code in src/deriva_ml/core/mixins/feature.py
create_table
create_table(
table: TableDefinition,
schema: str | None = None,
update_navbar: bool = True,
) -> Table
Creates a new table in the domain schema.
Creates a table using the provided TableDefinition object, which specifies the table structure including columns, keys, and foreign key relationships. The table is created in the domain schema associated with this DerivaML instance.
Required Classes: Import the following classes from deriva_ml to define tables:
- TableDefinition: Defines the complete table structure
- ColumnDefinition: Defines individual columns with types and constraints
- KeyDefinition: Defines unique key constraints (optional)
- ForeignKeyDefinition: Defines foreign key relationships to other tables (optional)
- BuiltinTypes: Enum of available column data types
Available Column Types (BuiltinTypes enum):
text, int2, int4, int8, float4, float8, boolean,
date, timestamp, timestamptz, json, jsonb, markdown,
ermrest_uri, ermrest_rid, ermrest_rcb, ermrest_rmb,
ermrest_rct, ermrest_rmt
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| table | TableDefinition | A TableDefinition object containing the complete specification of the table to create. | required |
| schema | str \| None | Schema in which to create the table. If None, uses the domain schema. | None |
| update_navbar | bool | If True (default), automatically updates the navigation bar to include the new table. Set to False during batch table creation to avoid redundant updates, then call apply_catalog_annotations() once at the end. | True |
Returns:
| Name | Type | Description |
|---|---|---|
| Table | Table | The newly created ERMRest table object. |
Raises:
| Type | Description |
|---|---|
| DerivaMLException | If table creation fails or the definition is invalid. |
Examples:
Simple table with basic columns:
>>> from deriva_ml import TableDefinition, ColumnDefinition, BuiltinTypes
>>>
>>> table_def = TableDefinition(
... name="Experiment",
... column_defs=[
... ColumnDefinition(name="Name", type=BuiltinTypes.text, nullok=False),
... ColumnDefinition(name="Date", type=BuiltinTypes.date),
... ColumnDefinition(name="Description", type=BuiltinTypes.markdown),
... ColumnDefinition(name="Score", type=BuiltinTypes.float4),
... ],
... comment="Records of experimental runs"
... )
>>> experiment_table = ml.create_table(table_def)
Table with foreign key to another table:
>>> from deriva_ml import (
... TableDefinition, ColumnDefinition, ForeignKeyDefinition, BuiltinTypes
... )
>>>
>>> # Create a Sample table that references Subject
>>> sample_def = TableDefinition(
... name="Sample",
... column_defs=[
... ColumnDefinition(name="Name", type=BuiltinTypes.text, nullok=False),
... ColumnDefinition(name="Subject", type=BuiltinTypes.text, nullok=False),
... ColumnDefinition(name="Collection_Date", type=BuiltinTypes.date),
... ],
... fkey_defs=[
... ForeignKeyDefinition(
... colnames=["Subject"],
... pk_sname=ml.default_schema, # Schema of referenced table
... pk_tname="Subject", # Name of referenced table
... pk_colnames=["RID"], # Column(s) in referenced table
... on_delete="CASCADE", # Delete samples when subject deleted
... )
... ],
... comment="Biological samples collected from subjects"
... )
>>> sample_table = ml.create_table(sample_def)
Table with unique key constraint:
>>> from deriva_ml import (
... TableDefinition, ColumnDefinition, KeyDefinition, BuiltinTypes
... )
>>>
>>> protocol_def = TableDefinition(
... name="Protocol",
... column_defs=[
... ColumnDefinition(name="Name", type=BuiltinTypes.text, nullok=False),
... ColumnDefinition(name="Version", type=BuiltinTypes.text, nullok=False),
... ColumnDefinition(name="Description", type=BuiltinTypes.markdown),
... ],
... key_defs=[
... KeyDefinition(
... colnames=["Name", "Version"],
... constraint_names=[["myschema", "Protocol_Name_Version_key"]],
... comment="Each protocol name+version must be unique"
... )
... ],
... comment="Experimental protocols with versioning"
... )
>>> protocol_table = ml.create_table(protocol_def)
Batch creation without navbar updates:
>>> ml.create_table(table1_def, update_navbar=False)
>>> ml.create_table(table2_def, update_navbar=False)
>>> ml.create_table(table3_def, update_navbar=False)
>>> ml.apply_catalog_annotations() # Update navbar once at the end
Source code in src/deriva_ml/core/base.py
create_vocabulary
create_vocabulary(
vocab_name: str,
comment: str = "",
schema: str | None = None,
update_navbar: bool = True,
) -> Table
Creates a controlled vocabulary table.
A controlled vocabulary table maintains a list of standardized terms and their definitions. Each term can have synonyms and descriptions to ensure consistent terminology usage across the dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| vocab_name | str | Name for the new vocabulary table. Must be a valid SQL identifier. | required |
| comment | str | Description of the vocabulary's purpose and usage. Defaults to an empty string. | '' |
| schema | str \| None | Schema name to create the table in. If None, uses the domain schema. | None |
| update_navbar | bool | If True (default), automatically updates the navigation bar to include the new vocabulary table. Set to False during batch table creation to avoid redundant updates, then call apply_catalog_annotations() once at the end. | True |
Returns:
| Name | Type | Description |
|---|---|---|
| Table | Table | ERMRest table object representing the newly created vocabulary table. |
Raises:
| Type | Description |
|---|---|
| DerivaMLException | If vocab_name is invalid or already exists. |
Examples:
Create a vocabulary for tissue types:
>>> table = ml.create_vocabulary(
... vocab_name="tissue_types",
... comment="Standard tissue classifications",
... schema="bio_schema"
... )
Create multiple vocabularies without updating navbar until the end:
>>> ml.create_vocabulary("Species", update_navbar=False)
>>> ml.create_vocabulary("Tissue_Type", update_navbar=False)
>>> ml.apply_catalog_annotations() # Update navbar once
Source code in src/deriva_ml/core/base.py
create_workflow
create_workflow(
name: str,
workflow_type: str | list[str],
description: str = "",
) -> Workflow
Creates a new workflow definition.
Creates a Workflow object that represents a computational process or analysis pipeline. The workflow type(s) must be terms from the controlled vocabulary. This method is typically used to define new analysis workflows before execution.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Name of the workflow. | required |
| workflow_type | str \| list[str] | Type(s) of workflow; each must exist in the workflow_type vocabulary. Can be a single string or a list of strings. | required |
| description | str | Description of what the workflow does. | '' |
Returns:
| Name | Type | Description |
|---|---|---|
| Workflow | Workflow | New workflow object ready for registration. |
Raises:
| Type | Description |
|---|---|
| DerivaMLException | If any workflow_type is not in the vocabulary. |
Examples:
>>> workflow = ml.create_workflow(
... name="RNA Analysis",
... workflow_type="python_notebook",
... description="RNA sequence analysis pipeline"
... )
>>> rid = ml._add_workflow(workflow)
Multiple types::
>>> workflow = ml.create_workflow(
... name="Training Pipeline",
... workflow_type=["Training", "Embedding"],
... description="Combined training and embedding pipeline"
... )
Source code in src/deriva_ml/core/mixins/workflow.py
define_association
define_association(
associates: list,
metadata: list | None = None,
table_name: str | None = None,
comment: str | None = None,
**kwargs,
) -> dict
Build an association table definition with vocab-aware key selection.
Creates a table definition that links two or more tables via an association (many-to-many) table. Non-vocabulary tables automatically use RID as the foreign key target, while vocabulary tables use their Name key.
Use with create_table() to create the association table in the catalog.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| associates | list | Tables to associate. Each item can be: a Table object; a (name, Table) tuple to customize the column name; a (name, nullok, Table) tuple for nullable references; or a Key object for explicit key selection. | required |
| metadata | list \| None | Additional metadata columns or reference targets. | None |
| table_name | str \| None | Name for the association table. Auto-generated if omitted. | None |
| comment | str \| None | Comment for the association table. | None |
| **kwargs | | Additional arguments passed to Table.define_association. | {} |
Returns:
| Type | Description |
|---|---|
| dict | Table definition dict suitable for create_table(). |
Example::
# Associate Image with Subject (many-to-many)
image_table = ml.model.name_to_table("Image")
subject_table = ml.model.name_to_table("Subject")
assoc_def = ml.define_association(
associates=[image_table, subject_table],
comment="Links images to subjects",
)
ml.create_table(assoc_def)
Source code in src/deriva_ml/core/base.py
delete_dataset
delete_dataset(
dataset: "Dataset",
recurse: bool = False,
) -> None
Soft-delete a dataset by marking it as deleted in the catalog.
Sets the Deleted flag on the dataset record. The dataset's data is
preserved but it will no longer appear in normal queries (e.g.,
find_datasets()). The dataset cannot be deleted if it is currently
nested inside a parent dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dataset | Dataset | The dataset to delete. | required |
| recurse | bool | If True, also soft-delete all nested child datasets. If False (default), only this dataset is marked as deleted. | False |
Raises:
| Type | Description |
|---|---|
| DerivaMLException | If the dataset RID is not a valid dataset, or if the dataset is nested inside a parent dataset. |
Example
>>> ds = ml.lookup_dataset("1-ABC")
>>> ml.delete_dataset(ds, recurse=False)
Source code in src/deriva_ml/core/mixins/dataset.py
delete_feature
delete_feature(
table: Table | str,
feature_name: str,
) -> bool
Removes a feature definition and its data.
Deletes the feature and its implementation table from the catalog. This operation cannot be undone and will remove all feature values associated with this feature.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| table | Table \| str | The table containing the feature, either as a name or Table object. | required |
| feature_name | str | Name of the feature to delete. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| bool | bool | True if the feature was successfully deleted, False if it didn't exist. |
Raises:
| Type | Description |
|---|---|
| DerivaMLException | If deletion fails due to constraints or permissions. |
Example
>>> success = ml.delete_feature("samples", "obsolete_feature")
>>> print("Deleted" if success else "Not found")
Source code in src/deriva_ml/core/mixins/feature.py
delete_term
delete_term(
table: str | Table, term_name: str
) -> None
Delete a term from a vocabulary table.
Removes a term from the vocabulary. The term must not be in use by any records in the catalog (e.g., no datasets using this dataset type, no assets using this asset type).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| table | str \| Table | Vocabulary table containing the term (name or Table object). | required |
| term_name | str | Primary name of the term to delete. | required |
Raises:
| Type | Description |
|---|---|
| DerivaMLInvalidTerm | If the term doesn't exist in the vocabulary. |
| DerivaMLException | If the term is currently in use by other records. |
Example
>>> ml.delete_term("Dataset_Type", "Obsolete_Type")
Source code in src/deriva_ml/core/mixins/vocabulary.py
diff_schema
diff_schema() -> 'SchemaDiff'
Return the structural diff between the cached and live schemas.
Online mode only. Fetches the live catalog's /schema payload, compares it against the cached copy with deriva_ml.core.schema_diff._compute_diff, and returns the result. The returned SchemaDiff may be empty (no drift); callers should check diff.is_empty() rather than truthiness.
Unlike pin_schema, this method never modifies the cache and never logs a warning; it is a pure inspection operation.
Returns:

| Name | Type | Description |
|---|---|---|
| `SchemaDiff` | `SchemaDiff` | The computed structural diff; may be empty if there is no drift. |

Raises:

| Type | Description |
|---|---|
| `DerivaMLReadOnlyError` | If called in offline mode. |
| `FileNotFoundError` | If the workspace has no cache file. |
Source code in src/deriva_ml/core/base.py
download_dataset_bag
download_dataset_bag(
dataset: DatasetSpec,
) -> "DatasetBag"
Downloads a dataset to the local filesystem.
Downloads a dataset specified by DatasetSpec to the local filesystem. If the catalog has s3_bucket configured and use_minid is enabled, the bag will be uploaded to S3 and registered with the MINID service.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dataset` | `DatasetSpec` | Specification of the dataset to download, including version and materialization options. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| `DatasetBag` | `DatasetBag` | Object containing: `path` (local filesystem path to the downloaded dataset), `rid` (dataset's Resource Identifier), and `minid` (dataset's Minimal Viable Identifier, if MINID is enabled). |

Note

MINID support requires s3_bucket to be configured when creating the DerivaML instance. The catalog's use_minid setting controls whether MINIDs are created.

Examples:

Download with default options::

    >>> spec = DatasetSpec(rid="1-abc123")  # doctest: +SKIP
    >>> bag = ml.download_dataset_bag(dataset=spec)  # doctest: +SKIP
    >>> print(f"Downloaded to {bag.path}")  # doctest: +SKIP
Source code in src/deriva_ml/core/mixins/dataset.py
download_dir
download_dir(
cached: bool = False,
) -> Path
Returns the appropriate download directory.
Provides the appropriate directory path for storing downloaded files, either in the cache or working directory.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `cached` | `bool` | If True, returns the cache directory path. If False, returns the working directory path. | `False` |

Returns:

| Name | Type | Description |
|---|---|---|
| `Path` | `Path` | Directory path where downloaded files should be stored. |

Example::

    >>> cache_dir = ml.download_dir(cached=True)
    >>> work_dir = ml.download_dir(cached=False)
Source code in src/deriva_ml/core/base.py
estimate_bag_size
estimate_bag_size(
dataset: "DatasetSpec",
) -> dict[str, Any]
Estimate the size of a dataset bag before downloading.
Generates the same download specification used by download_dataset_bag, then runs COUNT and SUM(Length) queries against the snapshot catalog to preview what a download will contain and how large it will be.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dataset` | `DatasetSpec` | Specification of the dataset to estimate, including version and optional exclude_tables. | required |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Dict with keys: `tables` (dict mapping table name to `{row_count, is_asset, asset_bytes}`), `total_rows` (total row count across all tables), `total_asset_bytes` (total size of asset files in bytes), and `total_asset_size` (human-readable size string, e.g. "1.2 GB"). |
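Because the return value is plain data, a small helper can turn it into a pre-download prompt. This sketch assumes only the documented keys above; `summarize_bag_estimate` is a hypothetical helper, not part of DerivaML.

```python
from typing import Any


def summarize_bag_estimate(info: dict[str, Any]) -> str:
    """Render the documented return shape of estimate_bag_size() as one line."""
    # is_asset is part of each per-table entry, per the documented shape.
    asset_tables = [name for name, t in info["tables"].items() if t["is_asset"]]
    return (
        f"{info['total_rows']} rows across {len(info['tables'])} tables "
        f"({len(asset_tables)} asset tables), {info['total_asset_size']} of assets"
    )


# With a live instance: print(summarize_bag_estimate(ml.estimate_bag_size(spec)))
```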
Source code in src/deriva_ml/core/mixins/dataset.py
estimate_denormalized_size
estimate_denormalized_size(
include_tables: list[str],
) -> dict[str, Any]
Return schema shape + catalog-wide size estimates for a denormalized table.
This is the catalog-wide analog of
:meth:Dataset.describe_denormalized. It asks "if I were to
denormalize these tables across the entire catalog (not scoped
to any specific dataset), what would the result look like and
how big would it be?" Useful for rough size estimation before
committing to a bag export.
The return shape is aligned with :meth:estimate_bag_size and
is NOT the same as the dataset-scoped 12-key plan dict from
:meth:Dataset.describe_denormalized (spec §5). Do not confuse
the two.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `include_tables` | `list[str]` | List of table names to include in the join. | required |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Dict with keys `tables`, `total_rows`, `total_asset_bytes`, and `total_asset_size`, shaped like the return of `estimate_bag_size`. |

Example::

    info = ml.estimate_denormalized_size(["Image", "Subject"])
    print(f"{info['total_rows']} rows across "
          f"{len(info['tables'])} tables, "
          f"{info['total_asset_size']} of assets")
See Also
Dataset.describe_denormalized: Dataset-scoped planning dict. Denormalizer.describe: Full dataset-scoped plan with ambiguity reporting. estimate_bag_size: Bag-level size estimation.
Source code in src/deriva_ml/core/mixins/dataset.py
feature_record_class
feature_record_class(
table: str | Table,
feature_name: str,
) -> type[FeatureRecord]
Returns a dynamically generated Pydantic model class for creating feature records.
Each feature has a unique set of columns based on its definition (terms, assets, metadata). This method returns a Pydantic class with fields corresponding to those columns, providing:
- Type validation: Values are validated against expected types (str, int, float, Path)
- Required field checking: Non-nullable columns must be provided
- Default values: Feature_Name is pre-filled with the feature's name
Field types in the generated class:
- {TargetTable} (str): Required. RID of the target record (e.g., Image RID)
- Execution (str, optional): RID of the execution for provenance tracking
- Feature_Name (str): Pre-filled with the feature name
- Term columns (str): Accept vocabulary term names
- Asset columns (str | Path): Accept asset RIDs or file paths
- Value columns: Accept values matching the column type (int, float, str)
Use lookup_feature() to inspect the feature's structure and see what columns
are available.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `str \| Table` | The table containing the feature, either as name or Table object. | required |
| `feature_name` | `str` | Name of the feature to create a record class for. | required |

Returns:

| Type | Description |
|---|---|
| `type[FeatureRecord]` | A Pydantic model class for creating validated feature records. |

Raises:

| Type | Description |
|---|---|
| `DerivaMLException` | If the feature doesn't exist or the table is invalid. |

Example::

    >>> # Get the dynamically generated class
    >>> DiagnosisFeature = ml.feature_record_class("Image", "Diagnosis")  # doctest: +SKIP
    >>> # Create a validated feature record
    >>> record = DiagnosisFeature(
    ...     Image="1-ABC",            # Target record RID
    ...     Diagnosis_Type="Normal",  # Vocabulary term
    ...     confidence=0.95,          # Metadata column
    ...     Execution="2-XYZ",        # Provenance
    ... )
    >>> # Convert to dict for insertion
    >>> record.model_dump()
    {'Image': '1-ABC', 'Diagnosis_Type': 'Normal', 'confidence': 0.95, ...}
Source code in src/deriva_ml/core/mixins/feature.py
feature_values
feature_values(
table: Table | str,
feature_name: str,
selector: Callable[
[list[FeatureRecord]],
FeatureRecord | None,
]
| None = None,
materialize_limit: int
| None = None,
execution_rids: list[str]
| None = None,
) -> Iterable[FeatureRecord]
Yield feature values for a single feature, one record per target RID.
Returns an iterator of typed FeatureRecord instances. Each record is
wide in shape — target RID, all value columns (vocab terms, asset
references, metadata columns), and provenance columns (Execution,
RCT) — exposed as typed attributes.
When a selector is provided, records are grouped by target RID and
the selector collapses each group to a single survivor. Target RIDs
whose group's selector returns None are omitted. When no selector
is provided, every raw record is yielded — multiple records per target
RID are possible.
This method has identical signatures and semantics across DerivaML,
Dataset, and DatasetBag. The bag implementation reads from a
per-feature denormalization cache populated on first access; subsequent
calls are cheap.
All rows for the feature are fetched from the catalog before the first
record is yielded — this method is iterator-shaped for composability,
not for streaming of very large feature tables. When execution_rids
is set, the catalog query is filtered server-side to those execution
RIDs only -- this is the recommended way to keep the materialization
cost bounded for cross-execution comparisons.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `Table \| str` | Target table the feature is defined on (name or Table). | required |
| `feature_name` | `str` | Name of the feature to read. | required |
| `selector` | `Callable[[list[FeatureRecord]], FeatureRecord \| None] \| None` | Optional callable that collapses each target RID's group of records to a single survivor; returning None omits that target RID. | `None` |
| `materialize_limit` | `int \| None` | Optional cap on the number of rows that may be materialized into memory. When the catalog query returns more than this many rows, raises `DerivaMLMaterializeLimitExceeded`. | `None` |
| `execution_rids` | `list[str] \| None` | Optional filter; when set, only feature rows from these execution RIDs are returned (filtered server-side). | `None` |

Returns:

| Type | Description |
|---|---|
| `Iterable[FeatureRecord]` | Iterator of `FeatureRecord` instances: one survivor per target RID after selector reduction, or all raw records if no selector. |

Raises:

| Type | Description |
|---|---|
| `DerivaMLTableNotFound` | |
| `DerivaMLException` | |
| `DerivaMLMaterializeLimitExceeded` | If the result set exceeds `materialize_limit`. |
Example

Get the newest Glaucoma label per image::

    >>> from deriva_ml.feature import FeatureRecord
    >>> for rec in ml.feature_values(
    ...     "Image", "Glaucoma", selector=FeatureRecord.select_newest,
    ... ):
    ...     print(f"{rec.Image}: {rec.Glaucoma} (by {rec.Execution})")

Filter by a specific workflow — works identically on a downloaded bag::

    >>> workflow = ml.lookup_workflow("Glaucoma_Training_v2")
    >>> sel = FeatureRecord.select_by_workflow(workflow, container=ml)
    >>> labels = [r.Glaucoma for r in ml.feature_values(
    ...     "Image", "Glaucoma", selector=sel,
    ... )]

Convert to a pandas DataFrame when needed::

    >>> import pandas as pd
    >>> df = pd.DataFrame(
    ...     r.model_dump()
    ...     for r in ml.feature_values("Image", "Glaucoma")
    ... )
Source code in src/deriva_ml/core/mixins/feature.py
fetch_table_features
fetch_table_features(*args, **kwargs)
Retired — use feature_values(table, name) or Denormalizer.
DerivaML.fetch_table_features has been removed. To read feature
values for a single feature, use the new feature_values method::
for rec in ml.feature_values("Image", "Quality"):
...
For wide-table denormalization across all features use the
Denormalizer subsystem.
Raises:

| Type | Description |
|---|---|
| `DerivaMLException` | Always. Points at the replacement API. |
Source code in src/deriva_ml/core/mixins/feature.py
find_assets
find_assets(
asset_table: Table
| str
| None = None,
asset_type: str | None = None,
) -> Iterable["Asset"]
Find assets in the catalog.
Returns an iterable of Asset objects matching the specified criteria. If no criteria are specified, returns all assets from all asset tables.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `asset_table` | `Table \| str \| None` | Optional table or table name to search. If None, searches all asset tables. | `None` |
| `asset_type` | `str \| None` | Optional asset type to filter by. Only returns assets with this type. | `None` |

Returns:

| Type | Description |
|---|---|
| `Iterable[Asset]` | Iterable of Asset objects matching the criteria. |

Example::

    >>> # Find all assets in the Model table
    >>> models = list(ml.find_assets(asset_table="Model"))  # doctest: +SKIP
    >>> # Find all assets with type "Training_Data"
    >>> training = list(ml.find_assets(asset_type="Training_Data"))  # doctest: +SKIP
    >>> # Find all assets across all tables
    >>> all_assets = list(ml.find_assets())  # doctest: +SKIP
Source code in src/deriva_ml/core/mixins/asset.py
find_datasets
find_datasets(
deleted: bool = False,
sort: SortSpec = None,
) -> Iterable["Dataset"]
List all datasets in the catalog.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `deleted` | `bool` | If True, include datasets that have been marked as deleted. | `False` |
| `sort` | `SortSpec` | Optional sort spec; pass `True` to sort newest-first. | `None` |

Returns:

| Type | Description |
|---|---|
| `Iterable[Dataset]` | Iterable of Dataset objects. |

Example::

    >>> datasets = list(ml.find_datasets())  # doctest: +SKIP
    >>> for ds in datasets:  # doctest: +SKIP
    ...     print(f"{ds.dataset_rid}: {ds.description}")

Newest-first (most common)::

    >>> recent = list(ml.find_datasets(sort=True))  # doctest: +SKIP
Source code in src/deriva_ml/core/mixins/dataset.py
find_experiments
find_experiments(
workflow_rid: RID | None = None,
status: ExecutionStatus
| None = None,
) -> Iterable["Experiment"]
List all experiments (executions with Hydra configuration) in the catalog.
Creates Experiment objects for analyzing completed ML model runs. Only returns executions that have Hydra configuration metadata (i.e., a config.yaml file in Execution_Metadata assets).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `workflow_rid` | `RID \| None` | Optional workflow RID to filter by. | `None` |
| `status` | `ExecutionStatus \| None` | Optional status to filter by (e.g., ExecutionStatus.Uploaded). | `None` |

Returns:

| Type | Description |
|---|---|
| `Iterable[Experiment]` | Iterable of Experiment objects for executions with Hydra config. |

Example::

    >>> experiments = list(ml.find_experiments(status=ExecutionStatus.Uploaded))  # doctest: +SKIP
    >>> for exp in experiments:
    ...     print(f"{exp.name}: {exp.config_choices}")
Source code in src/deriva_ml/core/mixins/execution.py
find_features
find_features(
table: str | Table | None = None,
) -> list[Feature]
Find feature definitions in the schema.
Discovers features by inspecting the catalog schema for association tables
that have Feature_Name and Execution columns. Returns Feature objects
describing each feature's structure (target table, term/asset/value columns),
not the feature values themselves.
Use feature_values to retrieve actual feature values.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `str \| Table \| None` | Optional table to find features for. If None, returns all feature definitions across all tables. | `None` |

Returns:

| Type | Description |
|---|---|
| `list[Feature]` | A list of Feature instances describing the feature definitions. |

Examples:

Find all feature definitions::

    >>> all_features = ml.find_features()  # doctest: +SKIP
    >>> for f in all_features:
    ...     print(f"{f.target_table.name}.{f.feature_name}")

Find features defined on a specific table::

    >>> image_features = ml.find_features("Image")  # doctest: +SKIP
    >>> print([f.feature_name for f in image_features])
Source code in src/deriva_ml/core/mixins/feature.py
find_incomplete_executions
find_incomplete_executions() -> (
list[ExecutionSnapshot]
)
Sugar over :meth:list_executions for everything not terminally done.
Reads from the workspace SQLite registry — no server contact. Returns executions in status in (Created, Running, Stopped, Failed, Pending_Upload) — the set of things a user would want to either resume, retry, or clean up. Excludes Uploaded (terminal success) and Aborted (terminal cleanup).
For live catalog queries returning mutable
:class:~deriva_ml.execution.execution_record.ExecutionRecord
objects, see find_executions(status=...).
Returns:

| Type | Description |
|---|---|
| `list[ExecutionSnapshot]` | List of `ExecutionSnapshot` objects, one per incomplete execution known to the local registry. |

Example::

    >>> for snap in ml.find_incomplete_executions():  # doctest: +SKIP
    ...     print(snap.rid, snap.status, snap.pending_rows)
Source code in src/deriva_ml/core/mixins/execution.py
find_workflows
find_workflows(
sort: SortSpec = None,
) -> list[Workflow]
Find all workflows in the catalog.
Catalog-level operation to find all workflow definitions, including their names, URLs, types, versions, and descriptions. Each returned Workflow is bound to the catalog, allowing its description to be updated.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `sort` | `SortSpec` | Optional sort spec; pass `True` to sort newest-first. | `None` |

Returns:

| Type | Description |
|---|---|
| `list[Workflow]` | List of workflow objects, each containing: `name` (workflow name), `url` (source code URL), `workflow_type` (type(s) of workflow), `version` (version identifier), `description` (workflow description), `rid` (resource identifier), and `checksum` (source code checksum). |

Examples:

List all workflows and their descriptions::

    >>> workflows = ml.find_workflows()
    >>> for w in workflows:
    ...     print(f"{w.name} (v{w.version}): {w.description}")
    ...     print(f"  Source: {w.url}")

Update a workflow's description (workflows are catalog-bound)::

    >>> workflows = ml.find_workflows()
    >>> workflows[0].description = "Updated description"

Newest-first (most common)::

    >>> recent = list(ml.find_workflows(sort=True))  # doctest: +SKIP
Source code in src/deriva_ml/core/mixins/workflow.py
from_context
classmethod
from_context(
path: Path | str | None = None,
) -> Self
Create a DerivaML instance from a .deriva-context.json file.
Searches for .deriva-context.json starting from path (default: cwd),
walking up parent directories. This enables scripts generated by Claude
to connect to the same catalog without hardcoding connection details.
The context file is written by the MCP server's connect_catalog tool
and contains hostname, catalog_id, and default_schema.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path \| str \| None` | Starting directory to search for the context file. Defaults to the current working directory. | `None` |

Returns:

| Type | Description |
|---|---|
| `Self` | A new DerivaML instance configured from the context file. |

Raises:

| Type | Description |
|---|---|
| `FileNotFoundError` | If no .deriva-context.json is found. |

Example::

    # In a script generated by Claude:
    from deriva_ml import DerivaML
    ml = DerivaML.from_context()
    subjects = ml.cache_table("Subject")
Source code in src/deriva_ml/core/base.py
gc_executions
gc_executions(
*,
older_than: "timedelta | None" = None,
status: "ExecutionStatus | list[ExecutionStatus] | None" = None,
delete_working_dir: bool = False,
) -> int
Garbage-collect execution registry rows matching the filters.
By default only removes registry state (SQLite rows and their
pending_rows / directory_rules). Pass delete_working_dir=True to
also rm -rf the on-disk execution root under the workspace.
Does NOT touch the catalog. Executions uploaded to the catalog remain there regardless of local gc.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `older_than` | `timedelta \| None` | If set, only gc executions whose last_activity is older than this timedelta. | `None` |
| `status` | `ExecutionStatus \| list[ExecutionStatus] \| None` | Filter by status (single or list); None = any status. Typical: pass ExecutionStatus.Uploaded to clean up after successful uploads. | `None` |
| `delete_working_dir` | `bool` | If True, remove the per-execution working directory from disk. Defaults to False (registry-only). | `False` |

Returns:

| Type | Description |
|---|---|
| `int` | The number of executions removed. |

Example::

    >>> from datetime import timedelta  # doctest: +SKIP
    >>> from deriva_ml.execution.state_store import ExecutionStatus
    >>> n = ml.gc_executions(
    ...     status=ExecutionStatus.Uploaded,
    ...     older_than=timedelta(days=30),
    ...     delete_working_dir=True,
    ... )
    >>> print(f"cleaned {n} old executions")
Source code in src/deriva_ml/core/mixins/execution.py
get_cache_size
get_cache_size() -> dict[
str, int | float
]
Get the current size of the cache directory.
Returns:

| Type | Description |
|---|---|
| `dict[str, int \| float]` | Dict with keys: `total_bytes` (total size in bytes), `total_mb` (total size in megabytes), `file_count` (number of files), and `dir_count` (number of directories). |

Example::

    >>> ml = DerivaML('deriva.example.org', 'my_catalog')
    >>> size = ml.get_cache_size()
    >>> print(f"Cache size: {size['total_mb']:.1f} MB ({size['file_count']} files)")
Source code in src/deriva_ml/core/base.py
get_column_annotations
get_column_annotations(
table: str | Table, column_name: str
) -> dict[str, Any]
Get all Chaise display-related annotations for a column.
Returns display and column-display annotations. Missing annotations
are None.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `str \| Table` | Table name (str) or `Table` object. | required |
| `column_name` | `str` | Name of the column. | required |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Dict of the column's display-related annotations (display and column-display); missing annotations are None. |

Raises:

| Type | Description |
|---|---|
| `DerivaMLTableTypeError` | |
| `DerivaMLException` | |

Example::

    >>> anns = ml.get_column_annotations("Image", "Filename")  # doctest: +SKIP
    >>> anns["display"]  # doctest: +SKIP
Source code in src/deriva_ml/core/mixins/annotation.py
get_handlebars_template_variables
get_handlebars_template_variables(
table: str | Table,
) -> dict[str, Any]
Get all available template variables for a table.
Returns the columns, foreign keys, and special variables that can be used in Handlebars templates (row_markdown_pattern, markdown_pattern, etc.) for the specified table.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `str \| Table` | Table name or Table object. | required |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Dictionary with columns, foreign_keys, special_variables, and helper_examples. |

Example::

    >>> vars = ml.get_handlebars_template_variables("Image")  # doctest: +SKIP
    >>> for col in vars["columns"]:  # doctest: +SKIP
    ...     print(f"{col['name']}: {col['template']}")
Source code in src/deriva_ml/core/mixins/annotation.py
get_storage_summary
get_storage_summary() -> dict[str, any]
Get a summary of local storage usage.
Returns:

| Type | Description |
|---|---|
| `dict[str, any]` | Dict with keys: `working_dir` (path to working directory), `cache_dir` (path to cache directory), `cache_size_mb` (cache size in MB), `cache_file_count` (number of files in cache), `execution_dir_count` (number of execution directories), `execution_size_mb` (total size of execution directories in MB), and `total_size_mb` (combined size in MB). |

Example::

    >>> ml = DerivaML('deriva.example.org', 'my_catalog')
    >>> summary = ml.get_storage_summary()
    >>> print(f"Total storage: {summary['total_size_mb']:.1f} MB")
    >>> print(f"  Cache: {summary['cache_size_mb']:.1f} MB")
    >>> print(f"  Executions: {summary['execution_size_mb']:.1f} MB")
Source code in src/deriva_ml/core/base.py
get_table_annotations
get_table_annotations(
table: str | Table,
) -> dict[str, Any]
Get all Chaise display-related annotations for a table.
Returns the current values of display, visible-columns,
visible-foreign-keys, and table-display annotations. Missing
annotations are represented as None in the returned dict.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `str \| Table` | Table name (str) or `Table` object. | required |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Dict with keys `display`, `visible_columns`, `visible_foreign_keys`, and `table_display` (each dict \| None); missing annotations are represented as None. |

Raises:

| Type | Description |
|---|---|
| `DerivaMLTableTypeError` | |

Example::

    >>> anns = ml.get_table_annotations("Image")  # doctest: +SKIP
    >>> anns["visible_columns"]  # doctest: +SKIP
Source code in src/deriva_ml/core/mixins/annotation.py
get_table_as_dataframe
get_table_as_dataframe(
table: str,
) -> pd.DataFrame
Get table contents as a pandas DataFrame.
Retrieves all contents of a table from the catalog.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `str` | Name of the table to retrieve. | required |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | DataFrame containing all table contents. |

Raises:

| Type | Description |
|---|---|
| `DerivaMLTableNotFound` | If the table does not exist in any schema. |

Example::

    >>> df = ml.get_table_as_dataframe("Subject")  # doctest: +SKIP
Source code in src/deriva_ml/core/mixins/path_builder.py
get_table_as_dict
get_table_as_dict(
table: str,
) -> Iterable[dict[str, Any]]
Get table contents as dictionaries.
Retrieves all contents of a table from the catalog.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `str` | Name of the table to retrieve. | required |

Returns:

| Type | Description |
|---|---|
| `Iterable[dict[str, Any]]` | Iterable yielding dictionaries for each row. |

Raises:

| Type | Description |
|---|---|
| `DerivaMLTableNotFound` | If the table does not exist in any schema. |

Example::

    >>> rows = list(ml.get_table_as_dict("Subject"))  # doctest: +SKIP
Source code in src/deriva_ml/core/mixins/path_builder.py
instantiate
classmethod
instantiate(
config: DerivaMLConfig,
) -> Self
Create a DerivaML instance from a configuration object.
This method is the preferred way to instantiate DerivaML when using hydra-zen for configuration management. It accepts a DerivaMLConfig (Pydantic model) and unpacks it to create the instance.
This pattern allows hydra-zen's instantiate() to work with DerivaML:
Example with hydra-zen::

    >>> from hydra_zen import builds, instantiate
    >>> from deriva_ml import DerivaML
    >>> from deriva_ml.core.config import DerivaMLConfig
    >>> # Create a structured config using hydra-zen
    >>> DerivaMLConf = builds(DerivaMLConfig, populate_full_signature=True)
    >>> # Configure for your environment
    >>> conf = DerivaMLConf(
    ...     hostname='deriva.example.org',
    ...     catalog_id='42',
    ...     domain_schema='my_domain',
    ... )
    >>> # Instantiate the config to get a DerivaMLConfig object
    >>> config = instantiate(conf)
    >>> # Create the DerivaML instance
    >>> ml = DerivaML.instantiate(config)

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `DerivaMLConfig` | A DerivaMLConfig object containing all configuration parameters. | required |

Returns:

| Type | Description |
|---|---|
| `Self` | A new DerivaML instance configured according to the config object. |
Note
The DerivaMLConfig class integrates with Hydra's configuration system
and registers custom resolvers for computing working directories.
See deriva_ml.core.config for details on configuration options.
Source code in src/deriva_ml/core/base.py
is_snapshot
is_snapshot() -> bool
Check whether this DerivaML instance is connected to a catalog snapshot.
Returns:

| Type | Description |
|---|---|
| `bool` | True if the underlying catalog has a snapshot timestamp, False otherwise. |
Source code in src/deriva_ml/core/base.py
is_strict_preallocated_rid
is_strict_preallocated_rid(
table: str | Table,
) -> bool
Return True if the asset table has the strict-preallocated-RID annotation set.
Checks for the tag:isrd.isi.edu,2026:strict-preallocated-rid
annotation. Returns True iff the annotation is present with
{"strict": true}.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `str \| Table` | Asset table name or Table object. | required |

Returns:

| Type | Description |
|---|---|
| `bool` | True if strict mode is set on this table, False otherwise. |
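The documented rule (annotation present with `{"strict": true}`) is easy to mirror against a plain annotations dict. This sketch is illustrative only, not the DerivaML implementation; the tag string is the one quoted in the description above.

```python
STRICT_RID_TAG = "tag:isrd.isi.edu,2026:strict-preallocated-rid"


def strict_preallocated(annotations: dict) -> bool:
    # True iff the annotation is present AND carries {"strict": True},
    # matching the semantics documented for is_strict_preallocated_rid().
    value = annotations.get(STRICT_RID_TAG)
    return isinstance(value, dict) and value.get("strict") is True
```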
Source code in src/deriva_ml/core/mixins/annotation.py
list_asset_executions
list_asset_executions(
asset_rid: str,
asset_role: str | None = None,
) -> list["ExecutionRecord"]
List all executions associated with an asset.
Given an asset RID, returns a list of executions that created or used the asset, along with the role (Input/Output) in each execution.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `asset_rid` | `str` | The RID of the asset to look up. | required |
| `asset_role` | `str \| None` | Optional filter for asset role ('Input' or 'Output'). If None, returns all associations. | None |

Returns:

| Type | Description |
|---|---|
| `list[ExecutionRecord]` | List of ExecutionRecord objects for the executions associated with this asset. |

Raises:

| Type | Description |
|---|---|
| `DerivaMLException` | If the asset RID is not found or is not an asset. |
Example
Find all executions that created this asset:

>>> executions = ml.list_asset_executions("1-abc123", asset_role="Output")
>>> for exe in executions:
...     print(f"Created by execution {exe.execution_rid}")

Find all executions that used this asset as input:

>>> executions = ml.list_asset_executions("1-abc123", asset_role="Input")
Source code in src/deriva_ml/core/mixins/asset.py
list_asset_tables
list_asset_tables() -> list[Table]
List all asset tables in the catalog.
Returns:

| Type | Description |
|---|---|
| `list[Table]` | List of Table objects that are asset tables. |
Example
>>> for table in ml.list_asset_tables():
...     print(f"Asset table: {table.name}")
Source code in src/deriva_ml/core/mixins/asset.py
list_assets
list_assets(
asset_table: Table | str,
) -> list["Asset"]
Lists contents of an asset table.
Returns a list of Asset objects for the specified asset table. Asset
types are pre-fetched in a single query and joined client-side to
avoid an N+1 round-trip pattern: for an asset table with N rows, the
catalog is hit twice (once for the assets, once for the
Asset_Type association rows) regardless of N.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `asset_table` | `Table \| str` | Table or name of the asset table to list assets for. | required |

Returns:

| Type | Description |
|---|---|
| `list[Asset]` | List of Asset objects for the assets in the table. |

Raises:

| Type | Description |
|---|---|
| `DerivaMLException` | If the table is not an asset table or doesn't exist. |
Example
>>> assets = ml.list_assets("Image")
>>> for asset in assets:
...     print(f"{asset.asset_rid}: {asset.filename}")
Source code in src/deriva_ml/core/mixins/asset.py
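The two-query join described above can be sketched client-side. In this illustrative sketch, the record keys `RID`, `Asset`, and `Type` are assumptions for the example, not the exact catalog column names:

```python
from collections import defaultdict

# Sketch of the client-side join that avoids N+1 round-trips: one pass over
# the Asset_Type association rows builds an index, then each asset gets its
# types by an O(1) dict lookup.
def join_asset_types(assets, type_rows):
    types_by_rid = defaultdict(list)
    for row in type_rows:                      # Asset_Type association rows
        types_by_rid[row["Asset"]].append(row["Type"])
    return [dict(a, types=types_by_rid[a["RID"]]) for a in assets]

assets = [{"RID": "1-a"}, {"RID": "1-b"}]
type_rows = [{"Asset": "1-a", "Type": "png"}, {"Asset": "1-a", "Type": "image"}]
joined = join_asset_types(assets, type_rows)
print(joined[0]["types"])  # ['png', 'image']
print(joined[1]["types"])  # []
```

Whatever the row count, the catalog is queried twice, and the join cost stays linear in the number of rows returned.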
list_dataset_element_types
list_dataset_element_types() -> Iterable[Table]
List the table types that can be added as dataset members.
Returns every table that has an association with the Dataset table,
restricted to domain-schema tables and the Dataset table itself.
These are the types accepted by add_dataset_members().
Returns:

| Type | Description |
|---|---|
| `Iterable[Table]` | Iterable of Table objects that can be added as dataset members. |

Raises:

| Type | Description |
|---|---|
| `DerivaMLException` | If the catalog schema cannot be read. |
Example
>>> types = ml.list_dataset_element_types()
>>> print([t.name for t in types])
Source code in src/deriva_ml/core/mixins/dataset.py
list_execution_dirs
list_execution_dirs() -> list[dict[str, Any]]
List execution working directories.
Returns information about each execution directory in the working directory, useful for identifying orphaned or incomplete execution outputs.
Returns:

| Type | Description |
|---|---|
| `list[dict[str, Any]]` | List of dicts, each containing: `execution_rid` (the execution RID, which is the directory name), `path` (full path to the directory), `size_bytes` (total size in bytes), `size_mb` (total size in megabytes), `modified` (last modification time, datetime), and `file_count` (number of files). |
Example
>>> ml = DerivaML('deriva.example.org', 'my_catalog')
>>> dirs = ml.list_execution_dirs()
>>> for d in dirs:
...     print(f"{d['execution_rid']}: {d['size_mb']:.1f} MB")
Source code in src/deriva_ml/core/base.py
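The per-directory statistics can be computed locally with pathlib. This is a sketch under the assumption that each execution directory is named after its RID, as the return description states; the helper name is illustrative:

```python
from pathlib import Path
import tempfile

# Sketch of how one execution-directory entry can be assembled; the real
# method returns the same kinds of fields for every directory it finds.
def dir_stats(path: Path) -> dict:
    files = [p for p in path.rglob("*") if p.is_file()]
    size = sum(p.stat().st_size for p in files)
    return {
        "execution_rid": path.name,        # directory name doubles as the RID
        "path": str(path),
        "size_bytes": size,
        "size_mb": size / (1024 * 1024),
        "file_count": len(files),
    }

with tempfile.TemporaryDirectory() as tmp:
    run = Path(tmp) / "1-abc"
    run.mkdir()
    (run / "out.txt").write_text("hello")
    stats = dir_stats(run)
    print(stats["file_count"], stats["size_bytes"])  # 1 5
```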
list_executions
list_executions(
*,
status: "ExecutionStatus | list[ExecutionStatus] | None" = None,
workflow_rid: str | None = None,
mode: "ConnectionMode | None" = None,
since: datetime | None = None,
) -> list[ExecutionSnapshot]
Enumerate locally-known executions from the SQLite registry.
Reads from the workspace SQLite registry; no server contact.
Works in both online and offline mode. Each returned
ExecutionSnapshot is a frozen Pydantic value object captured
at query time; it cannot mutate the catalog. Pending-row counts
are included in the same pass.
For live catalog queries that return mutable `ExecutionRecord` objects bound to the catalog, see find_executions() and lookup_execution().
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `status` | `ExecutionStatus \| list[ExecutionStatus] \| None` | Single ExecutionStatus or list to filter; None = all. | None |
| `workflow_rid` | `str \| None` | Match only executions tagged with this Workflow RID; None = all. | None |
| `mode` | `ConnectionMode \| None` | ConnectionMode the execution was last active under; None = all. | None |
| `since` | `datetime \| None` | Return only executions with last_activity >= this timestamp (timezone-aware). None = no time filter. | None |

Returns:

| Type | Description |
|---|---|
| `list[ExecutionSnapshot]` | List of ExecutionSnapshot objects, one per row in the registry. Empty list if nothing matches. |
Example
>>> from deriva_ml.execution.state_store import ExecutionStatus
>>> failed = ml.list_executions(status=ExecutionStatus.Failed)
>>> for snap in failed:
...     print(snap.rid, snap.error)
Source code in src/deriva_ml/core/mixins/execution.py
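Because the returned snapshots are frozen value objects, the filters above amount to pure predicates over local records. The sketch below illustrates that filtering logic with a stand-in record type; the field names (`rid`, `status`, `last_activity`) mirror the documented parameters but are assumptions about the snapshot's shape:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Illustrative stand-in for ExecutionSnapshot; frozen, like the real object.
@dataclass(frozen=True)
class Snapshot:
    rid: str
    status: str
    last_activity: datetime

def filter_executions(snaps, *, status=None, since=None):
    # status may be a single value or a list; None means "match all".
    wanted = {status} if isinstance(status, str) else (set(status) if status else None)
    return [
        s for s in snaps
        if (wanted is None or s.status in wanted)
        and (since is None or s.last_activity >= since)
    ]

snaps = [
    Snapshot("1-a", "Failed", datetime(2024, 1, 2, tzinfo=timezone.utc)),
    Snapshot("1-b", "Completed", datetime(2023, 12, 1, tzinfo=timezone.utc)),
]
print([s.rid for s in filter_executions(snaps, status="Failed")])  # ['1-a']
```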
list_feature_values
list_feature_values(
*args, **kwargs
) -> Iterable[FeatureRecord]
Retired: renamed to feature_values.
DerivaML.list_feature_values has been removed. Use the new feature_values method instead:

>>> for rec in ml.feature_values("Image", "Quality"):
...     ...

The signature is identical (table, feature_name, optional selector).
Raises:

| Type | Description |
|---|---|
| `DerivaMLException` | Always. Points at the replacement API. |
Source code in src/deriva_ml/core/mixins/feature.py
list_files
list_files(
file_types: list[str] | None = None,
) -> list[dict[str, Any]]
Lists files in the catalog with their metadata.
Returns a list of files with their metadata including URL, MD5 hash, length, description, and associated file types. Files can be optionally filtered by type.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `file_types` | `list[str] \| None` | Filter results to only include these file types. | None |

Returns:

| Type | Description |
|---|---|
| `list[dict[str, Any]]` | List of file records, each containing: RID (resource identifier), URL (file location), MD5 (file hash), Length (file size), Description (file description), and File_Types (list of associated file types). |
Examples:
List all files:

>>> files = ml.list_files()
>>> for f in files:
...     print(f"{f['RID']}: {f['URL']}")

Filter by file type:

>>> image_files = ml.list_files(["image", "png"])
Source code in src/deriva_ml/core/mixins/file.py
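The `file_types` filter behaves as a set-intersection test against each record's File_Types list. A minimal sketch of that matching rule, using the documented record keys (`RID`, `File_Types`) but an illustrative helper name:

```python
# Sketch of the file_types filter semantics: a record matches when it carries
# at least one of the requested types; None means "no filter".
def filter_by_type(files, file_types=None):
    if file_types is None:
        return list(files)
    wanted = set(file_types)
    return [f for f in files if wanted & set(f.get("File_Types", []))]

files = [
    {"RID": "1-a", "File_Types": ["image", "png"]},
    {"RID": "1-b", "File_Types": ["csv"]},
]
print([f["RID"] for f in filter_by_type(files, ["image"])])  # ['1-a']
print(len(filter_by_type(files)))                            # 2
```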
list_vocabulary_terms
list_vocabulary_terms(
table: str | Table,
) -> list[VocabularyTerm]
Lists all terms in a vocabulary table.
Retrieves all terms, their descriptions, and synonyms from a controlled vocabulary table.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `str \| Table` | Vocabulary table to list terms from (name or Table object). | required |

Returns:

| Type | Description |
|---|---|
| `list[VocabularyTerm]` | List of vocabulary terms with their metadata. |

Raises:

| Type | Description |
|---|---|
| `DerivaMLException` | If table doesn't exist or is not a vocabulary table. |
Examples:
>>> terms = ml.list_vocabulary_terms("tissue_types")
>>> for term in terms:
... print(f"{term.name}: {term.description}")
... if term.synonyms:
... print(f" Synonyms: {', '.join(term.synonyms)}")
Source code in src/deriva_ml/core/mixins/vocabulary.py
list_workflow_executions
list_workflow_executions(
workflow: str,
) -> list[str]
Return execution RIDs that ran the given workflow.
The workflow argument resolves in two steps: first as a Workflow
RID, and if that fails, as a Workflow_Type name. The returned list
contains every execution RID for every workflow that matches.
This method is the catalog-backed building block for FeatureRecord.select_by_workflow(workflow, container=ml): it resolves the workflow's execution set once, and the selector closes over the result for cheap per-group membership testing.
Entries are unique by construction (each execution runs one workflow).
Consumers that need O(1) membership testing convert to set at the
call site.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `workflow` | `str` | Workflow RID, or a Workflow_Type name. | required |

Returns:

| Type | Description |
|---|---|
| `list[str]` | List of execution RIDs, in insertion order. May be empty if the workflow exists but has no executions yet. |

Raises:

| Type | Description |
|---|---|
| `DerivaMLException` | If the workflow cannot be resolved as a Workflow RID or a Workflow_Type name. |
Example
List all executions of a workflow and count them:
>>> rids = ml.list_workflow_executions("Glaucoma_Training_v2")
>>> print(f"{len(rids)} executions of this workflow")
Use as the catalog-backed resolver for the selector factory:
>>> from deriva_ml.feature import FeatureRecord
>>> sel = FeatureRecord.select_by_workflow(
... "Glaucoma_Training_v2", container=ml,
... )
Source code in src/deriva_ml/core/mixins/feature.py
lookup_asset
lookup_asset(asset_rid: RID) -> 'Asset'
Look up an asset by its RID.
Returns an Asset object for the specified RID. The asset can be from any asset table in the catalog.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `asset_rid` | `RID` | The RID of the asset to look up. | required |

Returns:

| Type | Description |
|---|---|
| `Asset` | Asset object for the specified RID. |

Raises:

| Type | Description |
|---|---|
| `DerivaMLException` | If the RID is not found or is not an asset. |
Example
>>> asset = ml.lookup_asset("3JSE")
>>> print(f"File: {asset.filename}, Table: {asset.asset_table}")
Source code in src/deriva_ml/core/mixins/asset.py
lookup_dataset
lookup_dataset(
dataset: RID | DatasetSpec,
deleted: bool = False,
) -> "Dataset"
Look up a dataset by RID or DatasetSpec.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dataset` | `RID \| DatasetSpec` | Dataset RID or DatasetSpec to look up. | required |
| `deleted` | `bool` | If True, include datasets that have been marked as deleted. | False |

Returns:

| Name | Type | Description |
|---|---|---|
| `Dataset` | `Dataset` | The dataset object for the specified RID. |

Raises:

| Type | Description |
|---|---|
| `DerivaMLException` | If the dataset is not found. |
Example
>>> dataset = ml.lookup_dataset("4HM")
>>> print(f"Version: {dataset.current_version}")
Source code in src/deriva_ml/core/mixins/dataset.py
lookup_execution
lookup_execution(
execution_rid: RID,
) -> "ExecutionRecord"
Look up a single execution by RID in the live catalog.
Queries the ERMrest catalog for the Execution row with the given RID and returns an ExecutionRecord: a live, catalog-bound value whose mutable properties (status, description) write through to the catalog on assignment. Online mode only.
For enumerating executions from the local SQLite registry without
touching the catalog, see list_executions(). For catalog-side
filter queries returning live records, see find_executions().
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `execution_rid` | `RID` | Resource Identifier (RID) of the execution. | required |

Returns:

| Type | Description |
|---|---|
| `ExecutionRecord` | A live ExecutionRecord whose setters (status, description) write through to the catalog. |

Raises:

| Type | Description |
|---|---|
| `DerivaMLException` | If execution_rid is not valid or doesn't refer to an Execution record. |
Example
>>> record = ml.lookup_execution("1-abc123")
>>> record.status = ExecutionStatus.Uploaded  # writes to catalog
Source code in src/deriva_ml/core/mixins/execution.py
lookup_experiment
lookup_experiment(
execution_rid: RID,
) -> "Experiment"
Look up an experiment by execution RID.
Creates an Experiment object for analyzing completed executions. Provides convenient access to execution metadata, configuration choices, model parameters, inputs, and outputs.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `execution_rid` | `RID` | Resource Identifier (RID) of the execution. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| `Experiment` | `Experiment` | An experiment object for the given execution RID. |
Example
>>> exp = ml.lookup_experiment("47BE")
>>> print(exp.name)            # e.g., "cifar10_quick"
>>> print(exp.config_choices)  # Hydra config names used
>>> print(exp.model_config)    # Model hyperparameters
Source code in src/deriva_ml/core/mixins/execution.py
lookup_feature
lookup_feature(
table: str | Table,
feature_name: str,
) -> Feature
Look up a feature definition by table and name.
Returns a Feature object that describes the schema structure of a feature, not the feature values themselves. A Feature is a schema-level descriptor derived by inspecting the catalog's association tables. It tells you:

- What table the feature annotates (`target_table`), e.g. Image.
- Where values are stored (`feature_table`), the association table linking targets to values and executions.
- What kind of values it holds, classified by column role:
  - `term_columns`: columns referencing controlled vocabulary tables (e.g., a `Diagnosis_Type` column pointing to a vocabulary of diagnosis terms).
  - `asset_columns`: columns referencing asset tables (e.g., a `Segmentation_Mask` column).
  - `value_columns`: columns holding direct values like floats, ints, or text (e.g., a `confidence` score).
The Feature object also provides feature_record_class(), which
returns a dynamically generated Pydantic model for constructing
validated feature records to insert into the catalog.
To retrieve actual feature values, use feature_values
instead.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `str \| Table` | The table the feature is defined on (name or Table object). | required |
| `feature_name` | `str` | Name of the feature to look up. | required |

Returns:

| Type | Description |
|---|---|
| `Feature` | A Feature schema descriptor. |

Raises:

| Type | Description |
|---|---|
| `DerivaMLException` | If the feature doesn't exist on the specified table. |
Example
>>> feature = ml.lookup_feature("Image", "Classification")
>>> print(f"Feature: {feature.feature_name}")
>>> print(f"Stored in: {feature.feature_table.name}")
>>> print(f"Term columns: {[c.name for c in feature.term_columns]}")
>>> print(f"Value columns: {[c.name for c in feature.value_columns]}")
Source code in src/deriva_ml/core/mixins/feature.py
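The three column roles amount to a classification by what each column references. The sketch below illustrates that rule; representing a column as a `(name, referenced_table)` tuple (with `None` for direct values) is an assumption made for the example, not the catalog's actual column model:

```python
# Illustrative sketch of how feature columns split into term, asset, and
# value roles based on the table each column references.
def classify_columns(columns, vocab_tables, asset_tables):
    roles = {"term_columns": [], "asset_columns": [], "value_columns": []}
    for name, ref in columns:
        if ref in vocab_tables:
            roles["term_columns"].append(name)
        elif ref in asset_tables:
            roles["asset_columns"].append(name)
        else:
            roles["value_columns"].append(name)
    return roles

cols = [
    ("Diagnosis_Type", "Diagnosis"),   # references a vocabulary table
    ("Segmentation_Mask", "Mask"),     # references an asset table
    ("confidence", None),              # direct float value
]
print(classify_columns(cols, vocab_tables={"Diagnosis"}, asset_tables={"Mask"}))
```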
lookup_term
lookup_term(
table: str | Table, term_name: str
) -> VocabularyTermHandle
Finds a term in a vocabulary table.
Searches for a term in the specified vocabulary table, matching either the primary name or any of its synonyms. Results are cached for performance: subsequent lookups in the same vocabulary table are served from the cache.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `str \| Table` | Vocabulary table to search in (name or Table object). | required |
| `term_name` | `str` | Name or synonym of the term to find. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| `VocabularyTermHandle` | `VocabularyTermHandle` | The matching vocabulary term, with methods to modify it. |

Raises:

| Type | Description |
|---|---|
| `DerivaMLVocabularyException` | If the table is not a vocabulary table, or the term is not found. |
Examples:
Look up by primary name:

>>> term = ml.lookup_term("tissue_types", "epithelial")
>>> print(term.description)

Look up by synonym:

>>> term = ml.lookup_term("tissue_types", "epithelium")

Modify the term:

>>> term = ml.lookup_term("tissue_types", "epithelial")
>>> term.description = "Updated description"
>>> term.synonyms = ("epithelium", "epithelial_tissue")
Source code in src/deriva_ml/core/mixins/vocabulary.py
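The name-or-synonym matching that the cache makes cheap can be sketched as a one-time index build followed by dict lookups. The record keys (`Name`, `Synonyms`) are illustrative assumptions, not the exact catalog column names:

```python
# Sketch of a per-table term index: every primary name and every synonym
# maps to the same term record, so lookups by either form are O(1).
def build_term_index(terms):
    index = {}
    for t in terms:
        index[t["Name"]] = t
        for syn in t.get("Synonyms") or []:
            index[syn] = t
    return index

terms = [{"Name": "epithelial", "Synonyms": ["epithelium"]}]
index = build_term_index(terms)
print(index["epithelium"]["Name"])                 # epithelial
print(index["epithelial"] is index["epithelium"])  # True: same record
```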
lookup_workflow
lookup_workflow(rid: RID) -> Workflow
Look up a workflow by its Resource Identifier (RID).
Retrieves a workflow from the catalog by its RID and returns a Workflow object bound to the catalog. The returned Workflow can be modified (e.g., updating its description) and changes will be reflected in the catalog.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `rid` | `RID` | Resource Identifier of the workflow to look up. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| `Workflow` | `Workflow` | The workflow object bound to this catalog, allowing properties like `description` to be read and updated. |

Raises:

| Type | Description |
|---|---|
| `DerivaMLException` | If the RID does not correspond to a workflow in the catalog. |
Examples:
Look up a workflow and read its properties:
>>> workflow = ml.lookup_workflow("2-ABC1")
>>> print(f"Name: {workflow.name}")
>>> print(f"Description: {workflow.description}")
>>> print(f"Type: {workflow.workflow_type}")
Update a workflow's description (persisted to catalog):
>>> workflow = ml.lookup_workflow("2-ABC1")
>>> workflow.description = "Updated analysis pipeline for RNA sequences"
>>> # The change is immediately written to the catalog
Attempting to update on a read-only catalog raises an error:
>>> snapshot = ml.catalog_snapshot("2023-01-15T10:30:00")
>>> workflow = snapshot.lookup_workflow("2-ABC1")
>>> workflow.description = "New description"
DerivaMLException: Cannot update workflow description on a read-only
catalog snapshot. Use a writable catalog connection instead.
Source code in src/deriva_ml/core/mixins/workflow.py
lookup_workflow_by_url
lookup_workflow_by_url(
url_or_checksum: str,
) -> Workflow
Look up a workflow by URL or checksum and return the full Workflow object.
Searches for a workflow in the catalog that matches the given URL or checksum and returns a Workflow object bound to the catalog. This allows you to both identify a workflow by its source code location and modify its properties (e.g., description).
The URL should be a GitHub URL pointing to the specific version of the workflow source code. The format typically includes the commit hash:
https://github.com/org/repo/blob/<commit_hash>/path/to/workflow.py
Alternatively, you can search by the Git object hash (checksum) of the workflow file.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `url_or_checksum` | `str` | GitHub URL with commit hash, or Git object hash (checksum) of the workflow file. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| `Workflow` | `Workflow` | The workflow object bound to this catalog, allowing properties like `description` to be read and updated. |

Raises:

| Type | Description |
|---|---|
| `DerivaMLException` | If no workflow with the given URL or checksum is found in the catalog. |
Examples:
Look up a workflow by its GitHub URL:
>>> url = "https://github.com/org/repo/blob/abc123/analysis.py"
>>> workflow = ml.lookup_workflow_by_url(url)
>>> print(f"Found: {workflow.name}")
>>> print(f"Version: {workflow.version}")
Look up by Git object hash (checksum):
>>> workflow = ml.lookup_workflow_by_url("abc123def456789...")
>>> print(f"Name: {workflow.name}")
>>> print(f"URL: {workflow.url}")
Update the workflow's description after lookup:
>>> workflow = ml.lookup_workflow_by_url(url)
>>> workflow.description = "Updated analysis pipeline"
>>> # The change is persisted to the catalog
Typical GitHub URL formats supported:
# Full blob URL with commit hash
https://github.com/org/repo/blob/abc123def/src/workflow.py
# The URL is matched exactly, so ensure it matches what was
# recorded when the workflow was registered
Source code in src/deriva_ml/core/mixins/workflow.py
pathBuilder
pathBuilder() -> SchemaWrapper
Returns catalog path builder for queries.
The path builder provides a fluent interface for constructing complex queries against the catalog. This is a core component used by many other methods to interact with the catalog.
Returns:

| Type | Description |
|---|---|
| `SchemaWrapper` | A new instance of the catalog path builder. |

Raises:

| Type | Description |
|---|---|
| `Exception` | If the catalog connection is unavailable. |
Example
>>> pb = ml.pathBuilder()
>>> path = pb.schemas['my_schema'].tables['my_table']
>>> results = path.entities().fetch()
Source code in src/deriva_ml/core/mixins/path_builder.py
pending_summary
pending_summary() -> WorkspacePendingSummary
Workspace-wide pending-upload summary.
Queries every known-local execution and returns a WorkspacePendingSummary aggregating per-execution snapshots. Useful for standalone uploader processes that want to know what's pending across runs.
Returns:

| Type | Description |
|---|---|
| `WorkspacePendingSummary` | A WorkspacePendingSummary with one PendingSummary per execution that has at least one registry row. |
Example
>>> print(ml.pending_summary().render())
Source code in src/deriva_ml/core/mixins/execution.py
pin_schema
pin_schema(
reason: str | None = None,
) -> "SchemaDiff | None"
Freeze the local schema cache at its current snapshot.
While pinned, refresh_schema() refuses to update the cache (even with force=True). Call unpin_schema() to clear the pin.

Online mode additionally checks for structural drift: if the live catalog has moved on and its /schema payload differs from the cached one (columns, tables, foreign keys, etc.), a SchemaDiff describing the drift is returned, and a WARNING is logged. The pin is still persisted.

Offline mode always returns None: the cache is pinned, but no live comparison is possible.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `reason` | `str \| None` | Free-text explanation stored alongside the pin. Useful for reporting. | None |

Returns:

| Type | Description |
|---|---|
| `SchemaDiff \| None` | A SchemaDiff if, in online mode, the live catalog's schema differs structurally from the cache; None otherwise (e.g., offline mode, no drift, or the catalog version was bumped without schema change). |

Raises:

| Type | Description |
|---|---|
| `FileNotFoundError` | If the workspace has no cache yet; run an online refresh first. |
Source code in src/deriva_ml/core/base.py
pin_status
pin_status() -> 'PinStatus'
Return the current pin state of the local schema cache.
Works in any mode.
Returns:

| Type | Description |
|---|---|
| `PinStatus` | A PinStatus describing the current pin state of the local schema cache. |

Raises:

| Type | Description |
|---|---|
| `FileNotFoundError` | If the workspace has no cache file. |
Source code in src/deriva_ml/core/base.py
refresh_schema
refresh_schema(*, force: bool = False) -> None
Fetch the current catalog schema and overwrite the workspace cache.
Online mode only. Refuses in two cases:

- The cache is pinned (via pin_schema()). Raises DerivaMLSchemaPinned. force=True does NOT bypass a pin; call unpin_schema() first.
- The workspace has pending rows (staged/leasing/leased/uploading/failed). Raises DerivaMLSchemaRefreshBlocked unless force=True is passed; a forced refresh may leave staged rows whose metadata references columns or types no longer in the new schema, causing catalog-insert failures on the next upload.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `force` | `bool` | If True, refresh even when the workspace has pending rows. Does NOT bypass a pin. | False |

Raises:

| Type | Description |
|---|---|
| `DerivaMLReadOnlyError` | If called in offline mode. |
| `DerivaMLSchemaPinned` | If the cache is pinned. |
| `DerivaMLSchemaRefreshBlocked` | If the workspace has pending rows and force is False. |
Source code in src/deriva_ml/core/base.py
remove_visible_column
remove_visible_column(
table: str | Table,
context: str,
column: str | list[str] | int,
) -> list[Any]
Remove a column from the visible-columns list for a specific context.
Convenience method for removing columns without replacing the entire
visible-columns annotation. Changes are staged until
apply_annotations() is called.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `str \| Table` | Table name (str) or Table object. | required |
| `context` | `str` | The context to modify (e.g., "compact"). | required |
| `column` | `str \| list[str] \| int` | Column to remove: a str column name to find and remove, a list giving a foreign key reference, or an int index into the current list. | required |

Returns:

| Type | Description |
|---|---|
| `list[Any]` | The updated column list for the context. |

Raises:

| Type | Description |
|---|---|
| `DerivaMLTableTypeError` | If table is not a valid table. |
| `DerivaMLException` | If the annotation or context doesn't exist, or the column is not found. |
Example
>>> ml.remove_visible_column("Image", "compact", "Description")
>>> ml.remove_visible_column("Image", "compact", 0)
>>> ml.apply_annotations()
Source code in src/deriva_ml/core/mixins/annotation.py
remove_visible_foreign_key
remove_visible_foreign_key(
table: str | Table,
context: str,
foreign_key: list[str] | int,
) -> list[Any]
Remove a foreign key from the visible-foreign-keys list for a specific context.
Convenience method for removing related tables without replacing the
entire visible-foreign-keys annotation. Changes are staged until
apply_annotations() is called.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `str \| Table` | Table name (str) or Table object. | required |
| `context` | `str` | The context to modify (e.g., "detailed"). | required |
| `foreign_key` | `list[str] \| int` | Foreign key to remove: a list giving the FK reference (as in the example below), or an int index into the current list. | required |

Returns:

| Type | Description |
|---|---|
| `list[Any]` | The updated foreign key list for the context. |

Raises:

| Type | Description |
|---|---|
| `DerivaMLTableTypeError` | If table is not a valid table. |
| `DerivaMLException` | If the annotation or context doesn't exist, or the foreign key is not found. |
Example
>>> ml.remove_visible_foreign_key("Subject", "detailed", ["domain", "Image_Subject_fkey"])
>>> ml.remove_visible_foreign_key("Subject", "detailed", 0)
>>> ml.apply_annotations()
Source code in src/deriva_ml/core/mixins/annotation.py
reorder_visible_columns
reorder_visible_columns(
table: str | Table,
context: str,
new_order: list[int]
| list[
str | list[str] | dict[str, Any]
],
) -> list[Any]
Reorder columns in the visible-columns list for a specific context.
Convenience method for reordering columns without manually
reconstructing the list. Changes are staged until
apply_annotations() is called.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `str \| Table` | Table name (str) or Table object. | required |
| `context` | `str` | The context to modify (e.g., "compact"). | required |
| `new_order` | `list[int] \| list[str \| list[str] \| dict[str, Any]]` | The new order specification: a list of int giving a permutation of the current indices, or a list of column entries (names, FK references, or pseudo-column dicts) in the desired order. | required |

Returns:

| Type | Description |
|---|---|
| `list[Any]` | The reordered column list. |

Raises:

| Type | Description |
|---|---|
| `DerivaMLTableTypeError` | If table is not a valid table. |
| `DerivaMLException` | If the annotation or context doesn't exist, or the index list is invalid. |
Example
>>> ml.reorder_visible_columns("Image", "compact", [2, 0, 1, 3, 4])
>>> ml.reorder_visible_columns("Image", "compact", ["Filename", "Subject", "RID"])
>>> ml.apply_annotations()
Source code in src/deriva_ml/core/mixins/annotation.py
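The index form of new_order amounts to applying a permutation to the current spec list. A minimal standalone sketch of that behavior (not the library implementation; `reorder_by_index` is a hypothetical helper):

```python
def reorder_by_index(specs: list, new_order: list[int]) -> list:
    """Return specs permuted by new_order; each current index must appear exactly once."""
    if sorted(new_order) != list(range(len(specs))):
        raise ValueError("new_order must be a permutation of the current indices")
    return [specs[i] for i in new_order]

cols = ["RID", "Filename", "Subject"]
print(reorder_by_index(cols, [2, 0, 1]))  # ['Subject', 'RID', 'Filename']
```

The permutation check mirrors why an invalid index list raises DerivaMLException: a reorder must neither drop nor duplicate entries.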
reorder_visible_foreign_keys
reorder_visible_foreign_keys(
    table: str | Table,
    context: str,
    new_order: list[int] | list[list[str] | dict[str, Any]],
) -> list[Any]

Reorder foreign keys in the visible-foreign-keys list for a specific context.

Convenience method for reordering related tables without manually reconstructing the list. Changes are staged until apply_annotations() is called.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `str \| Table` | Table name (str) or Table object. | required |
| `context` | `str` | The context to modify (e.g., `"detailed"`). | required |
| `new_order` | `list[int] \| list[list[str] \| dict[str, Any]]` | The new order specification: either a list of int giving indices into the current list in the desired order, or a full replacement list of foreign-key specs. | required |

Returns:

| Type | Description |
|---|---|
| `list[Any]` | The reordered foreign key list. |

Raises:

| Type | Description |
|---|---|
| `DerivaMLTableTypeError` | If `table` does not identify a valid table. |
| `DerivaMLException` | If the annotation or context doesn't exist, or the index list is invalid. |

Example

    ml.reorder_visible_foreign_keys("Subject", "detailed", [2, 0, 1])  # doctest: +SKIP
    ml.apply_annotations()  # doctest: +SKIP
Source code in src/deriva_ml/core/mixins/annotation.py
resolve_rid
resolve_rid(
rid: RID,
) -> ResolveRidResult
Resolves RID to catalog location.
Looks up a RID and returns information about where it exists in the catalog, including schema, table, and column metadata.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `rid` | `RID` | Resource Identifier to resolve. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| ResolveRidResult | `ResolveRidResult` | Named tuple containing: schema (schema name), table (table name), columns (column definitions), and datapath (path builder for accessing the entity). |

Raises:

| Type | Description |
|---|---|
| `DerivaMLException` | If the RID doesn't exist in the catalog. |
Examples:
>>> result = ml.resolve_rid("1-abc123")
>>> print(f"Found in {result.schema}.{result.table}")
>>> data = result.datapath.entities().fetch()
Source code in src/deriva_ml/core/mixins/rid_resolution.py
resolve_rids
resolve_rids(
    rids: set[RID] | list[RID],
    candidate_tables: list[Table] | None = None,
) -> dict[RID, BatchRidResult]
Batch resolve multiple RIDs efficiently.
Resolves multiple RIDs in batched queries, significantly faster than calling resolve_rid() for each RID individually. Instead of N network calls for N RIDs, this makes one query per candidate table.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `rids` | `set[RID] \| list[RID]` | Set or list of RIDs to resolve. | required |
| `candidate_tables` | `list[Table] \| None` | Optional list of Table objects to search in. If not provided, searches all tables in the domain and ML schemas. | None |

Returns:

| Type | Description |
|---|---|
| `dict[RID, BatchRidResult]` | Mapping from each resolved RID to its BatchRidResult containing table information. |

Raises:

| Type | Description |
|---|---|
| `DerivaMLException` | If any RID cannot be resolved. |
Example

    results = ml.resolve_rids(["1-ABC", "2-DEF", "3-GHI"])
    for rid, info in results.items():
        print(f"{rid} is in table {info.table_name}")
Source code in src/deriva_ml/core/mixins/rid_resolution.py
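The batching strategy described above (one query per candidate table rather than one per RID) can be sketched in plain Python. Catalog queries are stood in for by set-membership checks, and `batch_resolve` is a hypothetical helper, not the library code:

```python
def batch_resolve(rids, tables):
    """Resolve each RID to its table with one membership test per table.

    `tables` maps table name -> set of RIDs it contains, standing in for
    one catalog query per candidate table (instead of one per RID).
    """
    resolved, pending = {}, set(rids)
    for name, members in tables.items():
        hits = pending & members
        resolved.update(dict.fromkeys(hits, name))
        pending -= hits
        if not pending:  # stop early once everything is resolved
            break
    if pending:
        raise KeyError(f"unresolved RIDs: {sorted(pending)}")
    return resolved

catalog = {"Image": {"1-ABC"}, "Subject": {"2-DEF", "3-GHI"}}
print(batch_resolve(["1-ABC", "2-DEF"], catalog))
```

For N RIDs spread over T tables this performs at most T lookups instead of N, which is the speedup the real method claims.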
resume_execution
resume_execution(
execution_rid: RID,
) -> "Execution"
Re-hydrate an Execution from the workspace SQLite registry.
Works in both online and offline modes. The execution's recorded mode is independent of the current DerivaML instance's mode — a user can create an execution online, run it offline, then upload online, all via the same RID.
Before returning, runs just-in-time state reconciliation (spec §2.2): if online and sync_pending=True, flushes SQLite to the catalog; then checks for catalog/SQLite disagreement and applies the disagreement rules.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `execution_rid` | `RID` | Server-assigned Execution RID returned by a prior create_execution call. | required |

Returns:

| Type | Description |
|---|---|
| `Execution` | An Execution object bound to this DerivaML instance, with lifecycle fields exposed as SQLite read-through properties (see spec §2.3). |

Raises:

| Type | Description |
|---|---|
| `DerivaMLException` | If no matching executions row exists in the workspace registry. |
| `DerivaMLStateInconsistency` | If just-in-time reconciliation surfaces a disagreement outside the six documented cases (see state_machine.reconcile_with_catalog). |
Example

    ml = DerivaML(hostname="example.org", catalog_id="42")  # doctest: +SKIP
    exe = ml.resume_execution("5-ABC")
    exe.status
    exe.upload_outputs()
Source code in src/deriva_ml/core/mixins/execution.py
select_by_workflow
select_by_workflow(
*args, **kwargs
) -> FeatureRecord
Retired — use FeatureRecord.select_by_workflow(workflow, container=...) factory.
DerivaML.select_by_workflow has been removed. The replacement is a classmethod factory that returns a selector callable compatible with the selector parameter of feature_values:

    from deriva_ml.feature import FeatureRecord
    sel = FeatureRecord.select_by_workflow(workflow, container=ml)
    for rec in ml.feature_values("Image", "Quality", selector=sel):
        ...
Raises:

| Type | Description |
|---|---|
| `DerivaMLException` | Always. Points at the replacement API. |
Source code in src/deriva_ml/core/mixins/feature.py
set_column_display
set_column_display(
table: str | Table,
column_name: str,
annotation: dict[str, Any] | None,
) -> str
Set the column-display annotation on a column.
Controls how a column's values are rendered, including custom
formatting and markdown patterns. The annotation dict follows the
Chaise column-display tag specification, keyed by context name
(or "*" for all contexts), e.g.
{"*": {"pre_format": {"format": "%.2f"}}}.
Changes are staged locally until apply_annotations() is called.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `str \| Table` | Table name (str) or Table object. | required |
| `column_name` | `str` | Name of the column. | required |
| `annotation` | `dict[str, Any] \| None` | The column-display annotation dict. Set to None to remove the annotation. | required |

Returns:

| Type | Description |
|---|---|
| `str` | Column identifier (table and column name). |

Raises:

| Type | Description |
|---|---|
| `DerivaMLTableTypeError` | If `table` does not identify a valid table. |

Example

    ml.set_column_display("Measurement", "Value", {  # doctest: +SKIP
        "*": {"pre_format": {"format": "%.2f"}}
    })
    ml.apply_annotations()  # doctest: +SKIP
Source code in src/deriva_ml/core/mixins/annotation.py
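The pre_format pattern is a printf-style format string. Python's % operator shows the effect a value renderer would produce (a minimal illustration of the annotation's shape, not Chaise itself):

```python
# Hypothetical column-display annotation using a printf-style pre_format pattern,
# keyed by context name ("*" = all contexts).
annotation = {"*": {"pre_format": {"format": "%.2f"}}}

# The UI applies the pattern to each value; Python's % operator mimics that.
fmt = annotation["*"]["pre_format"]["format"]
print(fmt % 3.14159)  # 3.14
```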
set_display_annotation
set_display_annotation(
table: str | Table,
annotation: dict[str, Any] | None,
column_name: str | None = None,
) -> str
Set the Chaise display annotation on a table or column.
The display annotation controls how the table or column is labeled in
the Chaise web UI. The dict shape follows the Chaise display tag
specification, e.g. {"name": "Human Readable Name"}.
Changes are staged locally until apply_annotations() is called.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `str \| Table` | Table name (str) or Table object. | required |
| `annotation` | `dict[str, Any] \| None` | Annotation dict, e.g. `{"name": "Human Readable Name"}`. Set to None to remove the annotation. | required |
| `column_name` | `str \| None` | If provided, sets the annotation on that column; otherwise sets it on the table. | None |

Returns:

| Type | Description |
|---|---|
| `str` | Target identifier: the table name when setting on the table, or the column identifier when column_name is given. |

Raises:

| Type | Description |
|---|---|
| `DerivaMLTableTypeError` | If `table` does not identify a valid table. |

Example

    ml.set_display_annotation("Image", {"name": "Scan Image"})  # doctest: +SKIP
    ml.set_display_annotation("Image", {"name": "File Name"}, column_name="Filename")  # doctest: +SKIP
    ml.apply_annotations()  # doctest: +SKIP
Source code in src/deriva_ml/core/mixins/annotation.py
set_strict_preallocated_rid
set_strict_preallocated_rid(
table: str | Table,
strict: bool = True,
) -> str
Mark or unmark an asset table as strict-preallocated-RID.
When strict=True, deriva-py's uploader raises
DerivaUploadCatalogCreateError if an upload's caller-supplied
pre-allocated RID differs from an existing catalog row's RID for
the same MD5+Filename. When False (or the annotation is
absent), the uploader silently adopts the existing row's RID
(legacy behavior preserved for shared artifacts like
Execution_Metadata configs).
Use strict mode for tables whose rows are referenced by FK columns in the same upload batch — any unexpected RID reassignment would corrupt those references.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `str \| Table` | Asset table name or Table object. | required |
| `strict` | `bool` | If True, apply the strict annotation; if False, remove it, restoring the legacy adopt-existing-RID behavior. | True |

Returns:

| Type | Description |
|---|---|
| `str` | The table's name. |

Raises:

| Type | Description |
|---|---|
| `DerivaMLTableTypeError` | If `table` does not identify a valid asset table. |

Example

    ml.set_strict_preallocated_rid("ScanResult", strict=True)  # doctest: +SKIP
    ml.apply_annotations()  # doctest: +SKIP
Source code in src/deriva_ml/core/mixins/annotation.py
set_table_display
set_table_display(
table: str | Table,
annotation: dict[str, Any] | None,
) -> str
Set the table-display annotation on a table.
Controls table-level display options such as row-naming patterns,
default page size, and sort order. The annotation dict follows the
Chaise table-display tag specification, e.g.
{"row_name": {"row_markdown_pattern": "{{{Name}}}"}}.
Changes are staged locally until apply_annotations() is called.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `str \| Table` | Table name (str) or Table object. | required |
| `annotation` | `dict[str, Any] \| None` | The table-display annotation dict. Set to None to remove the annotation. | required |

Returns:

| Type | Description |
|---|---|
| `str` | Table name. |

Raises:

| Type | Description |
|---|---|
| `DerivaMLTableTypeError` | If `table` does not identify a valid table. |

Example

    ml.set_table_display("Subject", {  # doctest: +SKIP
        "row_name": {
            "row_markdown_pattern": "{{{Name}}} ({{{Species}}})"
        }
    })
    ml.apply_annotations()  # doctest: +SKIP
Source code in src/deriva_ml/core/mixins/annotation.py
set_visible_columns
set_visible_columns(
table: str | Table,
annotation: dict[str, Any] | None,
) -> str
Set the visible-columns annotation on a table.
Controls which columns appear in different UI contexts and their order.
The annotation is a dict mapping context names (e.g. "compact",
"detailed", "entry") to lists of column specs. Each spec may
be a plain column-name string, a foreign-key reference list
[schema, constraint_name], or a pseudo-column dict per the Chaise
visible-columns specification.
Changes are staged locally until apply_annotations() is called.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `str \| Table` | Table name (str) or Table object. | required |
| `annotation` | `dict[str, Any] \| None` | The visible-columns annotation dict. Set to None to remove the annotation. | required |

Returns:

| Type | Description |
|---|---|
| `str` | Table name. |

Raises:

| Type | Description |
|---|---|
| `DerivaMLTableTypeError` | If `table` does not identify a valid table. |

Example

    ml.set_visible_columns("Image", {  # doctest: +SKIP
        "compact": ["RID", "Filename", "Subject"],
        "detailed": ["RID", "Filename", "Subject", "Description"]
    })
    ml.apply_annotations()  # doctest: +SKIP
Source code in src/deriva_ml/core/mixins/annotation.py
set_visible_foreign_keys
set_visible_foreign_keys(
table: str | Table,
annotation: dict[str, Any] | None,
) -> str
Set the visible-foreign-keys annotation on a table.
Controls which related tables (via inbound foreign keys) appear in
different UI contexts and their order. The annotation is a dict
mapping context names to lists of FK specs. Each FK spec is a list
[schema, constraint_name] referencing an inbound foreign key, or
a pseudo-column dict per the Chaise visible-foreign-keys specification.
Changes are staged locally until apply_annotations() is called.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `str \| Table` | Table name (str) or Table object. | required |
| `annotation` | `dict[str, Any] \| None` | The visible-foreign-keys annotation dict. Set to None to remove the annotation. | required |

Returns:

| Type | Description |
|---|---|
| `str` | Table name. |

Raises:

| Type | Description |
|---|---|
| `DerivaMLTableTypeError` | If `table` does not identify a valid table. |

Example

    ml.set_visible_foreign_keys("Subject", {  # doctest: +SKIP
        "detailed": [
            ["domain", "Image_Subject_fkey"],
            ["domain", "Diagnosis_Subject_fkey"]
        ]
    })
    ml.apply_annotations()  # doctest: +SKIP
Source code in src/deriva_ml/core/mixins/annotation.py
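A visible-foreign-keys spec is either a [schema, constraint_name] pair or a pseudo-column dict. A small hypothetical validator sketches that shape check (shapes assumed from the description above; `is_valid_fk_spec` is not part of the library):

```python
from typing import Any

def is_valid_fk_spec(spec: Any) -> bool:
    """Check one entry of a visible-foreign-keys context list.

    Assumed shapes: a two-element [schema, constraint_name] list of
    strings, or a pseudo-column dict.
    """
    if isinstance(spec, dict):
        return True
    return (
        isinstance(spec, list)
        and len(spec) == 2
        and all(isinstance(part, str) for part in spec)
    )

print(is_valid_fk_spec(["domain", "Image_Subject_fkey"]))  # True
```

Note that a bare column-name string, valid in visible-columns, is not a valid foreign-key spec here.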
unpin_schema
unpin_schema() -> None
Clear the schema-cache pin. No-op if not pinned.
Works in any mode. After unpinning, refresh_schema() is allowed again (subject to the pending-rows guard).

Raises:

| Type | Description |
|---|---|
| `FileNotFoundError` | If the workspace has no cache file. |
Source code in src/deriva_ml/core/base.py
upload_pending
upload_pending(
*,
execution_rids: "list[RID] | None" = None,
retry_failed: bool = False,
) -> "UploadReport"
Blocking upload of pending state for selected executions.
Flushes all pending rows (catalog inserts, asset uploads) for the named executions to the live catalog. Blocks until complete. For a non-blocking version that returns a job handle, use _start_upload(). Online mode only.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `execution_rids` | `list[RID] \| None` | List of RIDs, or None to drain every execution that has pending work. | None |
| `retry_failed` | `bool` | Include rows in status='failed'. | False |

Returns:

| Type | Description |
|---|---|
| `UploadReport` | UploadReport with totals, per-table counts, and error lines. |

Example

    report = ml.upload_pending()  # doctest: +SKIP
    print(f"{report.total_uploaded} uploaded, "
          f"{report.total_failed} failed")
Source code in src/deriva_ml/core/mixins/execution.py
validate_schema
validate_schema(
strict: bool = False,
) -> "SchemaValidationReport"
Validate that the catalog's ML schema matches the expected structure.
This method inspects the catalog schema and verifies that it contains all the required tables, columns, vocabulary terms, and relationships that are created by the ML schema initialization routines in create_schema.py.
The validation checks:

- All required ML tables exist (Dataset, Execution, Workflow, etc.)
- All required columns exist with correct types
- All required vocabulary tables exist (Asset_Type, Dataset_Type, etc.)
- All required vocabulary terms are initialized
- All association tables exist for relationships

In strict mode, the validator also reports errors for:

- Extra tables not in the expected schema
- Extra columns not in the expected table definitions
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `strict` | `bool` | If True, extra tables and columns are reported as errors. If False (default), they are reported as informational items. Use strict=True to verify that a clean ML catalog matches exactly; use strict=False to validate a catalog that may have domain extensions. | False |

Returns:

| Type | Description |
|---|---|
| `SchemaValidationReport` | Validation results. Key attributes: is_valid (True if no errors were found), errors, warnings, info, to_text() (human-readable report), and to_dict() (JSON-serializable dictionary). |
Example

    ml = DerivaML('localhost', 'my_catalog')
    report = ml.validate_schema(strict=False)
    if report.is_valid:
        print("Schema is valid!")
    else:
        print(report.to_text())

    # Strict validation for a fresh ML catalog
    report = ml.validate_schema(strict=True)
    print(f"Found {len(report.errors)} errors, {len(report.warnings)} warnings")

    # Get report as dictionary for JSON/logging
    import json
    print(json.dumps(report.to_dict(), indent=2))
Note
This method validates the ML schema (typically 'deriva-ml'), not the domain schema. Domain-specific tables and columns are not checked unless they are part of the ML schema itself.
See Also
- deriva_ml.schema.validation.SchemaValidationReport
- deriva_ml.schema.validation.validate_ml_schema
Source code in src/deriva_ml/core/base.py
DerivaMLConfig
Bases: BaseModel
Configuration model for DerivaML instances.
This Pydantic model defines all configurable parameters for a DerivaML instance. It can be used directly or via Hydra configuration files.
Attributes:

| Name | Type | Description |
|---|---|---|
| `hostname` | `str` | Hostname of the Deriva server (e.g., 'deriva.example.org'). |
| `catalog_id` | `str \| int` | Catalog identifier, either numeric ID or catalog name. |
| `domain_schemas` | `str \| set[str] \| None` | Optional set of domain schema names. If None, auto-detects all non-system schemas. Use this when working with catalogs that have multiple user-defined schemas. |
| `default_schema` | `str \| None` | The default schema for table creation operations. If None and there is exactly one domain schema, that schema is used. If there are multiple domain schemas, this must be specified for table creation to work without explicit schema parameters. |
| `project_name` | `str \| None` | Project name for organizing outputs. Defaults to default_schema. |
| `cache_dir` | `str \| Path \| None` | Directory for caching downloaded datasets. Defaults to working_dir/cache. |
| `working_dir` | `str \| Path \| None` | Base directory for computation data. Defaults to ~/deriva-ml. |
| `hydra_runtime_output_dir` | `str \| Path \| None` | Hydra's runtime output directory (set automatically). |
| `ml_schema` | `str` | Schema name for ML tables. Defaults to 'deriva-ml'. |
| `logging_level` | `Any` | Logging level for DerivaML. Defaults to WARNING. |
| `deriva_logging_level` | `Any` | Logging level for Deriva libraries. Defaults to WARNING. |
| `credential` | `Any` | Authentication credentials. If None, retrieved automatically. |
| `s3_bucket` | `str \| None` | S3 bucket URL for dataset bag storage (e.g., 's3://my-bucket'). If provided, enables MINID creation and S3 upload for dataset exports. If None, MINID functionality is disabled regardless of the use_minid setting. |
| `use_minid` | `bool \| None` | Whether to use the MINID service for dataset bags. Only effective when s3_bucket is configured. Defaults to True when s3_bucket is set, False otherwise. |
| `check_auth` | `bool` | Whether to verify authentication on connection. Defaults to True. |
| `clean_execution_dir` | `bool` | Whether to automatically clean execution working directories after successful upload. Defaults to True. Set to False to retain local copies of execution outputs for debugging or manual inspection. |
| `mode` | `ConnectionMode \| str` | Connection mode. |

Example

    config = DerivaMLConfig(
        hostname='deriva.example.org',
        catalog_id=1,
        default_schema='my_domain',
        logging_level=logging.INFO
    )
Source code in src/deriva_ml/core/config.py
compute_workdir
staticmethod
compute_workdir(
working_dir: str | Path | None,
catalog_id: str | int | None = None,
hostname: str | None = None,
) -> Path
Compute the effective working directory path.
Creates a standardized working directory path. If a base directory is provided, appends the current username to prevent conflicts between users. If no directory is provided, uses ~/.deriva-ml. The hostname and catalog_id are appended to separate data from different servers and catalogs.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `working_dir` | `str \| Path \| None` | Base working directory path, or None for the default. | required |
| `catalog_id` | `str \| int \| None` | Catalog identifier to include in the path. If None, no catalog subdirectory is created. | None |
| `hostname` | `str \| None` | Server hostname to include in the path. If None, no hostname subdirectory is created. | None |

Returns:

| Name | Type | Description |
|---|---|---|
| Path | `Path` | Absolute path to the working directory. |

Example

    >>> DerivaMLConfig.compute_workdir('/shared/data', '52', 'ml.example.org')
    PosixPath('/shared/data/username/deriva-ml/ml.example.org/52')
    >>> DerivaMLConfig.compute_workdir(None, 1, 'localhost')
    PosixPath('/home/username/.deriva-ml/localhost/1')
Source code in src/deriva_ml/core/config.py
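The path composition described above can be re-derived with pathlib. This is a hypothetical sketch of the documented layout, not the actual implementation (`sketch_workdir` and its explicit `username` parameter are illustrative; the real method discovers the current user itself):

```python
from pathlib import Path

def sketch_workdir(base, username, hostname=None, catalog_id=None):
    """Hypothetical re-derivation of the documented path layout."""
    if base:
        root = Path(base) / username / "deriva-ml"  # per-user subtree under the base
    else:
        root = Path.home() / ".deriva-ml"           # default location
    for part in (hostname, catalog_id):             # separate servers and catalogs
        if part is not None:
            root = root / str(part)
    return root

print(sketch_workdir("/shared/data", "alice", "ml.example.org", "52"))
```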
init_working_dir
init_working_dir() -> DerivaMLConfig
Initialize working directory and resolve use_minid after model validation.
Sets up the working directory path, computing a default if not specified. Also captures Hydra's runtime output directory for logging and outputs.
Resolves the use_minid flag based on s3_bucket configuration:

- If use_minid is explicitly set, that value is used (but it only takes effect when s3_bucket is set).
- If use_minid is None (auto), it is set to True if s3_bucket is configured, False otherwise.

This validator runs after all field validation and ensures the working directory is available for Hydra configuration resolution.

Returns:

| Name | Type | Description |
|---|---|---|
| Self | `DerivaMLConfig` | The configuration instance with initialized paths. |
Source code in src/deriva_ml/core/config.py
DerivaMLException
Bases: Exception
Base exception class for all DerivaML errors.
This is the root exception for all DerivaML-specific errors. Catching this exception will catch any error raised by the DerivaML library.
Attributes:

| Name | Description |
|---|---|
| `_msg` | The error message stored for later access. |

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `msg` | `str` | Descriptive error message. Defaults to empty string. | `''` |

Example

    >>> raise DerivaMLException("Failed to connect to catalog")  # doctest: +SKIP
    DerivaMLException: Failed to connect to catalog
Source code in src/deriva_ml/core/exceptions.py
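Because every DerivaML error derives from DerivaMLException, a single except clause covers them all. A minimal sketch of the hierarchy (the subclass shown is illustrative of the specialized types, not a copy of the library source):

```python
class DerivaMLException(Exception):
    """Sketch of the documented base class: message kept in _msg."""

    def __init__(self, msg: str = ""):
        super().__init__(msg)
        self._msg = msg  # stored for later access, as documented

class DerivaMLNotFoundError(DerivaMLException):
    """Illustrative subclass standing in for the specialized exceptions."""

try:
    raise DerivaMLNotFoundError("no such RID")
except DerivaMLException as e:  # the base class catches any DerivaML error
    caught = e._msg

print(caught)  # no such RID
```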
DerivaMLInvalidTerm
Bases: DerivaMLNotFoundError
Exception raised when a vocabulary term is not found or invalid.
Raised when attempting to look up or use a term that doesn't exist in a controlled vocabulary table, or when a term name/synonym cannot be resolved.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `vocabulary` | `str` | Name of the vocabulary table being searched. | required |
| `term` | `str` | The term name that was not found. | required |
| `msg` | `str` | Additional context about the error. | `"Term doesn't exist"` |

Example

    >>> raise DerivaMLInvalidTerm("Diagnosis", "unknown_condition")  # doctest: +SKIP
    DerivaMLInvalidTerm: Invalid term unknown_condition in vocabulary Diagnosis: Term doesn't exist.
Source code in src/deriva_ml/core/exceptions.py
DerivaMLTableTypeError
Bases: DerivaMLDataError
Exception raised when a RID or table is not of the expected type.
Raised when an operation requires a specific table type (e.g., Dataset, Execution) but receives a RID or table reference of a different type.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `table_type` | `str` | The expected table type (e.g., "Dataset", "Execution"). | required |
| `table` | `str` | The actual table name or RID that was provided. | required |

Example

    >>> raise DerivaMLTableTypeError("Dataset", "1-ABC123")  # doctest: +SKIP
    DerivaMLTableTypeError: Table 1-ABC123 is not of type Dataset.
Source code in src/deriva_ml/core/exceptions.py
ExecAssetType
Bases: StrEnum
Execution asset type identifiers.
Defines the types of assets that can be produced or consumed during an execution. These types are used to categorize files associated with workflow runs.
Attributes:

| Name | Type | Description |
|---|---|---|
| `input_file` | `str` | Input file consumed by the execution. |
| `output_file` | `str` | Output file produced by the execution. |
| `notebook_output` | `str` | Jupyter notebook output from the execution. |
| `model_file` | `str` | Machine learning model file (e.g., .pkl, .h5, .pt). |
Source code in src/deriva_ml/core/enums.py
ExecMetadataType
Bases: StrEnum
Execution metadata type identifiers.
Defines the types of metadata that can be associated with an execution.
Attributes:

| Name | Type | Description |
|---|---|---|
| `execution_config` | `str` | General execution configuration data. |
| `runtime_env` | `str` | Runtime environment information. |
| `hydra_config` | `str` | Hydra YAML configuration files (config.yaml, overrides.yaml). |
| `deriva_config` | `str` | DerivaML execution configuration (configuration.json). |
| `metrics_file` | `str` | Training-metric log file (typically JSONL, one record per evaluation point: per epoch, per eval step, etc.), written during execution. |
Source code in src/deriva_ml/core/enums.py
130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 | |
FileSpec
Bases: BaseModel
Specification for a file to be added to the Deriva catalog.
Represents file metadata required for creating entries in the File table. Handles URL normalization, ensuring local file paths are converted to tag URIs that uniquely identify the file's origin.
Attributes:

| Name | Type | Description |
|---|---|---|
| `url` | `str` | File location as URL or local path. Local paths are converted to tag URIs. |
| `md5` | `str` | MD5 checksum for integrity verification. |
| `length` | `int` | File size in bytes. |
| `description` | `str \| None` | Optional description of the file's contents or purpose. |
| `file_types` | `list[str] \| None` | List of file type classifications from the Asset_Type vocabulary. |

Note

The 'File' type is automatically added to file_types if not present when using create_filespecs().

Example

    spec = FileSpec(
        url="/data/results.csv",
        md5="d41d8cd98f00b204e9800998ecf8427e",
        length=1024,
        description="Analysis results",
        file_types=["CSV", "Data"]
    )
Source code in src/deriva_ml/core/filespec.py
create_filespecs
classmethod
create_filespecs(
path: Path | str,
description: str,
file_types: list[str]
| Callable[[Path], list[str]]
| None = None,
) -> Generator[FileSpec, None, None]
Generate FileSpec objects for a file or directory.
Creates FileSpec objects with computed MD5 checksums for each file found. For directories, recursively processes all files. The 'File' type is automatically prepended to file_types if not already present.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path \| str` | Path to a file or directory. If a directory, all files are processed recursively. | required |
| `description` | `str` | Description to apply to all generated FileSpecs. | required |
| `file_types` | `list[str] \| Callable[[Path], list[str]] \| None` | Either a static list of file types, or a callable that takes a Path and returns a list of types for that specific file. Allows dynamic type assignment based on file extension, content, etc. | None |

Yields:

| Name | Type | Description |
|---|---|---|
| FileSpec | `FileSpec` | A specification for each file with computed checksums and metadata. |

Example

Static file types:

    >>> specs = FileSpec.create_filespecs("/data/images", "Images", ["Image"])  # doctest: +SKIP

Dynamic file types based on extension:

    >>> def get_types(path):
    ...     ext = path.suffix.lower()
    ...     return {".png": ["PNG", "Image"], ".jpg": ["JPEG", "Image"]}.get(ext, [])
    >>> specs = FileSpec.create_filespecs("/data", "Mixed files", get_types)  # doctest: +SKIP
Source code in src/deriva_ml/core/filespec.py
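The per-file checksum work that create_filespecs performs can be sketched with hashlib; `file_stats` is a hypothetical helper, not the library code:

```python
import hashlib
import tempfile
from pathlib import Path

def file_stats(path: Path) -> tuple[str, int]:
    """Return (MD5 hex digest, size in bytes), reading the file in chunks."""
    h, size = hashlib.md5(), 0
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
            size += len(chunk)
    return h.hexdigest(), size

with tempfile.TemporaryDirectory() as d:
    p = Path(d) / "empty.bin"
    p.write_bytes(b"")
    md5, length = file_stats(p)

# An empty file yields the well-known MD5 of zero bytes.
print(md5, length)  # d41d8cd98f00b204e9800998ecf8427e 0
```

Chunked reading keeps memory flat for large assets, which matters when specs are generated over whole directories.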
read_filespec
staticmethod
read_filespec(
path: Path | str,
) -> Generator[FileSpec, None, None]
Read FileSpec objects from a JSON Lines file.
Parses a JSONL file where each line is a JSON object representing a FileSpec. Empty lines are skipped. This is useful for batch processing pre-computed file specifications.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | Path \| str | Path to the .jsonl file containing FileSpec data. | required |
Yields:

| Name | Type | Description |
|---|---|---|
| FileSpec | FileSpec | Parsed FileSpec object for each valid line. |
Example
>>> for spec in FileSpec.read_filespec("files.jsonl"):  # doctest: +SKIP
...     print(f"{spec.url}: {spec.md5}")
Source code in src/deriva_ml/core/filespec.py
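The JSON Lines parsing described above, one object per line with blank lines skipped, amounts to the following sketch. The reader and record fields here are illustrative, not the real FileSpec model:

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

# Sketch of the described JSONL parsing: one JSON object per non-empty line.
def read_jsonl(path):
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:                      # skip blank lines
                yield json.loads(line)

with TemporaryDirectory() as d:
    jl = Path(d) / "files.jsonl"
    # Two records separated by a blank line, which must be ignored.
    jl.write_text('{"url": "tag://a", "md5": "abc"}\n\n{"url": "tag://b", "md5": "def"}\n')
    records = list(read_jsonl(jl))
```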
validate_file_url
classmethod
validate_file_url(url: str) -> str
Examine the provided URL. If it's a local path, convert it into a tag URL.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| url | str | The URL to validate and potentially convert. | required |

Returns:

| Type | Description |
|---|---|
| str | The validated/converted URL. |

Raises:

| Type | Description |
|---|---|
| ValidationError | If the URL is not a file URL. |
Source code in src/deriva_ml/core/filespec.py
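The local-path-to-tag-URL conversion can be illustrated as follows. This is only a sketch of the control flow: the placeholder tag format below is made up, and the real classmethod is a Pydantic validator raising ValidationError rather than ValueError:

```python
from urllib.parse import urlparse

# Illustrative logic only: the real classmethod converts local paths into
# Deriva tag URLs; the tag format below is a made-up placeholder.
def validate_file_url(url: str) -> str:
    scheme = urlparse(url).scheme
    if scheme in ("", "file"):               # a bare path or file:// URL
        path = url.removeprefix("file://")
        return f"tag://example,2024:{path}"  # placeholder tag URL
    if scheme != "tag":
        raise ValueError(f"Not a file URL: {url}")
    return url

converted = validate_file_url("/data/images/photo.png")
passthrough = validate_file_url("tag://example,2024:/data/x")
```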
FileUploadState
Bases: BaseModel
Tracks the state and result of a file upload operation.
Attributes:

| Name | Type | Description |
|---|---|---|
| state | UploadState | Current state of the upload (success, failed, etc.). |
| status | str | Detailed status message. |
| result | Any | Upload result data, if any. |
Source code in src/deriva_ml/core/ermrest.py
LoggerMixin
Mixin class that provides a _logger attribute.
Classes that inherit from this mixin get a _logger property that returns a child logger under the deriva_ml namespace, named after the class.
Example
>>> class MyProcessor(LoggerMixin):
...     def process(self):
...         self._logger.info("Processing started")  # logs to 'deriva_ml.MyProcessor'
Source code in src/deriva_ml/core/logging_config.py
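The mixin pattern described above is small enough to sketch in full. This is a minimal reimplementation of the idea, assuming the documented behavior (a child logger named `deriva_ml.<ClassName>`), not the actual source:

```python
import logging

# Minimal sketch of the described mixin: a property returning a child logger
# under the "deriva_ml" namespace, named after the concrete class.
class LoggerMixin:
    @property
    def _logger(self) -> logging.Logger:
        return logging.getLogger(f"deriva_ml.{type(self).__name__}")

class MyProcessor(LoggerMixin):
    def process(self) -> str:
        self._logger.info("Processing started")
        return self._logger.name

name = MyProcessor().process()
```

Because the logger is looked up by name, all instances of a class share one logger, and its level is controlled through the parent `deriva_ml` logger.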
MLAsset
Bases: StrEnum
Asset type identifiers.
Defines the types of assets that can be associated with executions.
Attributes:

| Name | Type | Description |
|---|---|---|
| execution_metadata | str | Metadata about an execution. |
| execution_asset | str | Asset produced by an execution. |
Source code in src/deriva_ml/core/enums.py
MLVocab
Bases: StrEnum
Controlled vocabulary table identifiers.
Defines the names of controlled vocabulary tables used in DerivaML. These tables store standardized terms with descriptions and synonyms for consistent data classification across the catalog.
Attributes:

| Name | Type | Description |
|---|---|---|
| dataset_type | str | Dataset classification vocabulary (e.g., "Training", "Test"). |
| workflow_type | str | Workflow classification vocabulary (e.g., "Python", "Notebook"). |
| asset_type | str | Asset/file type classification vocabulary (e.g., "Image", "CSV"). |
| asset_role | str | Asset role vocabulary for execution relationships (e.g., "Input", "Output"). |
| execution_status | str | Execution status vocabulary for execution lifecycle states. |
| feature_name | str | Feature name vocabulary for ML feature definitions. |
Source code in src/deriva_ml/core/enums.py
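Because MLVocab is a StrEnum, its members compare equal to plain strings and can be passed wherever a vocabulary table name is expected. A sketch of that behavior using the equivalent `str` + `Enum` mixin (the member values below are illustrative, not the real schema names):

```python
from enum import Enum

# Sketch of a StrEnum-style vocabulary enum. Member values here are
# illustrative placeholders, not the actual deriva-ml table names.
class MLVocab(str, Enum):
    dataset_type = "Dataset_Type"
    workflow_type = "Workflow_Type"
    asset_type = "Asset_Type"

# The str mixin means members compare equal to plain strings.
table_name = MLVocab.dataset_type.value
```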
UploadState
Bases: Enum
File upload operation states.
Represents the various states a file upload operation can be in, from initiation to completion.
Attributes:

| Name | Type | Description |
|---|---|---|
| success | int | Upload completed successfully. |
| failed | int | Upload failed. |
| pending | int | Upload is queued. |
| running | int | Upload is in progress. |
| paused | int | Upload is temporarily paused. |
| aborted | int | Upload was aborted. |
| cancelled | int | Upload was cancelled. |
| timeout | int | Upload timed out. |
Source code in src/deriva_ml/core/enums.py
configure_logging
configure_logging(
level: int = logging.WARNING,
deriva_level: int | None = None,
format_string: str = DEFAULT_FORMAT,
handler: Handler | None = None,
) -> logging.Logger
Configure logging for DerivaML and related libraries.
This function sets up logging levels for DerivaML, related libraries (deriva-py, bdbag, bagit), and Hydra loggers. It is designed to:
- Configure only specific logger namespaces, not the root logger
- Respect Hydra's logging configuration when running under Hydra
- Allow deriva-py libraries to have a separate logging level
The logging level hierarchy:

- deriva_ml logger: uses `level`
- Hydra loggers: follow `level` (the deriva_ml level)
- Deriva/bdbag/bagit loggers: use `deriva_level` (defaults to `level`)

When running under Hydra:

- Only sets log levels on specific loggers
- Does NOT add handlers (Hydra has already configured them)
- Does NOT call basicConfig()

When running standalone (no Hydra):

- Sets log levels on specific loggers
- Adds a StreamHandler to the deriva_ml logger if none exists
- Still does NOT touch the root logger or call basicConfig()
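The standalone path above can be sketched with the standard `logging` module. This is a simplified illustration of the namespace-scoped approach (set levels on specific loggers, attach a handler only to `deriva_ml`, never touch root), not the actual `configure_logging` source:

```python
import logging

# Minimal sketch of the standalone path: level per namespace, one handler
# on the deriva_ml logger, and no basicConfig() or root-logger changes.
def setup_logging(level=logging.WARNING, deriva_level=None):
    deriva_level = level if deriva_level is None else deriva_level
    ml_logger = logging.getLogger("deriva_ml")
    ml_logger.setLevel(level)
    for name in ("deriva", "bdbag", "bagit"):
        logging.getLogger(name).setLevel(deriva_level)
    if not ml_logger.handlers:               # add a handler only if none exists
        ml_logger.addHandler(logging.StreamHandler())
    return ml_logger

log = setup_logging(level=logging.INFO, deriva_level=logging.ERROR)
```

Scoping configuration to named loggers is what lets DerivaML coexist with Hydra: Hydra configures the root logger via dictConfig, and this code never touches it.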
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| level | int | Log level for deriva_ml and Hydra loggers. Defaults to WARNING. | WARNING |
| deriva_level | int \| None | Log level for deriva-py libraries (deriva, bagit, bdbag). If None, uses the same value as level. | None |
| format_string | str | Format string for log messages (used only when adding handlers outside a Hydra context). | DEFAULT_FORMAT |
| handler | Handler \| None | Optional handler to add to the deriva_ml logger. If None and not running under Hydra, uses a StreamHandler with format_string. | None |
Returns:

| Type | Description |
|---|---|
| Logger | The configured deriva_ml logger. |
Example
>>> import logging
>>> # Same level for everything
>>> configure_logging(level=logging.DEBUG)
>>> # Verbose DerivaML, quieter deriva-py libraries
>>> configure_logging(
...     level=logging.INFO,
...     deriva_level=logging.WARNING,
... )
Source code in src/deriva_ml/core/logging_config.py
get_logger
get_logger(
name: str | None = None,
) -> logging.Logger
Get a DerivaML logger.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str \| None | Optional sub-logger name. If provided, returns a child logger under the deriva_ml namespace (e.g., 'deriva_ml.dataset'). If None, returns the main deriva_ml logger. | None |
Returns:

| Type | Description |
|---|---|
| Logger | The configured logger instance. |
Example
>>> logger = get_logger()  # Main deriva_ml logger
>>> dataset_logger = get_logger("dataset")  # deriva_ml.dataset
Source code in src/deriva_ml/core/logging_config.py
is_hydra_initialized
is_hydra_initialized() -> bool
Check if running within an initialized Hydra context.
This is used to determine whether Hydra is managing logging configuration. When Hydra is initialized, we avoid adding handlers or calling basicConfig since Hydra has already configured logging via dictConfig.
Returns:

| Type | Description |
|---|---|
| bool | True if Hydra's GlobalHydra is initialized, False otherwise. |
Example
>>> if is_hydra_initialized():
...     # Hydra is managing logging
...     pass
Source code in src/deriva_ml/core/logging_config.py
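A plausible shape for this check, based on Hydra's public `GlobalHydra` singleton, is sketched below. The import guard is an assumption added so the sketch degrades to False when hydra-core is not installed:

```python
# Sketch of the described check. GlobalHydra is Hydra's singleton holding
# the current initialization state; the ImportError guard is an assumption.
def hydra_is_initialized() -> bool:
    try:
        from hydra.core.global_hydra import GlobalHydra
    except ImportError:
        return False  # no hydra-core installed, so Hydra cannot be managing logging
    return GlobalHydra.instance().is_initialized()

managed_by_hydra = hydra_is_initialized()
```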