# Sharing and collaboration
DerivaML gives you three ways to share data: a citable persistent identifier (MINID) for a versioned dataset, a portable BDBag archive that a collaborator can unpack without catalog access, and a partial catalog clone (`create_ml_workspace`) for setting up a fully functional multi-site workspace. By the end of this chapter you will know when to use each mechanism and how to configure the key parameters that control copying behavior.
These mechanisms are not mutually exclusive. A typical handoff might share a MINID for citation purposes, attach a BDBag archive for immediate offline use, and maintain a clone for ongoing collaboration on a shared server.
## The three mechanisms at a glance
| Mechanism | What you share | Catalog required by recipient |
|---|---|---|
| MINID | A persistent URL (registered with FAIR identifier service) | No — URL resolves to a BDBag archive |
| BDBag archive | A `.bag.zip` file containing all tables and asset URLs | No — local inspection only |
| Catalog clone | A new catalog on any Deriva server | Yes — recipient connects with DerivaML |
Choose a MINID when you want a stable, citable reference that will survive a URL change. Choose a BDBag archive when you need a portable snapshot that a collaborator can hand to code directly. Choose a catalog clone when your collaborator needs to run workflows, add features, or browse data interactively.
## How to share a dataset with a MINID
A MINID (Minimal Viable Identifier) is a globally unique, resolvable identifier registered with a FAIR identifier service. When you download a dataset with `use_minid=True`, DerivaML uploads the bag archive to S3 and registers the resulting URL with the MINID service. The returned identifier remains stable even if your catalog server moves.

**Requirements.** Your catalog must be configured with an S3 bucket. If it is not, `download_dataset_bag(use_minid=True)` raises `DerivaMLException` with a message explaining the missing configuration.
```python
from deriva_ml import DerivaML

ml = DerivaML(hostname="catalog.example.org", catalog_id="1")
dataset = ml.lookup_dataset("1-ABC4")

# Download version 2.0.0 and register a MINID for it
bag = dataset.download_dataset_bag(
    version="2.0.0",
    use_minid=True,
    materialize=True,
)

# The MINID URL is stored on the dataset version record
print(bag.version)  # "2.0.0"
```
The MINID URL is stored in the catalog's dataset version record. Calling `download_dataset_bag(use_minid=True)` again for the same version returns the cached MINID rather than re-uploading, unless the dataset's download spec has changed (tracked via a spec-hash on the version record).
A collaborator who receives the MINID URL can resolve it to the BDBag archive and materialize it without any DerivaML installation:
```bash
# Collaborator's machine — bdbag CLI only
bdbag --resolve-fetch all /path/to/unpacked/bag
```
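If the collaborator prefers to script the resolve step, the same operations are exposed through bdbag's Python API, which ships in the same pip package. A minimal sketch (function names per bdbag's `bdbag_api` module; verify against your installed version):

```python
# Sketch: resolve and validate a bag with bdbag's Python API instead of the CLI.
# Assumes a recent bdbag release; names are from the bdbag.bdbag_api module.
from bdbag import bdbag_api

bag_dir = "/path/to/unpacked/bag"

# Download every remote file listed in fetch.txt into the bag's data/ directory
bdbag_api.resolve_fetch(bag_dir, force=True)

# Full validation recomputes checksums for all payload files
bdbag_api.validate_bag(bag_dir, fast=False)
```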
**Notes**

- `use_minid=True` requires `s3_bucket` to be configured on the catalog; confirm this with your catalog administrator before attempting registration.
- MINIDs are version-specific — each call with a different `version` argument creates a separate identifier.
- MINID registration is idempotent for a given version: DerivaML checks the spec-hash stored on the version record and only re-registers if the download spec has changed.
- Collaborators who receive only the MINID URL can inspect and use the data without a Deriva account.
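If you are not sure whether the catalog has an S3 bucket configured (first note above), a guarded call avoids a hard failure. A minimal sketch, assuming `DerivaMLException` is importable from the top-level `deriva_ml` package:

```python
# Sketch: fall back to a plain local bag when MINID registration is unavailable.
# Assumption: DerivaMLException is exported by the top-level deriva_ml package.
from deriva_ml import DerivaML, DerivaMLException

ml = DerivaML(hostname="catalog.example.org", catalog_id="1")
dataset = ml.lookup_dataset("1-ABC4")

try:
    bag = dataset.download_dataset_bag(version="2.0.0", use_minid=True)
except DerivaMLException as e:
    print(f"MINID registration unavailable: {e}")
    bag = dataset.download_dataset_bag(version="2.0.0")  # local bag, no MINID
```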
## How to share a bag archive
`download_dataset_bag` always produces a `.bag.zip` archive regardless of whether a MINID is registered. Without `use_minid=True`, the archive is built client-side from the catalog and stored in the local cache directory.
```python
bag = dataset.download_dataset_bag(
    version="2.0.0",
    materialize=False,  # metadata tables only; omit assets for a smaller archive
)

print(bag.path)  # Path to the local unpacked bag directory
```
The archive at `bag.path.parent / f"Dataset_{dataset_rid}_{version}.bag.zip"` contains:

- `data/` — CSV exports of every table reachable from the dataset via foreign key paths
- `fetch.txt` — a list of remote asset URLs (Hatrac) for any files that were not materialized
- `bag-info.txt` — BDBag metadata: dataset RID, version, checksum, creation date
- `bagit.txt` — BagIt specification marker
To inspect the archive without DerivaML:
```bash
# Install the bdbag CLI tool
pip install bdbag

bdbag --validate fast /path/to/Dataset_1-ABC4_2.0.0.bag.zip
bdbag --resolve-fetch all /path/to/Dataset_1-ABC4_2.0.0.bag.zip
```
A collaborator who has DerivaML can open the bag directly with `DatasetBag`:

```python
from deriva_ml.dataset import DatasetBag

bag = DatasetBag("/path/to/Dataset_1-ABC4")
df = bag.get_table_as_dataframe("Subject")
```
**Notes**

- Asset files are not included in the archive unless `materialize=True`; `fetch.txt` contains their remote URLs instead.
- The `bdbag` CLI tool is available on PyPI (`pip install bdbag`) and is the only dependency needed to validate and materialize bags without DerivaML.
- Bag contents are determined by the catalog's export annotation on the dataset's root table, which follows all FK paths and stops at other dataset element types.
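Because `fetch.txt` follows the standard BagIt fetch format (one `url length path` triple per line, whitespace-separated), you can preview what a full materialization would download with a few lines of plain Python, with no DerivaML or bdbag required:

```python
# Sketch: list the remote assets a bag would fetch on materialization.
# fetch.txt lines follow the BagIt spec: "<url> <length> <path>";
# length may be "-" when the size is unknown.
from pathlib import Path

bag_dir = Path("/path/to/Dataset_1-ABC4")

for line in (bag_dir / "fetch.txt").read_text().splitlines():
    url, length, filename = line.split(maxsplit=2)
    size = "unknown size" if length == "-" else f"{int(length)} bytes"
    print(f"{filename}: {size} at {url}")
```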
## How to clone a subset of a catalog
`create_ml_workspace` creates a new catalog on a Deriva server containing only the data reachable from a specified root RID. This is the right choice when a collaborator needs a fully functional catalog they can connect to with DerivaML and use for further analysis or model training.

```python
from deriva_ml.catalog.clone import create_ml_workspace, OrphanStrategy, AssetCopyMode

result = create_ml_workspace(
    source_hostname="catalog.example.org",
    source_catalog_id="1",
    root_rid="3-HXMC",                      # Starting point for data reachability
    dest_hostname="local.example.org",      # Where to create the new catalog
    alias="my-project-workspace",           # Optional human-readable alias
    asset_mode=AssetCopyMode.REFERENCES,    # Keep asset URLs pointing to source
    orphan_strategy=OrphanStrategy.DELETE,  # Delete rows with broken FK references
    add_ml_schema=True,                     # Add deriva-ml tracking schema
    reinitialize_dataset_versions=True,     # Fix dataset version snapshots
)

print(f"Created catalog {result.catalog_id} on {result.hostname}")
print(result.report.to_text())  # Human-readable clone report
```
### The three-stage approach
`create_ml_workspace` solves a practical problem with real-world catalogs: row-level access policies frequently hide some rows from the cloning user, which would cause foreign key violations if the clone were applied naively. The three-stage approach handles this safely:

**Stage 1 — Schema without foreign keys.** The destination catalog is created with all tables and columns present, but no FK constraints yet. This allows the data copy to proceed even if some referenced rows are missing.

**Stage 2 — Async data copy.** Data is copied concurrently across all discovered tables. Discovery uses the root table's export annotation (if present) to follow the same FK paths that bag export uses. When no export annotation exists, it falls back to FK graph traversal. Vocabulary tables and association tables are auto-included by default.

**Stage 3 — FK application with orphan handling.** After all data is loaded, FK constraints are applied one by one. Rows that would violate a constraint (because their referenced row was hidden from the cloning user) are handled according to the `orphan_strategy` you specify.
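The same pattern can be demonstrated with any SQL engine. A runnable toy example using Python's built-in sqlite3 (this is not deriva-ml code; it only illustrates why deferring FK enforcement makes the copy safe and how orphans surface afterwards):

```python
# Toy illustration of the three-stage pattern, not deriva-ml internals.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = OFF")  # Stage 1: FK enforcement deferred

con.execute("CREATE TABLE subject (rid TEXT PRIMARY KEY)")
con.execute(
    "CREATE TABLE image (rid TEXT PRIMARY KEY, "
    "subject_rid TEXT REFERENCES subject(rid))"
)

# Stage 2: copy data; image 'I2' references a subject hidden from the cloner
con.executemany("INSERT INTO subject VALUES (?)", [("S1",)])
con.executemany("INSERT INTO image VALUES (?, ?)", [("I1", "S1"), ("I2", "S-hidden")])

# Stage 3: enable enforcement and find the orphans before applying constraints
con.execute("PRAGMA foreign_keys = ON")
orphans = con.execute(
    "SELECT rid FROM image WHERE subject_rid NOT IN (SELECT rid FROM subject)"
).fetchall()
print(orphans)  # [('I2',)]: handle per orphan_strategy (FAIL, DELETE, or NULLIFY)
```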
### Key parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `source_hostname` | `str` | — | Source catalog server |
| `source_catalog_id` | `str` | — | Source catalog ID |
| `root_rid` | `str` | — | Starting RID; determines data reachability |
| `dest_hostname` | `str \| None` | Same as source | Destination server |
| `alias` | `str \| None` | `None` | Human-readable alias for the new catalog |
| `asset_mode` | `AssetCopyMode` | `REFERENCES` | How to handle asset files (see below) |
| `orphan_strategy` | `OrphanStrategy` | `FAIL` | What to do with broken FK references |
| `add_ml_schema` | `bool` | `True` | Add deriva-ml tracking schema to clone |
| `reinitialize_dataset_versions` | `bool` | `True` | Reinitialize dataset version snapshots in the clone |
| `copy_annotations` | `bool` | `True` | Copy Chaise/ERMrest annotations to the clone |
| `copy_policy` | `bool` | `True` | Copy ACLs and ACL bindings to the clone |
| `source_credential` | `dict \| None` | `None` | Credentials for the source catalog (multi-site deployments) |
| `dest_credential` | `dict \| None` | `None` | Credentials for the destination catalog (multi-site deployments) |
| `table_concurrency` | `int` | `1` | Maximum number of tables copied concurrently; lower values reduce server load |
| `include_tables` | `list[str]` | `None` | Extra tables to include (`"schema:table"` format) |
| `exclude_objects` | `list[str]` | `None` | Tables to exclude (`"schema:table"` format) |
| `prune_hidden_fkeys` | `bool` | `False` | Drop FKs whose referenced table is hidden |
| `truncate_oversized` | `bool` | `False` | Truncate values that exceed index size limits |
Asset modes:

- `AssetCopyMode.NONE` — Asset files are not copied and asset table rows are not copied. Use this when you want the schema and non-asset data only.
- `AssetCopyMode.REFERENCES` — Asset table rows are copied with their original URLs intact. Files remain on the source Hatrac server and are still accessible if the source is reachable. Use `localize_assets` later to copy the files.
- `AssetCopyMode.FULL` — Asset files are downloaded from the source and re-uploaded to the destination Hatrac. Fully self-contained but slow for large catalogs.
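For example, a fully self-contained handoff combines `AssetCopyMode.FULL` with the parameters shown earlier (the hostnames and RID below are placeholders):

```python
# Self-contained clone: assets are downloaded from the source Hatrac and
# re-uploaded to the destination at clone time. Slow for large catalogs.
from deriva_ml.catalog.clone import create_ml_workspace, AssetCopyMode, OrphanStrategy

result = create_ml_workspace(
    source_hostname="catalog.example.org",
    source_catalog_id="1",
    root_rid="3-HXMC",
    dest_hostname="local.example.org",
    asset_mode=AssetCopyMode.FULL,
    orphan_strategy=OrphanStrategy.FAIL,
)
```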
## How to choose an orphan-handling strategy
An orphan row is a row whose FK reference points to a row that is absent from the clone — either because the referenced row was hidden by access policy or because it was excluded by the `exclude_objects` filter. `create_ml_workspace` cannot silently ignore orphan rows because FK constraints would reject them.

The `orphan_strategy` parameter controls what happens when an orphan is detected in Stage 3:
### `OrphanStrategy.FAIL` (default)
The clone operation reports an error for every FK violation and leaves the constraint unapplied. The clone is still usable, but missing FK constraints mean the Chaise web interface may not render relationships correctly.

Use `FAIL` when:

- You need to understand the full scope of the referential problems before deciding how to handle them.
- You are cloning from a well-controlled catalog where orphan rows indicate a real problem.
- You want to review the `CloneReport` before deciding whether to re-run with a different strategy.
### `OrphanStrategy.DELETE`
Rows with dangling FK references are deleted from the clone. The resulting catalog is fully coherent — all FK constraints are applied — but some rows from the source are absent from the clone.
Use `DELETE` when:

- You are doing analysis on a subset and do not need the orphan rows.
- The orphan rows are peripheral (metadata for hidden subjects, associations to excluded tables).
- You have reviewed the `CloneReport` from a `FAIL` run and confirmed the deletions are acceptable.
!!! warning
    `DELETE` can cascade. If a table in the middle of a FK chain has orphans deleted, rows in dependent tables that pointed to those deleted rows may also become orphans and be deleted in turn. Use the `CloneReport` to verify what was removed.
### `OrphanStrategy.NULLIFY`
Dangling FK column values are set to `NULL` instead of the whole row being deleted. This preserves all rows but removes the cross-table link for references that have no valid target.

Use `NULLIFY` when:

- The FK column is optional (`nullok=True`) and the relationship is not essential.
- You want to preserve all rows for analysis even when some relationships are broken.
- The orphan references are incidental metadata (e.g., a `Created_By` FK to a user that is not visible).
!!! note
    `NULLIFY` only works when the FK column allows NULL. If the column is required (`nullok=False`), the strategy falls back to the behavior of `FAIL` for that constraint and logs a warning.
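Before committing to `NULLIFY`, you can check which FK columns actually allow NULL. A sketch using deriva-py's model API directly; the schema name is a placeholder, and attribute access follows recent deriva-py versions (older releases may expose FK columns differently), so verify against your installed version:

```python
# Sketch: list required FK columns, for which NULLIFY would fall back to FAIL.
# Uses deriva-py's model API; "myschema" is a placeholder schema name.
from deriva.core import ErmrestCatalog, get_credential

hostname = "catalog.example.org"
catalog = ErmrestCatalog("https", hostname, "1", credentials=get_credential(hostname))
model = catalog.getCatalogModel()

for table in model.schemas["myschema"].tables.values():
    for fkey in table.foreign_keys:
        for col in fkey.foreign_key_columns:
            if not col.nullok:
                print(f"{table.name}.{col.name} is required; "
                      f"NULLIFY would fall back to FAIL here")
```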
**Recommended workflow for unfamiliar catalogs:** Run `create_ml_workspace` once with `orphan_strategy=OrphanStrategy.FAIL`, inspect `result.report.to_text()`, identify the affected tables, then re-run with `DELETE` or `NULLIFY` as appropriate for each table's semantics.
```python
from deriva_ml.catalog.clone import create_ml_workspace, OrphanStrategy

# First pass: fail fast so the CloneReport shows all orphan violations
result = create_ml_workspace(
    source_hostname="catalog.example.org",
    source_catalog_id="1",
    root_rid="3-HXMC",
    dest_hostname="local.example.org",
    orphan_strategy=OrphanStrategy.FAIL,
)

# Inspect what went wrong
print(result.report.to_text())

# Second pass: after reviewing the report, nullify incidental references
# (e.g., Created_By FK to hidden users) and let the clone succeed
result = create_ml_workspace(
    source_hostname="catalog.example.org",
    source_catalog_id="1",
    root_rid="3-HXMC",
    dest_hostname="local.example.org",
    orphan_strategy=OrphanStrategy.NULLIFY,  # NULLs out dangling optional FKs
)
```
**Notes**

- `orphan_strategy` is a single global setting — it applies the same policy to every FK violation in Stage 3. Per-table overrides are not supported; use the `CloneReport` from a `FAIL` run to confirm the chosen strategy is safe for all affected tables.
- `FAIL` does not destroy the partially-created catalog — it leaves FK constraints unapplied. The clone is still usable for data inspection; only the web interface may render relationships incorrectly.
- `NULLIFY` requires that the FK column is marked `nullok=True`. If the column is required, the strategy falls back to `FAIL` for that constraint and logs a warning.
- After a `FAIL` run, the most common recovery path is to re-run with `NULLIFY` (for optional metadata FKs) or `DELETE` (for rows that are genuinely unreachable from the root RID).
- `DELETE` can cascade: removing an orphan row may expose additional orphans in dependent tables. Always review `result.report.to_text()` after a `DELETE` run to verify the scope of removals.
## How to copy assets after a clone
When you clone with `asset_mode=AssetCopyMode.REFERENCES`, asset URLs in the destination catalog still point to the source Hatrac server. This is efficient — no file transfer at clone time — but requires the source server to remain accessible.

When you need the clone to be self-contained, use `localize_assets` to copy files from the source Hatrac to the destination Hatrac and update the catalog URLs:
```python
from deriva_ml.catalog.localize import localize_assets

ml = DerivaML(hostname="local.example.org", catalog_id=result.catalog_id)

# Get the RIDs of all assets you want to localize
pb = ml.pathBuilder()
asset_rids = [
    r["RID"]
    for r in pb.schemas["myschema"].tables["Image"].entities().fetch()
]

localize_result = localize_assets(
    catalog=ml,
    asset_table="Image",
    asset_rids=asset_rids,
    schema_name="myschema",
    source_hostname="catalog.example.org",  # Where the files currently live
    chunk_size=50 * 1024 * 1024,            # 50 MB chunks for large files
)

print(f"Localized {localize_result.assets_processed} assets")
print(f"Skipped {localize_result.assets_skipped}")
print(f"Failed {localize_result.assets_failed}")

if localize_result.errors:
    for err in localize_result.errors:
        print(f"  Error: {err}")
```
`localize_assets` returns a `LocalizeResult` with counts and a `localized_assets` list of `(rid, old_url, new_url)` tuples for audit purposes.
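The `localized_assets` list makes an after-the-fact audit straightforward:

```python
# Audit every URL rewrite performed by localize_assets
for rid, old_url, new_url in localize_result.localized_assets:
    print(f"{rid}: {old_url} -> {new_url}")
```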
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `catalog` | `DerivaML \| ErmrestCatalog` | — | Connected catalog to localize into |
| `asset_table` | `str` | — | Name of the asset table |
| `asset_rids` | `list[str]` | — | RIDs of the assets to copy |
| `schema_name` | `str \| None` | `None` | Schema containing the table; searches all schemas if omitted |
| `source_hostname` | `str \| None` | `None` | Used when URLs are relative (e.g., cloned with `REFERENCES`); if omitted, relative-URL assets are treated as already local and skipped |
| `hatrac_namespace` | `str \| None` | `None` | Destination namespace; defaults to `/hatrac/{asset_table}` |
| `chunk_size` | `int \| None` | `None` | Chunk size in bytes; auto-chunked above 100 MB |
| `dry_run` | `bool` | `False` | Report what would be done without making changes |
**Notes**

- Assets already pointing at the local server are detected by hostname comparison and skipped automatically.
- Relative URLs (e.g., `/hatrac/...` without a hostname) use `source_hostname` if provided; without it, they are treated as already local and skipped with a log message.
- `localize_assets` is optimized for bulk operations: it fetches all asset records in one query and batches the catalog URL updates.
- For very large asset sets, consider processing in chunks (split `asset_rids` into batches, as shown below) to avoid long-running operations that could be interrupted.
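A minimal batching sketch (the batch size is arbitrary; tune it to your network and server load):

```python
# Sketch: process asset RIDs in fixed-size batches so each localize_assets
# call stays short and an interruption loses at most one batch.
from deriva_ml.catalog.localize import localize_assets

BATCH_SIZE = 500  # arbitrary; tune for your deployment

for start in range(0, len(asset_rids), BATCH_SIZE):
    batch = asset_rids[start : start + BATCH_SIZE]
    res = localize_assets(
        catalog=ml,
        asset_table="Image",
        asset_rids=batch,
        schema_name="myschema",
        source_hostname="catalog.example.org",
    )
    print(f"Batch {start // BATCH_SIZE}: "
          f"{res.assets_processed} processed, {res.assets_failed} failed")
```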
## Access control
DerivaML delegates access control entirely to Deriva's policy engine and Globus Auth. A catalog clone inherits the source catalog's ACLs and ACL bindings when `copy_policy=True` (the default). Changing who can read or write a catalog requires using the Deriva web interface or the ERMrest policy API directly — there is no DerivaML Python API for this.
For Globus-based sharing (sharing a catalog with external collaborators using their institutional credentials), refer to the Deriva documentation and the Globus Auth documentation. DerivaML assumes that authentication is already configured when a DerivaML instance is constructed.
## Common pitfalls
!!! warning "create_ml_schema drops all deriva-ml data with CASCADE"
create_ml_schema() checks whether the deriva-ml schema already exists and, if it does, drops it with CASCADE before recreating it. This destroys every Execution, Dataset, Feature, and Workflow record in the schema. Never call create_ml_schema directly on a catalog that already has DerivaML data. The add_ml_schema=True parameter in create_ml_workspace uses an internal guard that checks for existing data before calling create_ml_schema; only use create_ml_schema on a brand-new empty catalog.
!!! warning "OrphanStrategy.DELETE is per-clone, not per-table"
orphan_strategy applies the same strategy to every FK violation encountered during Stage 3. Applying DELETE globally can remove large amounts of data that is legitimate but simply unreachable from the root RID. Always run a FAIL clone first to review the CloneReport, then make a targeted decision before re-running with DELETE.
!!! note "bdbag CLI is required to inspect archives without DerivaML"
Collaborators who receive a .bag.zip archive and do not have DerivaML installed need the bdbag command-line tool (pip install bdbag) to validate and materialize the bag. The archive is a standard BagIt container and can be inspected with any BagIt-compatible tool.
!!! note "Sharing with collaborators who have no Deriva account" BDBag archives and MINID URLs work for collaborators who have no Deriva account and no access policy changes are needed. Reserve catalog clones for collaborators who need to run workflows or browse data interactively — those require a Deriva account and appropriate ACLs on the destination catalog.
## See also
- Working with datasets — `download_dataset_bag`, `estimate_bag_size`, `DatasetSpec` with `timeout`
- Working offline — `DatasetBag`, `restructure_assets`, offline feature access
- Running an experiment — how executions link inputs and outputs for provenance
- Deriva documentation — catalog administration, ACLs, Hatrac