
DerivaML Class

The DerivaML class provides a range of methods to interact with a Deriva catalog. These methods assume that the catalog contains a deriva-ml schema and a domain schema.

Data Catalog: The catalog must include both the domain schema and a standard ML schema for effective data management.

ERD

  • Domain schema: The domain schema includes the data collected or generated by domain-specific experiments or systems.
  • ML schema: Each entity in the ML schema is designed to capture details of the ML development process. It includes the following tables:
    • A Dataset represents a data collection, such as an aggregation of records identified for training, validation, and testing purposes.
    • A Workflow represents a specific sequence of computational steps or human interactions.
    • An Execution is an instance of a workflow that a user instantiates at a specific time.
    • An Execution Asset is an output file that results from the execution of a workflow.
    • An Execution Metadata is an asset entity that stores metadata files referencing a given execution.
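The relationships among these ML-schema entities can be sketched as plain Python dataclasses. This is a conceptual illustration only: the real entities are ERMRest tables in the catalog, and the class and field names below are simplified assumptions, not the actual table definitions.

```python
from dataclasses import dataclass, field
from datetime import datetime

# Conceptual sketch of the deriva-ml schema entities; names are
# illustrative, not the actual ERMRest table definitions.

@dataclass
class Workflow:
    name: str  # a specific sequence of computational steps

@dataclass
class Dataset:
    name: str  # a data collection, e.g. a training split
    members: list[str] = field(default_factory=list)

@dataclass
class Execution:
    workflow: Workflow  # the workflow being instantiated
    started: datetime   # when the user instantiated it
    assets: list[str] = field(default_factory=list)    # output files
    metadata: list[str] = field(default_factory=list)  # metadata files

# One execution of a workflow, producing one output asset.
run = Execution(Workflow("train-model"), datetime.now())
run.assets.append("model_weights.h5")
```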

Core module for DerivaML.

This module provides the primary public interface to DerivaML functionality. It exports the main DerivaML class along with configuration, definitions, and exceptions needed for interacting with Deriva-based ML catalogs.

Key exports
  • DerivaML: Main class for catalog operations and ML workflow management.
  • DerivaMLConfig: Configuration class for DerivaML instances.
  • Exceptions: DerivaMLException and specialized exception types.
  • Definitions: Type definitions, enums, and constants used throughout the package.
Example

from deriva_ml.core import DerivaML, DerivaMLConfig

ml = DerivaML('deriva.example.org', 'my_catalog')
datasets = ml.find_datasets()

BuiltinTypes module-attribute

BuiltinTypes = BuiltinType

Alias for BuiltinType from deriva.core.typed.

This maintains backwards compatibility with existing DerivaML code that uses the plural form 'BuiltinTypes'. New code should use BuiltinType directly.

ColumnDefinition module-attribute

ColumnDefinition = ColumnDef

Alias for ColumnDef from deriva.core.typed.

This maintains backwards compatibility with existing DerivaML code. New code should use ColumnDef directly.

TableDefinition module-attribute

TableDefinition = TableDef

Alias for TableDef from deriva.core.typed.

This maintains backwards compatibility with existing DerivaML code. New code should use TableDef directly.
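The alias pattern behind these module attributes is a simple re-binding at import time. A minimal standalone sketch follows; the class body here is a stand-in for illustration, not the real ColumnDef from deriva.core.typed.

```python
# Stand-in for the real ColumnDef class; illustrative only.
class ColumnDef:
    def __init__(self, name: str, type_name: str):
        self.name = name
        self.type_name = type_name

# Backwards-compatibility alias: old code importing ColumnDefinition
# keeps working, and both names refer to one and the same class.
ColumnDefinition = ColumnDef

col = ColumnDefinition("weight", "float8")
```

Because the alias is the same object, `isinstance` checks and equality of the two names hold; new code should nevertheless use the unaliased name directly.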

DerivaML

Bases: PathBuilderMixin, RidResolutionMixin, VocabularyMixin, WorkflowMixin, FeatureMixin, DatasetMixin, AssetMixin, ExecutionMixin, FileMixin, AnnotationMixin, DerivaMLCatalog

Core class for machine learning operations on a Deriva catalog.

This class provides core functionality for managing ML workflows, features, and datasets in a Deriva catalog. It handles data versioning, feature management, vocabulary control, and execution tracking.

Attributes:

  • host_name (str): Hostname of the Deriva server (e.g., 'deriva.example.org').
  • catalog_id (Union[str, int]): Catalog identifier or name.
  • domain_schema (str): Schema name for domain-specific tables and relationships.
  • model (DerivaModel): ERMRest model for the catalog.
  • working_dir (Path): Directory for storing computation data and results.
  • cache_dir (Path): Directory for caching downloaded datasets.
  • ml_schema (str): Schema name for ML-specific tables (default: 'deriva_ml').
  • configuration (ExecutionConfiguration): Current execution configuration.
  • project_name (str): Name of the current project.
  • start_time (datetime): Timestamp when this instance was created.
  • status (str): Current status of operations.

Example

ml = DerivaML('deriva.example.org', 'my_catalog')
ml.create_feature('my_table', 'new_feature')
ml.add_term('vocabulary_table', 'new_term', description='Description of term')

Source code in src/deriva_ml/core/base.py
class DerivaML(
    PathBuilderMixin,
    RidResolutionMixin,
    VocabularyMixin,
    WorkflowMixin,
    FeatureMixin,
    DatasetMixin,
    AssetMixin,
    ExecutionMixin,
    FileMixin,
    AnnotationMixin,
    DerivaMLCatalog,
):
    """Core class for machine learning operations on a Deriva catalog.

    This class provides core functionality for managing ML workflows, features, and datasets in a Deriva catalog.
    It handles data versioning, feature management, vocabulary control, and execution tracking.

    Attributes:
        host_name (str): Hostname of the Deriva server (e.g., 'deriva.example.org').
        catalog_id (Union[str, int]): Catalog identifier or name.
        domain_schema (str): Schema name for domain-specific tables and relationships.
        model (DerivaModel): ERMRest model for the catalog.
        working_dir (Path): Directory for storing computation data and results.
        cache_dir (Path): Directory for caching downloaded datasets.
        ml_schema (str): Schema name for ML-specific tables (default: 'deriva_ml').
        configuration (ExecutionConfiguration): Current execution configuration.
        project_name (str): Name of the current project.
        start_time (datetime): Timestamp when this instance was created.
        status (str): Current status of operations.

    Example:
        >>> ml = DerivaML('deriva.example.org', 'my_catalog')
        >>> ml.create_feature('my_table', 'new_feature')
        >>> ml.add_term('vocabulary_table', 'new_term', description='Description of term')
    """

    # Class-level type annotations for DerivaMLCatalog protocol compliance
    ml_schema: str
    domain_schemas: frozenset[str]
    default_schema: str | None
    model: DerivaModel
    cache_dir: Path
    working_dir: Path
    catalog: ErmrestCatalog | ErmrestSnapshot
    catalog_id: str | int

    @classmethod
    def instantiate(cls, config: DerivaMLConfig) -> Self:
        """Create a DerivaML instance from a configuration object.

        This method is the preferred way to instantiate DerivaML when using hydra-zen
        for configuration management. It accepts a DerivaMLConfig (Pydantic model) and
        unpacks it to create the instance.

        This pattern allows hydra-zen's `instantiate()` to work with DerivaML:

        Example with hydra-zen:
            >>> from hydra_zen import builds, instantiate
            >>> from deriva_ml import DerivaML
            >>> from deriva_ml.core.config import DerivaMLConfig
            >>>
            >>> # Create a structured config using hydra-zen
            >>> DerivaMLConf = builds(DerivaMLConfig, populate_full_signature=True)
            >>>
            >>> # Configure for your environment
            >>> conf = DerivaMLConf(
            ...     hostname='deriva.example.org',
            ...     catalog_id='42',
            ...     domain_schema='my_domain',
            ... )
            >>>
            >>> # Instantiate the config to get a DerivaMLConfig object
            >>> config = instantiate(conf)
            >>>
            >>> # Create the DerivaML instance
            >>> ml = DerivaML.instantiate(config)

        Args:
            config: A DerivaMLConfig object containing all configuration parameters.

        Returns:
            A new DerivaML instance configured according to the config object.

        Note:
            The DerivaMLConfig class integrates with Hydra's configuration system
            and registers custom resolvers for computing working directories.
            See `deriva_ml.core.config` for details on configuration options.
        """
        return cls(**config.model_dump())

    @classmethod
    def from_context(cls, path: Path | str | None = None) -> Self:
        """Create a DerivaML instance from a .deriva-context.json file.

        Searches for .deriva-context.json starting from ``path`` (default: cwd),
        walking up parent directories. This enables scripts generated by Claude
        to connect to the same catalog without hardcoding connection details.

        The context file is written by the MCP server's ``connect_catalog`` tool
        and contains hostname, catalog_id, and default_schema.

        Args:
            path: Starting directory to search for the context file.
                Defaults to the current working directory.

        Returns:
            A new DerivaML instance configured from the context file.

        Raises:
            FileNotFoundError: If no .deriva-context.json is found.

        Example::

            # In a script generated by Claude:
            from deriva_ml import DerivaML
            ml = DerivaML.from_context()
            subjects = ml.cache_table("Subject")
        """
        import json

        start = Path(path) if path else Path.cwd()
        context_file = _find_context_file(start)
        with open(context_file) as f:
            ctx = json.load(f)

        kwargs: dict[str, Any] = {
            "hostname": ctx["hostname"],
            "catalog_id": ctx["catalog_id"],
        }
        if ctx.get("default_schema"):
            kwargs["default_schema"] = ctx["default_schema"]
        if ctx.get("working_dir"):
            kwargs["working_dir"] = ctx["working_dir"]

        return cls(**kwargs)

    def __init__(
        self,
        hostname: str,
        catalog_id: str | int,
        domain_schemas: str | set[str] | None = None,
        default_schema: str | None = None,
        project_name: str | None = None,
        cache_dir: str | Path | None = None,
        working_dir: str | Path | None = None,
        hydra_runtime_output_dir: str | Path | None = None,
        ml_schema: str = ML_SCHEMA,
        logging_level: int = logging.WARNING,
        deriva_logging_level: int = logging.WARNING,
        credential: dict | None = None,
        s3_bucket: str | None = None,
        use_minid: bool | None = None,
        check_auth: bool = True,
        clean_execution_dir: bool = True,
    ) -> None:
        """Initializes a DerivaML instance.

        This method will connect to a catalog and initialize local configuration for the ML execution.
        This class is intended to be used as a base class on which domain-specific interfaces are built.

        Args:
            hostname: Hostname of the Deriva server.
            catalog_id: Catalog ID. Either an identifier or a catalog name.
            domain_schemas: Optional set of domain schema names. If None, auto-detects all
                non-system schemas. Use this when working with catalogs that have multiple
                user-defined schemas.
            default_schema: The default schema for table creation operations. If None and
                there is exactly one domain schema, that schema is used. If there are multiple
                domain schemas, this must be specified for table creation to work without
                explicit schema parameters.
            ml_schema: Schema name for ML schema. Used if you have a non-standard configuration of deriva-ml.
            project_name: Project name. Defaults to name of default_schema.
            cache_dir: Directory path for caching data downloaded from the Deriva server as bdbag. If not provided,
                will default to working_dir.
            working_dir: Directory path for storing data used by or generated by any computations. If no value is
                provided, will default to ${HOME}/deriva_ml.
            s3_bucket: S3 bucket URL for dataset bag storage (e.g., 's3://my-bucket'). If provided,
                enables MINID creation and S3 upload for dataset exports. If None, MINID functionality
                is disabled regardless of use_minid setting.
            use_minid: Use the MINID service when downloading dataset bags. Only effective when
                s3_bucket is configured. If None (default), automatically set to True when s3_bucket
                is provided, False otherwise.
            check_auth: Check if the user has access to the catalog.
            clean_execution_dir: Whether to automatically clean up execution working directories
                after successful upload. Defaults to True. Set to False to retain local copies.
        """
        # Get or use provided credentials for server access
        self.credential = credential or get_credential(hostname)

        # Initialize server connection and catalog access
        server = DerivaServer(
            "https",
            hostname,
            credentials=self.credential,
            session_config=self._get_session_config(),
        )
        try:
            if check_auth and server.get_authn_session():
                pass
        except Exception:
            raise DerivaMLException(
                "You are not authorized to access this catalog. "
                "Please check your credentials and make sure you have logged in."
            )
        self.catalog = server.connect_ermrest(catalog_id)
        # Import here to avoid circular imports
        from deriva_ml.model.catalog import DerivaModel
        self.model = DerivaModel(
            self.catalog.getCatalogModel(),
            ml_schema=ml_schema,
            domain_schemas=domain_schemas,
            default_schema=default_schema,
        )

        # Store S3 bucket configuration and resolve use_minid
        self.s3_bucket = s3_bucket
        if use_minid is None:
            # Auto mode: enable MINID if s3_bucket is configured
            self.use_minid = s3_bucket is not None
        elif use_minid and s3_bucket is None:
            # User requested MINID but no S3 bucket configured - disable MINID
            self.use_minid = False
        else:
            self.use_minid = use_minid

        # Set up working and cache directories
        # If working_dir is already provided (e.g. from DerivaMLConfig.instantiate()),
        # use it directly; otherwise compute the default path.
        if working_dir is not None:
            self.working_dir = Path(working_dir).absolute()
        else:
            self.working_dir = DerivaMLConfig.compute_workdir(None, catalog_id, hostname)
        self.working_dir.mkdir(parents=True, exist_ok=True)
        self.hydra_runtime_output_dir = hydra_runtime_output_dir

        self.cache_dir = Path(cache_dir) if cache_dir else self.working_dir / "cache"
        self.cache_dir.mkdir(parents=True, exist_ok=True)

        # Set up logging using centralized configuration
        # This configures deriva_ml, Hydra, and deriva-py loggers without
        # affecting the root logger or calling basicConfig()
        self._logger = configure_logging(
            level=logging_level,
            deriva_level=deriva_logging_level,
        )
        self._logging_level = logging_level
        self._deriva_logging_level = deriva_logging_level

        # Apply deriva's default logger overrides for fine-grained control
        apply_logger_overrides(DEFAULT_LOGGER_OVERRIDES)

        # Store instance configuration
        self.host_name = hostname
        self.catalog_id = catalog_id
        self.ml_schema = ml_schema
        self.configuration = None
        self._execution: Execution | None = None
        self.domain_schemas = self.model.domain_schemas
        self.default_schema = self.model.default_schema
        self.project_name = project_name or self.default_schema or "deriva-ml"
        self.start_time = datetime.now()
        self.status = Status.pending.value
        self.clean_execution_dir = clean_execution_dir

    def __del__(self) -> None:
        """Cleanup method to handle incomplete executions."""
        try:
            # Mark execution as aborted if not completed
            if self._execution and self._execution.status != Status.completed:
                self._execution.update_status(Status.aborted, "Execution Aborted")
        except (AttributeError, requests.HTTPError):
            pass

    @staticmethod
    def _get_session_config() -> dict:
        """Returns customized HTTP session configuration.

        Configures retry behavior and connection settings for HTTP requests to the Deriva server. Settings include:
        - Idempotent retry behavior for all HTTP methods
        - Increased retry attempts for read and connect operations
        - Exponential backoff for retries

        Returns:
            dict: Session configuration dictionary with retry and connection settings.

        Example:
            >>> config = DerivaML._get_session_config()
            >>> print(config['retry_read']) # 8
        """
        # Start with a default configuration
        session_config = DEFAULT_SESSION_CONFIG.copy()

        # Customize retry behavior for robustness
        session_config.update(
            {
                # Allow retries for all HTTP methods (PUT/POST are idempotent)
                "allow_retry_on_all_methods": True,
                # Increase retry attempts for better reliability
                "retry_read": 8,
                "retry_connect": 5,
                # Use exponential backoff for retries
                "retry_backoff_factor": 5,
            }
        )
        return session_config

    def is_snapshot(self) -> bool:
        """Check whether this DerivaML instance is connected to a catalog snapshot.

        Returns:
            True if the underlying catalog has a snapshot timestamp, False otherwise.
        """
        return hasattr(self.catalog, "_snaptime")

    def catalog_snapshot(self, version_snapshot: str) -> Self:
        """Return a new DerivaML instance connected to a specific catalog snapshot.

        Catalog snapshots provide a read-only, point-in-time view of the catalog.
        The snapshot identifier is typically obtained from a dataset version record.

        Args:
            version_snapshot: Snapshot identifier string (e.g., ``"2T-SXEH-JH4A"``),
                usually the ``snapshot`` field from a :class:`DatasetHistory` entry.

        Returns:
            A new DerivaML instance connected to the specified catalog snapshot.
        """
        return DerivaML(
            self.host_name,
            version_snapshot,
            logging_level=self._logging_level,
            deriva_logging_level=self._deriva_logging_level,
        )

    @property
    def _dataset_table(self) -> Table:
        return self.model.schemas[self.model.ml_schema].tables["Dataset"]

    # pathBuilder, domain_path, table_path moved to PathBuilderMixin

    def download_dir(self, cached: bool = False) -> Path:
        """Returns the appropriate download directory.

        Provides the appropriate directory path for storing downloaded files, either in the cache or working directory.

        Args:
            cached: If True, returns the cache directory path. If False, returns the working directory path.

        Returns:
            Path: Directory path where downloaded files should be stored.

        Example:
            >>> cache_dir = ml.download_dir(cached=True)
            >>> work_dir = ml.download_dir(cached=False)
        """
        # Return cache directory if cached=True, otherwise working directory
        return self.cache_dir if cached else self.working_dir

    @property
    def working_data(self):
        """Access the working data cache for this catalog.

        Returns a :class:`WorkingDataCache` backed by a SQLite database in
        the working directory. Use this to cache catalog query results
        (tables, denormalized views, feature values) for reuse across scripts.

        Example::

            # Cache a full table
            df = ml.cache_table("Subject")

            # Check what's cached
            ml.working_data.list_tables()

            # Clear the cache
            ml.working_data.clear()
        """
        from deriva_ml.core.working_data import WorkingDataCache

        if not hasattr(self, "_working_data"):
            self._working_data = WorkingDataCache(self.working_dir)
        return self._working_data

    def cache_table(self, table_name: str, force: bool = False) -> "pd.DataFrame":
        """Fetch a table from the catalog and cache locally as SQLite.

        On first call, fetches all rows from the catalog and stores in the
        working data cache. Subsequent calls return the cached data without
        contacting the catalog. Use ``force=True`` to re-fetch.

        Args:
            table_name: Name of the table to fetch (e.g., "Subject", "Image").
            force: If True, re-fetch even if already cached.

        Returns:
            DataFrame with the table contents.

        Example::

            subjects = ml.cache_table("Subject")
            print(f"{len(subjects)} subjects")

            # Second call returns cached data instantly
            subjects = ml.cache_table("Subject")
        """
        import pandas as pd

        if not force and self.working_data.has_table(table_name):
            return self.working_data.read_table(table_name)

        df = self.get_table_as_dataframe(table_name)
        self.working_data.cache_table(table_name, df)
        return df

    def cache_features(
        self,
        table_name: str,
        feature_name: str,
        force: bool = False,
        **kwargs,
    ) -> "pd.DataFrame":
        """Fetch feature values from the catalog and cache locally.

        On first call, fetches all feature values and stores in the working
        data cache. Subsequent calls return cached data.

        Args:
            table_name: Table the feature is attached to (e.g., "Image").
            feature_name: Name of the feature (e.g., "Classification").
            force: If True, re-fetch even if already cached.
            **kwargs: Additional arguments passed to ``fetch_table_features``
                (e.g., ``selector``, ``workflow``, ``execution``).

        Returns:
            DataFrame with feature value records.

        Example::

            labels = ml.cache_features("Image", "Classification")
            print(labels["Diagnosis_Type"].value_counts())
        """
        import pandas as pd

        cache_key = f"features_{table_name}_{feature_name}"
        if not force and self.working_data.has_table(cache_key):
            return self.working_data.read_table(cache_key)

        features = self.fetch_table_features(
            table_name, feature_name=feature_name, **kwargs
        )
        records = [
            r.model_dump(mode="json") for r in features.get(feature_name, [])
        ]
        df = pd.DataFrame(records)
        self.working_data.cache_table(cache_key, df)
        return df

    @staticmethod
    def globus_login(host: str) -> None:
        """Authenticate with Globus to obtain credentials for a Deriva server.

        Initiates a Globus Native Login flow to obtain OAuth2 tokens required
        by the Deriva server.  The flow uses a device-code grant (no browser
        or local server), and stores refresh tokens so that subsequent calls
        can re-authenticate silently.  The BDBag keychain is also updated so
        that bag downloads can use the same credentials.

        If the user is already logged in for the given host, a message is
        printed and no further action is taken.

        Args:
            host: Hostname of the Deriva server to authenticate with
                (e.g., ``"www.eye-ai.org"``).

        Example:
            >>> DerivaML.globus_login('www.eye-ai.org')
            'Login Successful'
        """
        gnl = GlobusNativeLogin(host=host)
        if gnl.is_logged_in([host]):
            print("You are already logged in.")
        else:
            gnl.login(
                [host],
                no_local_server=True,
                no_browser=True,
                refresh_tokens=True,
                update_bdbag_keychain=True,
            )
            print("Login Successful")

    def chaise_url(self, table: RID | Table | str) -> str:
        """Generates Chaise web interface URL.

        Chaise is Deriva's web interface for data exploration. This method creates a URL that directly links to
        the specified table or record.

        Args:
            table: Table to generate URL for (name, Table object, or RID).

        Returns:
            str: URL in format: https://{host}/chaise/recordset/#{catalog}/{schema}:{table}

        Raises:
            DerivaMLException: If table or RID cannot be found.

        Examples:
            Using table name:
                >>> ml.chaise_url("experiment_table")
                'https://deriva.org/chaise/recordset/#1/schema:experiment_table'

            Using RID:
                >>> ml.chaise_url("1-abc123")
        """
        # Get the table object and build base URI
        table_obj = self.model.name_to_table(table)
        try:
            uri = self.catalog.get_server_uri().replace("ermrest/catalog/", "chaise/recordset/#")
        except DerivaMLException:
            # Handle RID case
            uri = self.cite(cast(str, table))
        return f"{uri}/{urlquote(table_obj.schema.name)}:{urlquote(table_obj.name)}"
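The URL construction above amounts to swapping the ERMrest API path for the Chaise recordset path, then appending the percent-quoted `schema:table` pair. A standalone sketch of that transformation, using hypothetical host and catalog values and the stdlib `quote` in place of `urlquote`:

```python
from urllib.parse import quote

def chaise_recordset_url(server_uri: str, schema: str, table: str) -> str:
    # Swap the ERMrest API prefix for the Chaise recordset prefix,
    # then append the percent-quoted schema:table pair.
    base = server_uri.replace("ermrest/catalog/", "chaise/recordset/#")
    return f"{base}/{quote(schema)}:{quote(table)}"

url = chaise_recordset_url("https://deriva.org/ermrest/catalog/1", "schema", "experiment_table")
print(url)  # https://deriva.org/chaise/recordset/#1/schema:experiment_table
```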

    def cite(self, entity: Dict[str, Any] | str, current: bool = False) -> str:
        """Generates citation URL for an entity.

        Creates a URL that can be used to reference a specific entity in the catalog.
        By default, includes the catalog snapshot time to ensure version stability
        (permanent citation). With current=True, returns a URL to the current state.

        Args:
            entity: Either a RID string or a dictionary containing entity data with a 'RID' key.
            current: If True, return URL to current catalog state (no snapshot).
                     If False (default), return permanent citation URL with snapshot time.

        Returns:
            str: Citation URL. Format depends on `current` parameter:
                - current=False: https://{host}/id/{catalog}/{rid}@{snapshot_time}
                - current=True: https://{host}/id/{catalog}/{rid}

        Raises:
            DerivaMLException: If an entity doesn't exist or lacks a RID.

        Examples:
            Permanent citation (default):
                >>> url = ml.cite("1-abc123")
                >>> print(url)
                'https://deriva.org/id/1/1-abc123@2024-01-01T12:00:00'

            Current catalog URL:
                >>> url = ml.cite("1-abc123", current=True)
                >>> print(url)
                'https://deriva.org/id/1/1-abc123'

            Using a dictionary:
                >>> url = ml.cite({"RID": "1-abc123"})
        """
        # Return if already a citation URL
        if isinstance(entity, str) and entity.startswith(f"https://{self.host_name}/id/{self.catalog_id}/"):
            return entity

        try:
            # Resolve RID and create citation URL
            self.resolve_rid(rid := entity if isinstance(entity, str) else entity["RID"])
            base_url = f"https://{self.host_name}/id/{self.catalog_id}/{rid}"
            if current:
                return base_url
            return f"{base_url}@{self.catalog.latest_snapshot().snaptime}"
        except KeyError as e:
            raise DerivaMLException(f"Entity {e} does not have a RID column") from e
        except DerivaMLException:
            raise DerivaMLException("Entity RID does not exist")

    @property
    def catalog_provenance(self) -> "CatalogProvenance | None":
        """Get the provenance information for this catalog.

        Returns provenance information if the catalog has it set. This includes
        information about how the catalog was created (clone, create, schema),
        who created it, when, and any workflow information.

        For cloned catalogs, additional details about the clone operation are
        available in the `clone_details` attribute.

        Returns:
            CatalogProvenance if available, None otherwise.

        Example:
            >>> ml = DerivaML('localhost', '45')
            >>> prov = ml.catalog_provenance
            >>> if prov:
            ...     print(f"Created: {prov.created_at} by {prov.created_by}")
            ...     print(f"Method: {prov.creation_method.value}")
            ...     if prov.is_clone:
            ...         print(f"Cloned from: {prov.clone_details.source_hostname}")
        """
        from deriva_ml.catalog.clone import get_catalog_provenance

        return get_catalog_provenance(self.catalog)

    def user_list(self) -> List[Dict[str, str]]:
        """Returns catalog user list.

        Retrieves basic information about all users who have access to the catalog, including their
        identifiers and full names.

        Returns:
            List[Dict[str, str]]: List of user information dictionaries, each containing:
                - 'ID': User identifier
                - 'Full_Name': User's full name

        Examples:

            >>> users = ml.user_list()
            >>> for user in users:
            ...     print(f"{user['Full_Name']} ({user['ID']})")
        """
        # Get the user table path and fetch basic user info
        user_path = self.pathBuilder().public.ERMrest_Client.path
        return [{"ID": u["ID"], "Full_Name": u["Full_Name"]} for u in user_path.entities().fetch()]

    # resolve_rid, retrieve_rid moved to RidResolutionMixin

    def apply_catalog_annotations(
        self,
        navbar_brand_text: str = "ML Data Browser",
        head_title: str = "Catalog ML",
    ) -> None:
        """Apply catalog-level annotations including the navigation bar and display settings.

        This method configures the Chaise web interface for the catalog. Chaise is Deriva's
        web-based data browser that provides a user-friendly interface for exploring and
        managing catalog data. This method sets up annotations that control how Chaise
        displays and organizes the catalog.

        **Navigation Bar Structure**:
        The method creates a navigation bar with the following menus:
        - **User Info**: Links to Users, Groups, and RID Lease tables
        - **Deriva-ML**: Core ML tables (Workflow, Execution, Dataset, Dataset_Version, etc.)
        - **WWW**: Web content tables (Page, File)
        - **{Domain Schema}**: All domain-specific tables (excludes vocabularies and associations)
        - **Vocabulary**: All controlled vocabulary tables from both ML and domain schemas
        - **Assets**: All asset tables from both ML and domain schemas
        - **Features**: All feature tables with entries named "TableName:FeatureName"
        - **Catalog Registry**: Link to the ermrest registry
        - **Documentation**: Links to ML notebook instructions and Deriva-ML docs

        **Display Settings**:
        - Underscores in table/column names displayed as spaces
        - System columns (RID) shown in compact and entry views
        - Default table set to Dataset
        - Faceting and record deletion enabled
        - Export configurations available to all users

        **Bulk Upload Configuration**:
        Configures upload patterns for asset tables, enabling drag-and-drop file uploads
        through the Chaise interface.

        Call this after creating the domain schema and all tables to initialize the catalog's
        web interface. The navigation menus are dynamically built based on the current schema
        structure, automatically organizing tables into appropriate categories.

        Args:
            navbar_brand_text: Text displayed in the navigation bar brand area.
            head_title: Title displayed in the browser tab.

        Example:
            >>> ml = DerivaML('deriva.example.org', 'my_catalog')
            >>> # After creating domain schema and tables...
            >>> ml.apply_catalog_annotations()
            >>> # Or with custom branding:
            >>> ml.apply_catalog_annotations("My Project Browser", "My ML Project")
        """
        catalog_id = self.model.catalog.catalog_id
        ml_schema = self.ml_schema

        # Build domain schema menu items (one menu per domain schema)
        domain_schema_menus = []
        for domain_schema in sorted(self.domain_schemas):
            if domain_schema not in self.model.schemas:
                continue
            domain_schema_menus.append({
                "name": domain_schema,
                "children": [
                    {
                        "name": tname,
                        "url": f"/chaise/recordset/#{catalog_id}/{domain_schema}:{tname}",
                    }
                    for tname in self.model.schemas[domain_schema].tables
                    # Don't include controlled vocabularies, association tables, or feature tables.
                    if not (
                        self.model.is_vocabulary(tname)
                        or self.model.is_association(tname, pure=False, max_arity=3)
                    )
                ],
            })

        # Build vocabulary menu items (ML schema + all domain schemas)
        vocab_children = [{"name": f"{ml_schema} Vocabularies", "header": True}]
        vocab_children.extend([
            {
                "url": f"/chaise/recordset/#{catalog_id}/{ml_schema}:{tname}",
                "name": tname,
            }
            for tname in self.model.schemas[ml_schema].tables
            if self.model.is_vocabulary(tname)
        ])
        for domain_schema in sorted(self.domain_schemas):
            if domain_schema not in self.model.schemas:
                continue
            vocab_children.append({"name": f"{domain_schema} Vocabularies", "header": True})
            vocab_children.extend([
                {
                    "url": f"/chaise/recordset/#{catalog_id}/{domain_schema}:{tname}",
                    "name": tname,
                }
                for tname in self.model.schemas[domain_schema].tables
                if self.model.is_vocabulary(tname)
            ])

        # Build asset menu items (ML schema + all domain schemas)
        asset_children = [
            {
                "url": f"/chaise/recordset/#{catalog_id}/{ml_schema}:{tname}",
                "name": tname,
            }
            for tname in self.model.schemas[ml_schema].tables
            if self.model.is_asset(tname)
        ]
        for domain_schema in sorted(self.domain_schemas):
            if domain_schema not in self.model.schemas:
                continue
            asset_children.extend([
                {
                    "url": f"/chaise/recordset/#{catalog_id}/{domain_schema}:{tname}",
                    "name": tname,
                }
                for tname in self.model.schemas[domain_schema].tables
                if self.model.is_asset(tname)
            ])

        catalog_annotation = {
            deriva_tags.display: {"name_style": {"underline_space": True}},
            deriva_tags.chaise_config: {
                "headTitle": head_title,
                "navbarBrandText": navbar_brand_text,
                "systemColumnsDisplayEntry": ["RID"],
                "systemColumnsDisplayCompact": ["RID"],
                "defaultTable": {"table": "Dataset", "schema": "deriva-ml"},
                "deleteRecord": True,
                "showFaceting": True,
                "shareCiteAcls": True,
                "exportConfigsSubmenu": {"acls": {"show": ["*"], "enable": ["*"]}},
                "resolverImplicitCatalog": False,
                "navbarMenu": {
                    "newTab": False,
                    "children": [
                        {
                            "name": "User Info",
                            "children": [
                                {
                                    "url": f"/chaise/recordset/#{catalog_id}/public:ERMrest_Client",
                                    "name": "Users",
                                },
                                {
                                    "url": f"/chaise/recordset/#{catalog_id}/public:ERMrest_Group",
                                    "name": "Groups",
                                },
                                {
                                    "url": f"/chaise/recordset/#{catalog_id}/public:ERMrest_RID_Lease",
                                    "name": "ERMrest RID Lease",
                                },
                            ],
                        },
                        {  # All the primary tables in deriva-ml schema.
                            "name": "Deriva-ML",
                            "children": [
                                {
                                    "url": f"/chaise/recordset/#{catalog_id}/{ml_schema}:Workflow",
                                    "name": "Workflow",
                                },
                                {
                                    "url": f"/chaise/recordset/#{catalog_id}/{ml_schema}:Execution",
                                    "name": "Execution",
                                },
                                {
                                    "url": f"/chaise/recordset/#{catalog_id}/{ml_schema}:Execution_Metadata",
                                    "name": "Execution Metadata",
                                },
                                {
                                    "url": f"/chaise/recordset/#{catalog_id}/{ml_schema}:Execution_Asset",
                                    "name": "Execution Asset",
                                },
                                {
                                    "url": f"/chaise/recordset/#{catalog_id}/{ml_schema}:Dataset",
                                    "name": "Dataset",
                                },
                                {
                                    "url": f"/chaise/recordset/#{catalog_id}/{ml_schema}:Dataset_Version",
                                    "name": "Dataset Version",
                                },
                            ],
                        },
                        {  # WWW schema tables.
                            "name": "WWW",
                            "children": [
                                {
                                    "url": f"/chaise/recordset/#{catalog_id}/WWW:Page",
                                    "name": "Page",
                                },
                                {
                                    "url": f"/chaise/recordset/#{catalog_id}/WWW:File",
                                    "name": "File",
                                },
                            ],
                        },
                        *domain_schema_menus,  # One menu per domain schema
                        {  # Vocabulary menu with all controlled vocabularies.
                            "name": "Vocabulary",
                            "children": vocab_children,
                        },
                        {  # List of all asset tables.
                            "name": "Assets",
                            "children": asset_children,
                        },
                        {  # List of all feature tables in the catalog.
                            "name": "Features",
                            "children": [
                                {
                                    "url": f"/chaise/recordset/#{catalog_id}/{f.feature_table.schema.name}:{f.feature_table.name}",
                                    "name": f"{f.target_table.name}:{f.feature_name}",
                                }
                                for f in self.model.find_features()
                            ],
                        },
                        {
                            "url": "/chaise/recordset/#0/ermrest:registry@sort(RID)",
                            "name": "Catalog Registry",
                        },
                        {
                            "name": "Documentation",
                            "children": [
                                {
                                    "url": "https://github.com/informatics-isi-edu/deriva-ml/blob/main/docs/ml_workflow_instruction.md",
                                    "name": "ML Notebook Instruction",
                                },
                                {
                                    "url": "https://informatics-isi-edu.github.io/deriva-ml/",
                                    "name": "Deriva-ML Documentation",
                                },
                            ],
                        },
                    ],
                },
            },
            deriva_tags.bulk_upload: bulk_upload_configuration(model=self.model),
        }
        self.model.annotations.update(catalog_annotation)
        self.model.apply()
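The menu-building pattern used repeatedly above — filter a schema's table names by a predicate, emit one Chaise link dict per surviving table — can be sketched independently. The catalog id, schema name, table names, and the `keep` predicate below are all hypothetical stand-ins:

```python
from typing import Callable

def menu_children(catalog_id: str, schema: str, tables: list[str],
                  keep: Callable[[str], bool]) -> list[dict]:
    # One Chaise recordset link per table that passes the predicate.
    return [
        {"name": t, "url": f"/chaise/recordset/#{catalog_id}/{schema}:{t}"}
        for t in tables
        if keep(t)
    ]

# Stand-in for a vocabulary check such as model.is_vocabulary(tname):
vocab_menu = menu_children("1", "deriva-ml",
                           ["Dataset", "Dataset_Type", "Workflow_Type"],
                           keep=lambda t: t.endswith("_Type"))
```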

    def add_page(self, title: str, content: str) -> None:
        """Adds page to web interface.

        Creates a new page in the catalog's web interface with the specified title and content. The page will be
        accessible through the catalog's navigation system.

        Args:
            title: The title of the page to be displayed in navigation and headers.
            content: The main content of the page. May include HTML markup.

        Raises:
            DerivaMLException: If the page creation fails or the user lacks necessary permissions.

        Example:
            >>> ml.add_page(
            ...     title="Analysis Results",
            ...     content="<h1>Results</h1><p>Analysis completed successfully...</p>"
            ... )
        """
        # Insert page into www tables with title and content
        # Use default schema or first domain schema for www tables
        schema = self.default_schema or (sorted(self.domain_schemas)[0] if self.domain_schemas else None)
        if schema is None:
            raise DerivaMLException("No domain schema available for adding pages")
        self.pathBuilder().www.tables[schema].insert([{"Title": title, "Content": content}])

    def create_vocabulary(
        self, vocab_name: str, comment: str = "", schema: str | None = None, update_navbar: bool = True
    ) -> Table:
        """Creates a controlled vocabulary table.

        A controlled vocabulary table maintains a list of standardized terms and their definitions. Each term can have
        synonyms and descriptions to ensure consistent terminology usage across the dataset.

        Args:
            vocab_name: Name for the new vocabulary table. Must be a valid SQL identifier.
            comment: Description of the vocabulary's purpose and usage. Defaults to an empty string.
            schema: Schema name to create the table in. If None, uses the default schema.
            update_navbar: If True (default), automatically updates the navigation bar to include
                the new vocabulary table. Set to False during batch table creation to avoid
                redundant updates, then call apply_catalog_annotations() once at the end.

        Returns:
            Table: ERMRest table object representing the newly created vocabulary table.

        Raises:
            DerivaMLException: If vocab_name is invalid or already exists.

        Examples:
            Create a vocabulary for tissue types:

                >>> table = ml.create_vocabulary(
                ...     vocab_name="tissue_types",
                ...     comment="Standard tissue classifications",
                ...     schema="bio_schema"
                ... )

            Create multiple vocabularies without updating navbar until the end:

                >>> ml.create_vocabulary("Species", update_navbar=False)
                >>> ml.create_vocabulary("Tissue_Type", update_navbar=False)
                >>> ml.apply_catalog_annotations()  # Update navbar once
        """
        # Use default schema if none specified
        schema = schema or self.model._require_default_schema()

        # Create and return vocabulary table with RID-based URI pattern
        try:
            vocab_table = self.model.schemas[schema].create_table(
                VocabularyTableDef(
                    name=vocab_name,
                    curie_template=f"{self.project_name}:{{RID}}",
                    comment=comment,
                )
            )
        except ValueError:
            raise DerivaMLException(f"Table {vocab_name} already exists")

        # Update navbar to include the new vocabulary table
        if update_navbar:
            self.apply_catalog_annotations()

        return vocab_table

    def create_table(self, table: TableDefinition, schema: str | None = None, update_navbar: bool = True) -> Table:
        """Creates a new table in the domain schema.

        Creates a table using the provided TableDefinition object, which specifies the table structure
        including columns, keys, and foreign key relationships. The table is created in the domain
        schema associated with this DerivaML instance.

        **Required Classes**:
        Import the following classes from deriva_ml to define tables:

        - ``TableDefinition``: Defines the complete table structure
        - ``ColumnDefinition``: Defines individual columns with types and constraints
        - ``KeyDefinition``: Defines unique key constraints (optional)
        - ``ForeignKeyDefinition``: Defines foreign key relationships to other tables (optional)
        - ``BuiltinTypes``: Enum of available column data types

        **Available Column Types** (BuiltinTypes enum):
        ``text``, ``int2``, ``int4``, ``int8``, ``float4``, ``float8``, ``boolean``,
        ``date``, ``timestamp``, ``timestamptz``, ``json``, ``jsonb``, ``markdown``,
        ``ermrest_uri``, ``ermrest_rid``, ``ermrest_rcb``, ``ermrest_rmb``,
        ``ermrest_rct``, ``ermrest_rmt``

        Args:
            table: A TableDefinition object containing the complete specification of the table to create.
            schema: Schema name to create the table in. If None, uses the default schema.
            update_navbar: If True (default), automatically updates the navigation bar to include
                the new table. Set to False during batch table creation to avoid redundant updates,
                then call apply_catalog_annotations() once at the end.

        Returns:
            Table: The newly created ERMRest table object.

        Raises:
            DerivaMLException: If table creation fails or the definition is invalid.

        Examples:
            **Simple table with basic columns**:

                >>> from deriva_ml import TableDefinition, ColumnDefinition, BuiltinTypes
                >>>
                >>> table_def = TableDefinition(
                ...     name="Experiment",
                ...     column_defs=[
                ...         ColumnDefinition(name="Name", type=BuiltinTypes.text, nullok=False),
                ...         ColumnDefinition(name="Date", type=BuiltinTypes.date),
                ...         ColumnDefinition(name="Description", type=BuiltinTypes.markdown),
                ...         ColumnDefinition(name="Score", type=BuiltinTypes.float4),
                ...     ],
                ...     comment="Records of experimental runs"
                ... )
                >>> experiment_table = ml.create_table(table_def)

            **Table with foreign key to another table**:

                >>> from deriva_ml import (
                ...     TableDefinition, ColumnDefinition, ForeignKeyDefinition, BuiltinTypes
                ... )
                >>>
                >>> # Create a Sample table that references Subject
                >>> sample_def = TableDefinition(
                ...     name="Sample",
                ...     column_defs=[
                ...         ColumnDefinition(name="Name", type=BuiltinTypes.text, nullok=False),
                ...         ColumnDefinition(name="Subject", type=BuiltinTypes.text, nullok=False),
                ...         ColumnDefinition(name="Collection_Date", type=BuiltinTypes.date),
                ...     ],
                ...     fkey_defs=[
                ...         ForeignKeyDefinition(
                ...             colnames=["Subject"],
                ...             pk_sname=ml.default_schema,  # Schema of referenced table
                ...             pk_tname="Subject",          # Name of referenced table
                ...             pk_colnames=["RID"],         # Column(s) in referenced table
                ...             on_delete="CASCADE",         # Delete samples when subject deleted
                ...         )
                ...     ],
                ...     comment="Biological samples collected from subjects"
                ... )
                >>> sample_table = ml.create_table(sample_def)

            **Table with unique key constraint**:

                >>> from deriva_ml import (
                ...     TableDefinition, ColumnDefinition, KeyDefinition, BuiltinTypes
                ... )
                >>>
                >>> protocol_def = TableDefinition(
                ...     name="Protocol",
                ...     column_defs=[
                ...         ColumnDefinition(name="Name", type=BuiltinTypes.text, nullok=False),
                ...         ColumnDefinition(name="Version", type=BuiltinTypes.text, nullok=False),
                ...         ColumnDefinition(name="Description", type=BuiltinTypes.markdown),
                ...     ],
                ...     key_defs=[
                ...         KeyDefinition(
                ...             colnames=["Name", "Version"],
                ...             constraint_names=[["myschema", "Protocol_Name_Version_key"]],
                ...             comment="Each protocol name+version must be unique"
                ...         )
                ...     ],
                ...     comment="Experimental protocols with versioning"
                ... )
                >>> protocol_table = ml.create_table(protocol_def)

            **Batch creation without navbar updates**:

                >>> ml.create_table(table1_def, update_navbar=False)
                >>> ml.create_table(table2_def, update_navbar=False)
                >>> ml.create_table(table3_def, update_navbar=False)
                >>> ml.apply_catalog_annotations()  # Update navbar once at the end
        """
        # Use default schema if none specified
        schema = schema or self.model._require_default_schema()

        # Create table in domain schema using provided definition
        # Handle both TableDefinition (dataclass with to_dict) and plain dicts
        table_dict = table.to_dict() if hasattr(table, 'to_dict') else table
        new_table = self.model.schemas[schema].create_table(table_dict)

        # Update navbar to include the new table
        if update_navbar:
            self.apply_catalog_annotations()

        return new_table

    def define_association(
        self,
        associates: list,
        metadata: list | None = None,
        table_name: str | None = None,
        comment: str | None = None,
        **kwargs,
    ) -> dict:
        """Build an association table definition with vocab-aware key selection.

        Creates a table definition that links two or more tables via an association
        (many-to-many) table. Non-vocabulary tables automatically use RID as the
        foreign key target, while vocabulary tables use their Name key.

        Use with ``create_table()`` to create the association table in the catalog.

        Args:
            associates: Tables to associate. Each item can be:
                - A Table object
                - A (name, Table) tuple to customize the column name
                - A (name, nullok, Table) tuple for nullable references
                - A Key object for explicit key selection
            metadata: Additional metadata columns or reference targets.
            table_name: Name for the association table. Auto-generated if omitted.
            comment: Comment for the association table.
            **kwargs: Additional arguments passed to Table.define_association.

        Returns:
            Table definition dict suitable for ``create_table()``.

        Example::

            # Associate Image with Subject (many-to-many)
            image_table = ml.model.name_to_table("Image")
            subject_table = ml.model.name_to_table("Subject")
            assoc_def = ml.define_association(
                associates=[image_table, subject_table],
                comment="Links images to subjects",
            )
            ml.create_table(assoc_def)
        """
        return self.model._define_association(
            associates=associates,
            metadata=metadata,
            table_name=table_name,
            comment=comment,
            **kwargs,
        )

    # =========================================================================
    # Cache and Directory Management
    # =========================================================================

    def clear_cache(self, older_than_days: int | None = None) -> dict[str, int]:
        """Clear the dataset cache directory.

        Removes cached dataset bags from the cache directory. Can optionally filter
        by age to only remove old cache entries.

        Args:
            older_than_days: If provided, only remove cache entries older than this
                many days. If None, removes all cache entries.

        Returns:
            dict with keys:
                - 'files_removed': Number of files removed
                - 'dirs_removed': Number of directories removed
                - 'bytes_freed': Total bytes freed
                - 'errors': Number of removal errors

        Example:
            >>> ml = DerivaML('deriva.example.org', 'my_catalog')
            >>> # Clear all cache
            >>> result = ml.clear_cache()
            >>> print(f"Freed {result['bytes_freed'] / 1e6:.1f} MB")
            >>>
            >>> # Clear cache older than 7 days
            >>> result = ml.clear_cache(older_than_days=7)
        """
        import shutil
        import time

        stats = {'files_removed': 0, 'dirs_removed': 0, 'bytes_freed': 0, 'errors': 0}

        if not self.cache_dir.exists():
            return stats

        cutoff_time = None
        if older_than_days is not None:
            cutoff_time = time.time() - (older_than_days * 24 * 60 * 60)

        try:
            for entry in self.cache_dir.iterdir():
                try:
                    # Check age if filtering
                    if cutoff_time is not None:
                        entry_mtime = entry.stat().st_mtime
                        if entry_mtime > cutoff_time:
                            continue  # Skip recent entries

                    # Calculate size before removal
                    if entry.is_dir():
                        entry_size = sum(f.stat().st_size for f in entry.rglob('*') if f.is_file())
                        shutil.rmtree(entry)
                        stats['dirs_removed'] += 1
                    else:
                        entry_size = entry.stat().st_size
                        entry.unlink()
                        stats['files_removed'] += 1

                    stats['bytes_freed'] += entry_size
                except (OSError, PermissionError) as e:
                    self._logger.warning(f"Failed to remove cache entry {entry}: {e}")
                    stats['errors'] += 1

        except OSError as e:
            self._logger.error(f"Failed to iterate cache directory: {e}")
            stats['errors'] += 1

        return stats
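The age filter above reduces to one mtime comparison against a cutoff timestamp (`now - days * 86400`). A self-contained sketch of that logic over a throwaway directory, with hypothetical file names standing in for cached bags:

```python
import os
import tempfile
import time
from pathlib import Path

def remove_older_than(root: Path, older_than_days: float) -> int:
    # Remove plain files whose mtime falls before the cutoff; return count removed.
    cutoff = time.time() - older_than_days * 24 * 60 * 60
    removed = 0
    for entry in root.iterdir():
        if entry.is_file() and entry.stat().st_mtime <= cutoff:
            entry.unlink()
            removed += 1
    return removed

with tempfile.TemporaryDirectory() as d:
    old = Path(d) / "old.bag"
    new = Path(d) / "new.bag"
    old.write_text("x")
    new.write_text("y")
    # Backdate one entry by ten days so only it passes the 7-day cutoff.
    ten_days_ago = time.time() - 10 * 24 * 60 * 60
    os.utime(old, (ten_days_ago, ten_days_ago))
    removed = remove_older_than(Path(d), older_than_days=7)
print(removed)  # 1
```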

    def get_cache_size(self) -> dict[str, int | float]:
        """Get the current size of the cache directory.

        Returns:
            dict with keys:
                - 'total_bytes': Total size in bytes
                - 'total_mb': Total size in megabytes
                - 'file_count': Number of files
                - 'dir_count': Number of directories

        Example:
            >>> ml = DerivaML('deriva.example.org', 'my_catalog')
            >>> size = ml.get_cache_size()
            >>> print(f"Cache size: {size['total_mb']:.1f} MB ({size['file_count']} files)")
        """
        stats = {'total_bytes': 0, 'total_mb': 0.0, 'file_count': 0, 'dir_count': 0}

        if not self.cache_dir.exists():
            return stats

        for entry in self.cache_dir.rglob('*'):
            if entry.is_file():
                stats['total_bytes'] += entry.stat().st_size
                stats['file_count'] += 1
            elif entry.is_dir():
                stats['dir_count'] += 1

        stats['total_mb'] = stats['total_bytes'] / (1024 * 1024)
        return stats
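The size accounting is a straight recursive `rglob` walk, tallying file bytes and counting files and directories separately. A minimal sketch over a temporary directory (file names and contents are arbitrary):

```python
import tempfile
from pathlib import Path

def dir_stats(root: Path) -> dict:
    # Tally file bytes and file/dir counts, mirroring the walk above.
    stats = {"total_bytes": 0, "file_count": 0, "dir_count": 0}
    for entry in root.rglob("*"):
        if entry.is_file():
            stats["total_bytes"] += entry.stat().st_size
            stats["file_count"] += 1
        elif entry.is_dir():
            stats["dir_count"] += 1
    stats["total_mb"] = stats["total_bytes"] / (1024 * 1024)
    return stats

with tempfile.TemporaryDirectory() as d:
    sub = Path(d) / "bag"
    sub.mkdir()
    (sub / "a.txt").write_text("12345")    # 5 bytes
    (Path(d) / "b.txt").write_text("123")  # 3 bytes
    stats = dir_stats(Path(d))
print(stats["total_bytes"], stats["file_count"], stats["dir_count"])  # 8 2 1
```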

    def list_execution_dirs(self) -> list[dict[str, Any]]:
        """List execution working directories.

        Returns information about each execution directory in the working directory,
        useful for identifying orphaned or incomplete execution outputs.

        Returns:
            List of dicts, each containing:
                - 'execution_rid': The execution RID (directory name)
                - 'path': Full path to the directory
                - 'size_bytes': Total size in bytes
                - 'size_mb': Total size in megabytes
                - 'modified': Last modification time (datetime)
                - 'file_count': Number of files

        Example:
            >>> ml = DerivaML('deriva.example.org', 'my_catalog')
            >>> dirs = ml.list_execution_dirs()
            >>> for d in dirs:
            ...     print(f"{d['execution_rid']}: {d['size_mb']:.1f} MB")
        """
        from datetime import datetime

        from deriva_ml.dataset.upload import upload_root

        results = []
        exec_root = upload_root(self.working_dir) / "execution"

        if not exec_root.exists():
            return results

        for entry in exec_root.iterdir():
            if entry.is_dir():
                size_bytes = sum(f.stat().st_size for f in entry.rglob('*') if f.is_file())
                file_count = sum(1 for f in entry.rglob('*') if f.is_file())
                mtime = datetime.fromtimestamp(entry.stat().st_mtime)

                results.append({
                    'execution_rid': entry.name,
                    'path': str(entry),
                    'size_bytes': size_bytes,
                    'size_mb': size_bytes / (1024 * 1024),
                    'modified': mtime,
                    'file_count': file_count,
                })

        return sorted(results, key=lambda x: x['modified'], reverse=True)

    def clean_execution_dirs(
        self,
        older_than_days: int | None = None,
        exclude_rids: list[str] | None = None,
    ) -> dict[str, int]:
        """Clean up execution working directories.

        Removes execution output directories from the local working directory.
        Use this to free up disk space from completed or orphaned executions.

        Args:
            older_than_days: If provided, only remove directories older than this
                many days. If None, removes all execution directories (except excluded).
            exclude_rids: List of execution RIDs to preserve (never remove).

        Returns:
            dict with keys:
                - 'dirs_removed': Number of directories removed
                - 'bytes_freed': Total bytes freed
                - 'errors': Number of removal errors

        Example:
            >>> ml = DerivaML('deriva.example.org', 'my_catalog')
            >>> # Clean all execution dirs older than 30 days
            >>> result = ml.clean_execution_dirs(older_than_days=30)
            >>> print(f"Freed {result['bytes_freed'] / 1e9:.2f} GB")
            >>>
            >>> # Clean all except specific executions
            >>> result = ml.clean_execution_dirs(exclude_rids=['1-ABC', '1-DEF'])
        """
        import shutil
        import time

        from deriva_ml.dataset.upload import upload_root

        stats = {'dirs_removed': 0, 'bytes_freed': 0, 'errors': 0}
        exclude_rids = set(exclude_rids or [])

        exec_root = upload_root(self.working_dir) / "execution"
        if not exec_root.exists():
            return stats

        cutoff_time = None
        if older_than_days is not None:
            cutoff_time = time.time() - (older_than_days * 24 * 60 * 60)

        for entry in exec_root.iterdir():
            if not entry.is_dir():
                continue

            # Skip excluded RIDs
            if entry.name in exclude_rids:
                continue

            try:
                # Check age if filtering
                if cutoff_time is not None:
                    entry_mtime = entry.stat().st_mtime
                    if entry_mtime > cutoff_time:
                        continue

                # Calculate size before removal
                entry_size = sum(f.stat().st_size for f in entry.rglob('*') if f.is_file())
                shutil.rmtree(entry)
                stats['dirs_removed'] += 1
                stats['bytes_freed'] += entry_size

            except (OSError, PermissionError) as e:
                self._logger.warning(f"Failed to remove execution dir {entry}: {e}")
                stats['errors'] += 1

        return stats

    def get_storage_summary(self) -> dict[str, Any]:
        """Get a summary of local storage usage.

        Returns:
            dict with keys:
                - 'working_dir': Path to working directory
                - 'cache_dir': Path to cache directory
                - 'cache_size_mb': Cache size in MB
                - 'cache_file_count': Number of files in cache
                - 'execution_dir_count': Number of execution directories
                - 'execution_size_mb': Total size of execution directories in MB
                - 'total_size_mb': Combined size in MB

        Example:
            >>> ml = DerivaML('deriva.example.org', 'my_catalog')
            >>> summary = ml.get_storage_summary()
            >>> print(f"Total storage: {summary['total_size_mb']:.1f} MB")
            >>> print(f"  Cache: {summary['cache_size_mb']:.1f} MB")
            >>> print(f"  Executions: {summary['execution_size_mb']:.1f} MB")
        """
        cache_stats = self.get_cache_size()
        exec_dirs = self.list_execution_dirs()

        exec_size_mb = sum(d['size_mb'] for d in exec_dirs)

        return {
            'working_dir': str(self.working_dir),
            'cache_dir': str(self.cache_dir),
            'cache_size_mb': cache_stats['total_mb'],
            'cache_file_count': cache_stats['file_count'],
            'execution_dir_count': len(exec_dirs),
            'execution_size_mb': exec_size_mb,
            'total_size_mb': cache_stats['total_mb'] + exec_size_mb,
        }

    # =========================================================================
    # Schema Validation
    # =========================================================================

    def validate_schema(self, strict: bool = False) -> "SchemaValidationReport":
        """Validate that the catalog's ML schema matches the expected structure.

        This method inspects the catalog schema and verifies that it contains all
        the required tables, columns, vocabulary terms, and relationships that are
        created by the ML schema initialization routines in create_schema.py.

        The validation checks:
        - All required ML tables exist (Dataset, Execution, Workflow, etc.)
        - All required columns exist with correct types
        - All required vocabulary tables exist (Asset_Type, Dataset_Type, etc.)
        - All required vocabulary terms are initialized
        - All association tables exist for relationships

        In strict mode, the validator also reports errors for:
        - Extra tables not in the expected schema
        - Extra columns not in the expected table definitions

        Args:
            strict: If True, extra tables and columns are reported as errors.
                   If False (default), they are reported as informational items.
                   Use strict=True to verify a clean ML catalog matches exactly.
                   Use strict=False to validate a catalog that may have domain extensions.

        Returns:
            SchemaValidationReport with validation results. Key attributes:
                - is_valid: True if no errors were found
                - errors: List of error-level issues
                - warnings: List of warning-level issues
                - info: List of informational items
                - to_text(): Human-readable report
                - to_dict(): JSON-serializable dictionary

        Example:
            >>> ml = DerivaML('localhost', 'my_catalog')
            >>> report = ml.validate_schema(strict=False)
            >>> if report.is_valid:
            ...     print("Schema is valid!")
            ... else:
            ...     print(report.to_text())

            >>> # Strict validation for a fresh ML catalog
            >>> report = ml.validate_schema(strict=True)
            >>> print(f"Found {len(report.errors)} errors, {len(report.warnings)} warnings")

            >>> # Get report as dictionary for JSON/logging
            >>> import json
            >>> print(json.dumps(report.to_dict(), indent=2))

        Note:
            This method validates the ML schema (typically 'deriva-ml'), not the
            domain schema. Domain-specific tables and columns are not checked
            unless they are part of the ML schema itself.

        See Also:
            - deriva_ml.schema.validation.SchemaValidationReport
            - deriva_ml.schema.validation.validate_ml_schema
        """
        from deriva_ml.schema.validation import validate_ml_schema
        return validate_ml_schema(self, strict=strict)

catalog_provenance property

catalog_provenance: CatalogProvenance | None

Get the provenance information for this catalog.

Returns provenance information if the catalog has it set. This includes information about how the catalog was created (clone, create, schema), who created it, when, and any workflow information.

For cloned catalogs, additional details about the clone operation are available in the clone_details attribute.

Returns:

    CatalogProvenance | None: CatalogProvenance if available, None otherwise.

Example

>>> ml = DerivaML('localhost', '45')
>>> prov = ml.catalog_provenance
>>> if prov:
...     print(f"Created: {prov.created_at} by {prov.created_by}")
...     print(f"Method: {prov.creation_method.value}")
...     if prov.is_clone:
...         print(f"Cloned from: {prov.clone_details.source_hostname}")

working_data property

working_data

Access the working data cache for this catalog.

Returns a WorkingDataCache backed by a SQLite database in the working directory. Use this to cache catalog query results (tables, denormalized views, feature values) for reuse across scripts.

Example::

# Cache a full table
df = ml.cache_table("Subject")

# Check what's cached
ml.working_data.list_tables()

# Clear the cache
ml.working_data.clear()

__del__

__del__() -> None

Cleanup method to handle incomplete executions.

Source code in src/deriva_ml/core/base.py
def __del__(self) -> None:
    """Cleanup method to handle incomplete executions."""
    try:
        # Mark execution as aborted if not completed
        if self._execution and self._execution.status != Status.completed:
            self._execution.update_status(Status.aborted, "Execution Aborted")
    except (AttributeError, requests.HTTPError):
        pass

__init__

__init__(
    hostname: str,
    catalog_id: str | int,
    domain_schemas: str | set[str] | None = None,
    default_schema: str | None = None,
    project_name: str | None = None,
    cache_dir: str | Path | None = None,
    working_dir: str | Path | None = None,
    hydra_runtime_output_dir: str | Path | None = None,
    ml_schema: str = ML_SCHEMA,
    logging_level: int = logging.WARNING,
    deriva_logging_level: int = logging.WARNING,
    credential: dict | None = None,
    s3_bucket: str | None = None,
    use_minid: bool | None = None,
    check_auth: bool = True,
    clean_execution_dir: bool = True,
) -> None

Initializes a DerivaML instance.

This method will connect to a catalog and initialize local configuration for the ML execution. This class is intended to be used as a base class on which domain-specific interfaces are built.

Parameters:

    hostname (str, required): Hostname of the Deriva server.
    catalog_id (str | int, required): Catalog ID. Either an identifier or a catalog name.
    domain_schemas (str | set[str] | None, default None): Optional set of domain schema names. If None, auto-detects all non-system schemas. Use this when working with catalogs that have multiple user-defined schemas.
    default_schema (str | None, default None): The default schema for table creation operations. If None and there is exactly one domain schema, that schema is used. If there are multiple domain schemas, this must be specified for table creation to work without explicit schema parameters.
    ml_schema (str, default ML_SCHEMA): Schema name for the ML schema. Use this if you have a non-standard configuration of deriva-ml.
    project_name (str | None, default None): Project name. Defaults to the name of default_schema.
    cache_dir (str | Path | None, default None): Directory path for caching data downloaded from the Deriva server as a bdbag. If not provided, defaults to working_dir.
    working_dir (str | Path | None, default None): Directory path for storing data used by or generated by any computations. If not provided, defaults to ${HOME}/deriva_ml.
    s3_bucket (str | None, default None): S3 bucket URL for dataset bag storage (e.g., 's3://my-bucket'). If provided, enables MINID creation and S3 upload for dataset exports. If None, MINID functionality is disabled regardless of the use_minid setting.
    use_minid (bool | None, default None): Use the MINID service when downloading dataset bags. Only effective when s3_bucket is configured. If None (default), automatically set to True when s3_bucket is provided, False otherwise.
    check_auth (bool, default True): Check whether the user has access to the catalog.
    clean_execution_dir (bool, default True): Whether to automatically clean up execution working directories after successful upload. Set to False to retain local copies.
Source code in src/deriva_ml/core/base.py
def __init__(
    self,
    hostname: str,
    catalog_id: str | int,
    domain_schemas: str | set[str] | None = None,
    default_schema: str | None = None,
    project_name: str | None = None,
    cache_dir: str | Path | None = None,
    working_dir: str | Path | None = None,
    hydra_runtime_output_dir: str | Path | None = None,
    ml_schema: str = ML_SCHEMA,
    logging_level: int = logging.WARNING,
    deriva_logging_level: int = logging.WARNING,
    credential: dict | None = None,
    s3_bucket: str | None = None,
    use_minid: bool | None = None,
    check_auth: bool = True,
    clean_execution_dir: bool = True,
) -> None:
    """Initializes a DerivaML instance.

    This method will connect to a catalog and initialize local configuration for the ML execution.
    This class is intended to be used as a base class on which domain-specific interfaces are built.

    Args:
        hostname: Hostname of the Deriva server.
        catalog_id: Catalog ID. Either an identifier or a catalog name.
        domain_schemas: Optional set of domain schema names. If None, auto-detects all
            non-system schemas. Use this when working with catalogs that have multiple
            user-defined schemas.
        default_schema: The default schema for table creation operations. If None and
            there is exactly one domain schema, that schema is used. If there are multiple
            domain schemas, this must be specified for table creation to work without
            explicit schema parameters.
        ml_schema: Schema name for ML schema. Used if you have a non-standard configuration of deriva-ml.
        project_name: Project name. Defaults to name of default_schema.
        cache_dir: Directory path for caching data downloaded from the Deriva server as bdbag. If not provided,
            will default to working_dir.
        working_dir: Directory path for storing data used by or generated by any computations. If no value is
            provided, will default to  ${HOME}/deriva_ml
        s3_bucket: S3 bucket URL for dataset bag storage (e.g., 's3://my-bucket'). If provided,
            enables MINID creation and S3 upload for dataset exports. If None, MINID functionality
            is disabled regardless of use_minid setting.
        use_minid: Use the MINID service when downloading dataset bags. Only effective when
            s3_bucket is configured. If None (default), automatically set to True when s3_bucket
            is provided, False otherwise.
        check_auth: Check if the user has access to the catalog.
        clean_execution_dir: Whether to automatically clean up execution working directories
            after successful upload. Defaults to True. Set to False to retain local copies.
    """
    # Get or use provided credentials for server access
    self.credential = credential or get_credential(hostname)

    # Initialize server connection and catalog access
    server = DerivaServer(
        "https",
        hostname,
        credentials=self.credential,
        session_config=self._get_session_config(),
    )
    try:
        if check_auth and server.get_authn_session():
            pass
    except Exception:
        raise DerivaMLException(
            "You are not authorized to access this catalog. "
            "Please check your credentials and make sure you have logged in."
        )
    self.catalog = server.connect_ermrest(catalog_id)
    # Import here to avoid circular imports
    from deriva_ml.model.catalog import DerivaModel
    self.model = DerivaModel(
        self.catalog.getCatalogModel(),
        ml_schema=ml_schema,
        domain_schemas=domain_schemas,
        default_schema=default_schema,
    )

    # Store S3 bucket configuration and resolve use_minid
    self.s3_bucket = s3_bucket
    if use_minid is None:
        # Auto mode: enable MINID if s3_bucket is configured
        self.use_minid = s3_bucket is not None
    elif use_minid and s3_bucket is None:
        # User requested MINID but no S3 bucket configured - disable MINID
        self.use_minid = False
    else:
        self.use_minid = use_minid

    # Set up working and cache directories
    # If working_dir is already provided (e.g. from DerivaMLConfig.instantiate()),
    # use it directly; otherwise compute the default path.
    if working_dir is not None:
        self.working_dir = Path(working_dir).absolute()
    else:
        self.working_dir = DerivaMLConfig.compute_workdir(None, catalog_id, hostname)
    self.working_dir.mkdir(parents=True, exist_ok=True)
    self.hydra_runtime_output_dir = hydra_runtime_output_dir

    self.cache_dir = Path(cache_dir) if cache_dir else self.working_dir / "cache"
    self.cache_dir.mkdir(parents=True, exist_ok=True)

    # Set up logging using centralized configuration
    # This configures deriva_ml, Hydra, and deriva-py loggers without
    # affecting the root logger or calling basicConfig()
    self._logger = configure_logging(
        level=logging_level,
        deriva_level=deriva_logging_level,
    )
    self._logging_level = logging_level
    self._deriva_logging_level = deriva_logging_level

    # Apply deriva's default logger overrides for fine-grained control
    apply_logger_overrides(DEFAULT_LOGGER_OVERRIDES)

    # Store instance configuration
    self.host_name = hostname
    self.catalog_id = catalog_id
    self.ml_schema = ml_schema
    self.configuration = None
    self._execution: Execution | None = None
    self.domain_schemas = self.model.domain_schemas
    self.default_schema = self.model.default_schema
    self.project_name = project_name or self.default_schema or "deriva-ml"
    self.start_time = datetime.now()
    self.status = Status.pending.value
    self.clean_execution_dir = clean_execution_dir

add_dataset_element_type

add_dataset_element_type(element: str | Table) -> Table

Makes it possible to add objects from the specified table to a dataset.

A dataset is a heterogeneous collection of objects, each of which comes from a different table. This routine adds the specified table as a valid element type for datasets.

Parameters:

    element (str | Table, required): Name of the table, or the table object, that is to be added to the dataset.

Returns:

    Table: The table object that was added to the dataset.

Source code in src/deriva_ml/core/mixins/dataset.py
@validate_call(config=ConfigDict(arbitrary_types_allowed=True))
def add_dataset_element_type(self, element: str | Table) -> Table:
    """Makes it possible to add objects from the specified table to a dataset.

    A dataset is a heterogeneous collection of objects, each of which comes from a different table.
    This routine adds the specified table as a valid element type for datasets.

    Args:
        element: Name of the table or table object that is to be added to the dataset.

    Returns:
        The table object that was added to the dataset.
    """
    # Import here to avoid circular imports
    from deriva_ml.dataset.catalog_graph import CatalogGraph

    # Add table to map.
    element_table = self.model.name_to_table(element)
    atable_def = self.model._define_association(
        associates=[self._dataset_table, element_table],
    )
    try:
        table = self.model.create_table(atable_def)
    except ValueError as e:
        if "already exists" in str(e):
            table = self.model.name_to_table(atable_def["table_name"])
        else:
            raise e

    # self.model = self.catalog.getCatalogModel()
    annotations = CatalogGraph(self, s3_bucket=self.s3_bucket, use_minid=self.use_minid).generate_dataset_download_annotations()  # type: ignore[arg-type]
    self._dataset_table.annotations.update(annotations)
    self.model.model.apply()
    return table
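`add_dataset_element_type` tolerates re-runs by catching the "already exists" ValueError from `create_table` and looking up the existing association table instead. That get-or-create pattern, isolated with stand-in callables (the helper names and the registry here are illustrative, not DerivaML API):

```python
def get_or_create(create, lookup, name: str):
    """Create `name`, falling back to a lookup if it already exists."""
    try:
        return create(name)
    except ValueError as e:
        if "already exists" in str(e):
            return lookup(name)
        raise  # unrelated errors propagate unchanged


# Toy stand-ins for model.create_table / model.name_to_table:
registry: dict[str, str] = {}


def create(name: str) -> str:
    if name in registry:
        raise ValueError(f"table {name} already exists")
    registry[name] = f"table:{name}"
    return registry[name]


def lookup(name: str) -> str:
    return registry[name]


print(get_or_create(create, lookup, "Dataset_Image"))  # → table:Dataset_Image
print(get_or_create(create, lookup, "Dataset_Image"))  # second call hits the fallback, → table:Dataset_Image
```

This is why calling `add_dataset_element_type` twice for the same table is safe: the second call resolves to the existing association table rather than failing.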

add_features

add_features(features: list[FeatureRecord]) -> int

Add feature values to the catalog in batch.

Inserts a list of FeatureRecord instances into the appropriate feature table. All records must be from the same feature (i.e., created by the same feature_record_class()). Records are batch-inserted for efficiency.

Parameters:

    features (list[FeatureRecord], required): List of FeatureRecord instances to insert. All must share the same feature definition (same feature class variable). Create records using the class returned by Feature.feature_record_class().

Returns:

    int: Number of feature records inserted.

Raises:

    ValueError: If the features list is empty.

Example

>>> feature = ml.lookup_feature("Image", "Classification")
>>> RecordClass = feature.feature_record_class()
>>> records = [
...     RecordClass(Image="1-ABC", Image_Class="Normal", Execution=exe_rid),
...     RecordClass(Image="1-DEF", Image_Class="Abnormal", Execution=exe_rid),
... ]
>>> count = ml.add_features(records)
>>> print(f"Inserted {count} feature values")

Source code in src/deriva_ml/core/mixins/feature.py
@validate_call(config=ConfigDict(arbitrary_types_allowed=True))
def add_features(self, features: list[FeatureRecord]) -> int:
    """Add feature values to the catalog in batch.

    Inserts a list of FeatureRecord instances into the appropriate feature table.
    All records must be from the same feature (i.e., created by the same
    ``feature_record_class()``). Records are batch-inserted for efficiency.

    Args:
        features: List of FeatureRecord instances to insert. All must share
            the same feature definition (same ``feature`` class variable).
            Create records using the class returned by
            ``Feature.feature_record_class()``.

    Returns:
        Number of feature records inserted.

    Raises:
        ValueError: If features list is empty.

    Example:
        >>> feature = ml.lookup_feature("Image", "Classification")
        >>> RecordClass = feature.feature_record_class()
        >>> records = [
        ...     RecordClass(Image="1-ABC", Image_Class="Normal", Execution=exe_rid),
        ...     RecordClass(Image="1-DEF", Image_Class="Abnormal", Execution=exe_rid),
        ... ]
        >>> count = ml.add_features(records)
        >>> print(f"Inserted {count} feature values")
    """
    if not features:
        raise ValueError("features list must not be empty")

    feature_table = features[0].feature.feature_table
    feature_path = self.pathBuilder().schemas[feature_table.schema.name].tables[feature_table.name]
    entries = feature_path.insert([f.model_dump() for f in features])
    return len(list(entries))
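`add_features` requires every record to come from the same feature definition: the code takes the target table from `features[0]` and batch-inserts the whole list. A hedged sketch of an up-front guard one might run before calling it (`all_same_feature` and `FakeRecord` are illustrative, not DerivaML API):

```python
from typing import Any, Sequence


def all_same_feature(records: Sequence[Any]) -> bool:
    """True when every record's `feature` attribute matches the first's."""
    if not records:
        return False
    first = records[0].feature
    return all(r.feature is first for r in records)


# Minimal stand-in for a FeatureRecord carrying a `feature` class variable:
class FakeRecord:
    def __init__(self, feature: object) -> None:
        self.feature = feature


f1, f2 = object(), object()
print(all_same_feature([FakeRecord(f1), FakeRecord(f1)]))  # → True
print(all_same_feature([FakeRecord(f1), FakeRecord(f2)]))  # → False
```

Mixing records from two different features would otherwise be inserted into whichever feature table `features[0]` points at, so a check like this fails fast.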

add_files

add_files(
    files: Iterable[FileSpec],
    execution_rid: RID,
    dataset_types: str | list[str] | None = None,
    description: str = "",
) -> "Dataset"

Adds files to the catalog with their metadata.

Registers files in the catalog along with their metadata (MD5, length, URL) and associates them with specified file types. Links files to the specified execution record for provenance tracking.

Parameters:

    files (Iterable[FileSpec], required): File specifications containing MD5 checksum, length, and URL.
    execution_rid (RID, required): Execution RID to associate files with (required for provenance).
    dataset_types (str | list[str] | None, default None): One or more dataset type terms from the File_Type vocabulary.
    description (str, default ''): Description of the files.

Returns:

    Dataset: Dataset that represents the newly added files.

Raises:

    DerivaMLException: If file_types are invalid or execution_rid is not an execution record.

Examples:

Add files via an execution:

    >>> with ml.create_execution(config) as exe:
    ...     files = [FileSpec(url="path/to/file.txt", md5="abc123", length=1000)]
    ...     dataset = exe.add_files(files, dataset_types="text")

Source code in src/deriva_ml/core/mixins/file.py
def add_files(
    self,
    files: Iterable[FileSpec],
    execution_rid: RID,
    dataset_types: str | list[str] | None = None,
    description: str = "",
) -> "Dataset":
    """Adds files to the catalog with their metadata.

    Registers files in the catalog along with their metadata (MD5, length, URL) and associates them with
    specified file types. Links files to the specified execution record for provenance tracking.

    Args:
        files: File specifications containing MD5 checksum, length, and URL.
        execution_rid: Execution RID to associate files with (required for provenance).
        dataset_types: One or more dataset type terms from File_Type vocabulary.
        description: Description of the files.

    Returns:
        Dataset: Dataset that represents the newly added files.

    Raises:
        DerivaMLException: If file_types are invalid or execution_rid is not an execution record.

    Examples:
        Add files via an execution:
            >>> with ml.create_execution(config) as exe:
            ...     files = [FileSpec(url="path/to/file.txt", md5="abc123", length=1000)]
            ...     dataset = exe.add_files(files, dataset_types="text")
    """
    # Import here to avoid circular imports
    from deriva_ml.dataset.dataset import Dataset

    if self.resolve_rid(execution_rid).table.name != "Execution":
        raise DerivaMLTableTypeError("Execution", execution_rid)

    filespec_list = list(files)

    # Get a list of all defined file types and their synonyms.
    defined_types = set(
        chain.from_iterable([[t.name] + list(t.synonyms or []) for t in self.list_vocabulary_terms(MLVocab.asset_type)])
    )

    # Get a list of all of the file types used in the filespec_list
    spec_types = set(chain.from_iterable(filespec.file_types for filespec in filespec_list))

    # Now make sure that all of the file types and dataset_types in the spec list are defined.
    if spec_types - defined_types:
        raise DerivaMLInvalidTerm(MLVocab.asset_type.name, f"{spec_types - defined_types}")

    # Normalize dataset_types, make sure File type is included.
    if isinstance(dataset_types, list):
        dataset_types = ["File"] + dataset_types if "File" not in dataset_types else dataset_types
    else:
        dataset_types = ["File", dataset_types] if dataset_types else ["File"]
    for ds_type in dataset_types:
        self.lookup_term(MLVocab.dataset_type, ds_type)

    # Add files to the file table, and collect up the resulting entries by directory name.
    pb = self.pathBuilder()
    file_records = list(
        pb.schemas[self.ml_schema].tables["File"].insert([f.model_dump(by_alias=True) for f in filespec_list])
    )

    # Get the name of the association table between file_table and file_type and add file_type records
    atable = self.model.find_association(MLTable.file, MLVocab.asset_type)[0].name
    # Need to get a link between file record and file_types.
    type_map = {
        file_spec.md5: file_spec.file_types + ([] if "File" in file_spec.file_types else [])
        for file_spec in filespec_list
    }
    file_type_records = [
        {MLVocab.asset_type.value: file_type, "File": file_record["RID"]}
        for file_record in file_records
        for file_type in type_map[file_record["MD5"]]
    ]
    pb.schemas[self.ml_schema].tables[atable].insert(file_type_records)

    # Link files to the execution for provenance tracking.
    pb.schemas[self.ml_schema].File_Execution.insert(
        [
            {"File": file_record["RID"], "Execution": execution_rid, "Asset_Role": "Output"}
            for file_record in file_records
        ]
    )

    # Now create datasets to capture the original directory structure of the files.
    dir_rid_map = defaultdict(list)
    for e in file_records:
        dir_rid_map[Path(urlsplit(e["URL"]).path).parent].append(e["RID"])

    nested_datasets = []
    path_length = 0
    dataset = None
    # Start with the longest path so we get subdirectories first.
    for p, rids in sorted(dir_rid_map.items(), key=lambda kv: len(kv[0].parts), reverse=True):
        dataset = Dataset.create_dataset(
            self,  # type: ignore[arg-type]
            dataset_types=dataset_types,
            execution_rid=execution_rid,
            description=description,
        )
        members = rids
        if len(p.parts) < path_length:
            # Moving up one directory level, so create a nested dataset
            members = [m.dataset_rid for m in nested_datasets] + rids
            nested_datasets = []
        dataset.add_dataset_members(members=members, execution_rid=execution_rid)
        nested_datasets.append(dataset)
        path_length = len(p.parts)

    return dataset
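The nested-dataset step above first groups each inserted file's RID by the parent directory of its URL path, then walks the directories longest-path-first so subdirectory datasets are created before their parents. The grouping itself reduces to the following (`group_by_parent` is an illustrative name, not DerivaML API; URL paths are POSIX-style, hence PurePosixPath):

```python
from collections import defaultdict
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def group_by_parent(records: list[dict[str, str]]) -> dict[PurePosixPath, list[str]]:
    """Group RIDs by the parent directory of each record's URL path."""
    groups: dict[PurePosixPath, list[str]] = defaultdict(list)
    for rec in records:
        groups[PurePosixPath(urlsplit(rec["URL"]).path).parent].append(rec["RID"])
    return dict(groups)


records = [
    {"RID": "1-A", "URL": "https://host/data/imgs/a.png"},
    {"RID": "1-B", "URL": "https://host/data/imgs/b.png"},
    {"RID": "1-C", "URL": "https://host/data/c.csv"},
]
groups = group_by_parent(records)
print({str(k): v for k, v in sorted(groups.items())})
# → {'/data': ['1-C'], '/data/imgs': ['1-A', '1-B']}
```

Sorting the groups by number of path parts, descending, yields the deepest directories first, which is exactly the order `add_files` needs to nest child datasets inside their parent.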

add_page

add_page(title: str, content: str) -> None

Adds page to web interface.

Creates a new page in the catalog's web interface with the specified title and content. The page will be accessible through the catalog's navigation system.

Parameters:

    title (str, required): The title of the page, displayed in navigation and headers.
    content (str, required): The main content of the page; may include HTML markup.

Raises:

    DerivaMLException: If the page creation fails or the user lacks the necessary permissions.

Example

>>> ml.add_page(
...     title="Analysis Results",
...     content="<h1>Results</h1><p>Analysis completed successfully...</p>"
... )

Source code in src/deriva_ml/core/base.py
def add_page(self, title: str, content: str) -> None:
    """Adds page to web interface.

    Creates a new page in the catalog's web interface with the specified title and content. The page will be
    accessible through the catalog's navigation system.

    Args:
        title: The title of the page to be displayed in navigation and headers.
        content: The main content of the page can include HTML markup.

    Raises:
        DerivaMLException: If the page creation fails or the user lacks necessary permissions.

    Example:
        >>> ml.add_page(
        ...     title="Analysis Results",
        ...     content="<h1>Results</h1><p>Analysis completed successfully...</p>"
        ... )
    """
    # Insert page into www tables with title and content
    # Use default schema or first domain schema for www tables
    schema = self.default_schema or (sorted(self.domain_schemas)[0] if self.domain_schemas else None)
    if schema is None:
        raise DerivaMLException("No domain schema available for adding pages")
    self.pathBuilder().www.tables[schema].insert([{"Title": title, "Content": content}])

add_term

add_term(
    table: str | Table,
    term_name: str,
    description: str,
    synonyms: list[str] | None = None,
    exists_ok: bool = True,
) -> VocabularyTermHandle

Adds a term to a vocabulary table.

Creates a new standardized term with description and optional synonyms in a vocabulary table. Can either create a new term or return an existing one if it already exists.

Parameters:

- table (str | Table, required): Vocabulary table to add the term to (name or Table object).
- term_name (str, required): Primary name of the term (must be unique within the vocabulary).
- description (str, required): Explanation of the term's meaning and usage.
- synonyms (list[str] | None, default None): Alternative names for the term.
- exists_ok (bool, default True): If True, return the existing term if found. If False, raise an error.

Returns:

- VocabularyTermHandle: Object representing the created or existing term, with methods to modify it in the catalog.

Raises:

- DerivaMLException: If a term exists and exists_ok=False, or if the table is not a vocabulary table.

Examples:

Add a new tissue type:

>>> term = ml.add_term(
...     table="tissue_types",
...     term_name="epithelial",
...     description="Epithelial tissue type",
...     synonyms=["epithelium"]
... )
>>> # Modify the term
>>> term.description = "Updated description"
>>> term.synonyms = ("epithelium", "epithelial_tissue")

Attempt to add an existing term:

>>> term = ml.add_term("tissue_types", "epithelial", "...", exists_ok=True)

Source code in src/deriva_ml/core/mixins/vocabulary.py
@validate_call(config=ConfigDict(arbitrary_types_allowed=True))
def add_term(
    self,
    table: str | Table,
    term_name: str,
    description: str,
    synonyms: list[str] | None = None,
    exists_ok: bool = True,
) -> VocabularyTermHandle:
    """Adds a term to a vocabulary table.

    Creates a new standardized term with description and optional synonyms in a vocabulary table.
    Can either create a new term or return an existing one if it already exists.

    Args:
        table: Vocabulary table to add term to (name or Table object).
        term_name: Primary name of the term (must be unique within vocabulary).
        description: Explanation of term's meaning and usage.
        synonyms: Alternative names for the term.
        exists_ok: If True, return the existing term if found. If False, raise error.

    Returns:
        VocabularyTermHandle: Object representing the created or existing term, with
            methods to modify it in the catalog.

    Raises:
        DerivaMLException: If a term exists and exists_ok=False, or if the table is not a vocabulary table.

    Examples:
        Add a new tissue type:
            >>> term = ml.add_term(
            ...     table="tissue_types",
            ...     term_name="epithelial",
            ...     description="Epithelial tissue type",
            ...     synonyms=["epithelium"]
            ... )
            >>> # Modify the term
            >>> term.description = "Updated description"
            >>> term.synonyms = ("epithelium", "epithelial_tissue")

        Attempt to add an existing term:
            >>> term = ml.add_term("tissue_types", "epithelial", "...", exists_ok=True)
    """
    # Initialize an empty synonyms list if None
    synonyms = synonyms or []

    # Get table reference and validate if it is a vocabulary table
    vocab_table = self.model.name_to_table(table)
    pb = self.pathBuilder()
    if not (self.model.is_vocabulary(vocab_table)):
        raise DerivaMLTableTypeError("vocabulary", vocab_table.name)

    # Get schema and table names for path building
    schema_name = vocab_table.schema.name
    table_name = vocab_table.name
    cols = self.model.vocab_columns(vocab_table)

    try:
        # Attempt to insert a new term
        term_data = pb.schemas[schema_name].tables[table_name].insert(
            [
                {
                    cols["Name"]: term_name,
                    cols["Description"]: description,
                    cols["Synonyms"]: synonyms,
                }
            ],
            defaults={cols["ID"], cols["URI"]},
        )[0]
        term_handle = VocabularyTermHandle(ml=self, table=table_name, **term_data)
        # Invalidate cache for this vocabulary since we added a new term
        self.clear_vocabulary_cache(vocab_table)
        return term_handle
    except DataPathException as e:
        # Insert failed — check if it's because the term already exists
        # or because of some other database error (permissions, schema, etc.)
        try:
            existing_term = self.lookup_term(vocab_table, term_name)
        except DerivaMLInvalidTerm:
            # Term doesn't exist — the insert failed for another reason
            raise DerivaMLException(
                f"Failed to insert term '{term_name}' into {vocab_table.name}: {e}"
            ) from e
        # Term does exist — either return it or raise depending on exists_ok
        if not exists_ok:
            raise DerivaMLInvalidTerm(vocab_table.name, term_name, msg="term already exists")
        return existing_term
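The insert-then-lookup upsert flow in add_term can be sketched with an in-memory stand-in (a hypothetical dict replacing the catalog table; no Deriva calls):

```python
# Hypothetical in-memory vocabulary standing in for the catalog table.
vocab: dict[str, dict] = {}

def add_term_sketch(name: str, description: str, exists_ok: bool = True) -> dict:
    """Mirror add_term's control flow: try to insert; on conflict, look up."""
    if name not in vocab:
        # Insert succeeds: record the new term and return it.
        vocab[name] = {"Name": name, "Description": description}
        return vocab[name]
    # Insert would fail because the term already exists.
    if not exists_ok:
        raise ValueError(f"term already exists: {name}")
    return vocab[name]

first = add_term_sketch("epithelial", "Epithelial tissue type")
again = add_term_sketch("epithelial", "ignored", exists_ok=True)
```

Note that the real method additionally distinguishes "term already exists" from other insert failures (permissions, schema errors) by attempting a lookup before deciding which exception to raise.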

add_visible_column

add_visible_column(
    table: str | Table,
    context: str,
    column: str | list[str] | dict[str, Any],
    position: int | None = None,
) -> list[Any]

Add a column to the visible-columns list for a specific context.

Convenience method for adding columns without replacing the entire visible-columns annotation. Changes are staged until apply_annotations() is called.

Parameters:

- table (str | Table, required): Table name or Table object.
- context (str, required): The context to modify (e.g., "compact", "detailed", "entry").
- column (str | list[str] | dict[str, Any], required): Column to add. Can be:
    - String: column name (e.g., "Filename")
    - List: foreign key reference (e.g., ["schema", "fkey_name"])
    - Dict: pseudo-column definition
- position (int | None, default None): Position to insert at (0-indexed). If None, appends to the end.

Returns:

- list[Any]: The updated column list for the context.

Raises:

- DerivaMLException: If the context references another context.

Example

>>> ml.add_visible_column("Image", "compact", "Description")
>>> ml.add_visible_column("Image", "detailed", ["domain", "Image_Subject_fkey"], 1)
>>> ml.apply_annotations()

Source code in src/deriva_ml/core/mixins/annotation.py
@validate_call(config=ConfigDict(arbitrary_types_allowed=True))
def add_visible_column(
    self,
    table: str | Table,
    context: str,
    column: str | list[str] | dict[str, Any],
    position: int | None = None,
) -> list[Any]:
    """Add a column to the visible-columns list for a specific context.

    Convenience method for adding columns without replacing the entire
    visible-columns annotation. Changes are staged until apply_annotations()
    is called.

    Args:
        table: Table name or Table object.
        context: The context to modify (e.g., "compact", "detailed", "entry").
        column: Column to add. Can be:
            - String: column name (e.g., "Filename")
            - List: foreign key reference (e.g., ["schema", "fkey_name"])
            - Dict: pseudo-column definition
        position: Position to insert at (0-indexed). If None, appends to end.

    Returns:
        The updated column list for the context.

    Raises:
        DerivaMLException: If context references another context.

    Example:
        >>> ml.add_visible_column("Image", "compact", "Description")
        >>> ml.add_visible_column("Image", "detailed", ["domain", "Image_Subject_fkey"], 1)
        >>> ml.apply_annotations()
    """
    table_obj = self.model.name_to_table(table)

    # Get or create visible_columns annotation
    visible_cols = table_obj.annotations.get(VISIBLE_COLUMNS_TAG, {})
    if visible_cols is None:
        visible_cols = {}

    # Get or create the context list
    context_list = visible_cols.get(context, [])
    if isinstance(context_list, str):
        raise DerivaMLException(
            f"Context '{context}' references another context '{context_list}'. "
            "Set it explicitly first with set_visible_columns()."
        )

    # Make a copy to avoid modifying in place
    context_list = list(context_list)

    # Insert at position or append
    if position is not None:
        context_list.insert(position, column)
    else:
        context_list.append(column)

    # Update the annotation
    visible_cols[context] = context_list
    table_obj.annotations[VISIBLE_COLUMNS_TAG] = visible_cols

    return context_list

add_visible_foreign_key

add_visible_foreign_key(
    table: str | Table,
    context: str,
    foreign_key: list[str] | dict[str, Any],
    position: int | None = None,
) -> list[Any]

Add a foreign key to the visible-foreign-keys list for a specific context.

Convenience method for adding related tables without replacing the entire visible-foreign-keys annotation. Changes are staged until apply_annotations() is called.

Parameters:

- table (str | Table, required): Table name or Table object.
- context (str, required): The context to modify (typically "detailed" or "*").
- foreign_key (list[str] | dict[str, Any], required): Foreign key to add. Can be:
    - List: inbound foreign key reference (e.g., ["schema", "Other_Table_fkey"])
    - Dict: pseudo-column definition for complex relationships
- position (int | None, default None): Position to insert at (0-indexed). If None, appends to the end.

Returns:

- list[Any]: The updated foreign key list for the context.

Raises:

- DerivaMLException: If the context references another context.

Example

>>> ml.add_visible_foreign_key("Subject", "detailed", ["domain", "Image_Subject_fkey"])
>>> ml.apply_annotations()

Source code in src/deriva_ml/core/mixins/annotation.py
@validate_call(config=ConfigDict(arbitrary_types_allowed=True))
def add_visible_foreign_key(
    self,
    table: str | Table,
    context: str,
    foreign_key: list[str] | dict[str, Any],
    position: int | None = None,
) -> list[Any]:
    """Add a foreign key to the visible-foreign-keys list for a specific context.

    Convenience method for adding related tables without replacing the entire
    visible-foreign-keys annotation. Changes are staged until apply_annotations()
    is called.

    Args:
        table: Table name or Table object.
        context: The context to modify (typically "detailed" or "*").
        foreign_key: Foreign key to add. Can be:
            - List: inbound foreign key reference (e.g., ["schema", "Other_Table_fkey"])
            - Dict: pseudo-column definition for complex relationships
        position: Position to insert at (0-indexed). If None, appends to end.

    Returns:
        The updated foreign key list for the context.

    Raises:
        DerivaMLException: If context references another context.

    Example:
        >>> ml.add_visible_foreign_key("Subject", "detailed", ["domain", "Image_Subject_fkey"])
        >>> ml.apply_annotations()
    """
    table_obj = self.model.name_to_table(table)

    # Get or create visible_foreign_keys annotation
    visible_fkeys = table_obj.annotations.get(VISIBLE_FOREIGN_KEYS_TAG, {})
    if visible_fkeys is None:
        visible_fkeys = {}

    # Get or create the context list
    context_list = visible_fkeys.get(context, [])
    if isinstance(context_list, str):
        raise DerivaMLException(
            f"Context '{context}' references another context '{context_list}'. "
            "Set it explicitly first with set_visible_foreign_keys()."
        )

    # Make a copy to avoid modifying in place
    context_list = list(context_list)

    # Insert at position or append
    if position is not None:
        context_list.insert(position, foreign_key)
    else:
        context_list.append(foreign_key)

    # Update the annotation
    visible_fkeys[context] = context_list
    table_obj.annotations[VISIBLE_FOREIGN_KEYS_TAG] = visible_fkeys

    return context_list

add_workflow

add_workflow(workflow: Workflow) -> RID

Adds a workflow to the catalog.

Registers a new workflow in the catalog or returns the RID of an existing workflow with the same URL or checksum.

Each workflow represents a specific computational process or analysis pipeline.

Parameters:

- workflow (Workflow, required): Workflow object containing name, URL, type, version, and description.

Returns:

- RID: Resource Identifier of the added or existing workflow.

Raises:

- DerivaMLException: If workflow insertion fails or required fields are missing.

Examples:

>>> workflow = Workflow(
...     name="Gene Analysis",
...     url="https://github.com/org/repo/workflows/gene_analysis.py",
...     workflow_type="python_script",
...     version="1.0.0",
...     description="Analyzes gene expression patterns"
... )
>>> workflow_rid = ml.add_workflow(workflow)
Source code in src/deriva_ml/core/mixins/workflow.py
def add_workflow(self, workflow: Workflow) -> RID:
    """Adds a workflow to the catalog.

    Registers a new workflow in the catalog or returns the RID of an existing workflow with the same
    URL or checksum.

    Each workflow represents a specific computational process or analysis pipeline.

    Args:
        workflow: Workflow object containing name, URL, type, version, and description.

    Returns:
        RID: Resource Identifier of the added or existing workflow.

    Raises:
        DerivaMLException: If workflow insertion fails or required fields are missing.

    Examples:
        >>> workflow = Workflow(
        ...     name="Gene Analysis",
        ...     url="https://github.com/org/repo/workflows/gene_analysis.py",
        ...     workflow_type="python_script",
        ...     version="1.0.0",
        ...     description="Analyzes gene expression patterns"
        ... )
        >>> workflow_rid = ml.add_workflow(workflow)
    """
    # Check if a workflow already exists by URL or checksum
    if workflow_rid := self._find_workflow_rid_by_url(workflow.checksum or workflow.url):
        return workflow_rid

    # Get an ML schema path for the workflow table
    ml_schema_path = self.pathBuilder().schemas[self.ml_schema]

    try:
        # Create a workflow record (without Workflow_Type column)
        workflow_record = {
            "URL": workflow.url,
            "Name": workflow.name,
            "Description": workflow.description,
            "Checksum": workflow.checksum,
            "Version": workflow.version,
        }
        # Insert a workflow and get its RID
        workflow_rid = ml_schema_path.Workflow.insert([workflow_record])[0]["RID"]

        # Insert workflow type associations
        assoc_path = ml_schema_path.Workflow_Workflow_Type
        for wt in workflow.workflow_type:
            type_name = self.lookup_term(MLVocab.workflow_type, wt).name
            assoc_path.insert([{"Workflow": workflow_rid, MLVocab.workflow_type: type_name}])
    except Exception as e:
        error = format_exception(e)
        raise DerivaMLException(f"Failed to insert workflow. Error: {error}")
    return workflow_rid

apply_annotations

apply_annotations() -> None

Apply all staged annotation changes to the catalog.

Commits any annotation changes made via set_display_annotation, set_visible_columns, set_visible_foreign_keys, set_table_display, or set_column_display to the remote catalog.

Example

>>> ml.set_display_annotation("Image", {"name": "Images"})
>>> ml.set_visible_columns("Image", {"compact": ["RID", "Filename"]})
>>> ml.apply_annotations()  # Commit all changes

Source code in src/deriva_ml/core/mixins/annotation.py
def apply_annotations(self) -> None:
    """Apply all staged annotation changes to the catalog.

    Commits any annotation changes made via set_display_annotation,
    set_visible_columns, set_visible_foreign_keys, set_table_display,
    or set_column_display to the remote catalog.

    Example:
        >>> ml.set_display_annotation("Image", {"name": "Images"})
        >>> ml.set_visible_columns("Image", {"compact": ["RID", "Filename"]})
        >>> ml.apply_annotations()  # Commit all changes
    """
    self.model.apply()

apply_catalog_annotations

apply_catalog_annotations(
    navbar_brand_text: str = "ML Data Browser",
    head_title: str = "Catalog ML",
) -> None

Apply catalog-level annotations including the navigation bar and display settings.

This method configures the Chaise web interface for the catalog. Chaise is Deriva's web-based data browser that provides a user-friendly interface for exploring and managing catalog data. This method sets up annotations that control how Chaise displays and organizes the catalog.

Navigation Bar Structure: The method creates a navigation bar with the following menus:

- User Info: Links to Users, Groups, and RID Lease tables
- Deriva-ML: Core ML tables (Workflow, Execution, Dataset, Dataset_Version, etc.)
- WWW: Web content tables (Page, File)
- {Domain Schema}: All domain-specific tables (excludes vocabularies and associations)
- Vocabulary: All controlled vocabulary tables from both ML and domain schemas
- Assets: All asset tables from both ML and domain schemas
- Features: All feature tables with entries named "TableName:FeatureName"
- Catalog Registry: Link to the ermrest registry
- Documentation: Links to ML notebook instructions and Deriva-ML docs

Display Settings:

- Underscores in table/column names displayed as spaces
- System columns (RID) shown in compact and entry views
- Default table set to Dataset
- Faceting and record deletion enabled
- Export configurations available to all users

Bulk Upload Configuration: Configures upload patterns for asset tables, enabling drag-and-drop file uploads through the Chaise interface.

Call this after creating the domain schema and all tables to initialize the catalog's web interface. The navigation menus are dynamically built based on the current schema structure, automatically organizing tables into appropriate categories.

Parameters:

- navbar_brand_text (str, default 'ML Data Browser'): Text displayed in the navigation bar brand area.
- head_title (str, default 'Catalog ML'): Title displayed in the browser tab.

Example

>>> ml = DerivaML('deriva.example.org', 'my_catalog')
>>> # After creating domain schema and tables...
>>> ml.apply_catalog_annotations()
>>> # Or with custom branding:
>>> ml.apply_catalog_annotations("My Project Browser", "My ML Project")

Source code in src/deriva_ml/core/base.py
def apply_catalog_annotations(
    self,
    navbar_brand_text: str = "ML Data Browser",
    head_title: str = "Catalog ML",
) -> None:
    """Apply catalog-level annotations including the navigation bar and display settings.

    This method configures the Chaise web interface for the catalog. Chaise is Deriva's
    web-based data browser that provides a user-friendly interface for exploring and
    managing catalog data. This method sets up annotations that control how Chaise
    displays and organizes the catalog.

    **Navigation Bar Structure**:
    The method creates a navigation bar with the following menus:
    - **User Info**: Links to Users, Groups, and RID Lease tables
    - **Deriva-ML**: Core ML tables (Workflow, Execution, Dataset, Dataset_Version, etc.)
    - **WWW**: Web content tables (Page, File)
    - **{Domain Schema}**: All domain-specific tables (excludes vocabularies and associations)
    - **Vocabulary**: All controlled vocabulary tables from both ML and domain schemas
    - **Assets**: All asset tables from both ML and domain schemas
    - **Features**: All feature tables with entries named "TableName:FeatureName"
    - **Catalog Registry**: Link to the ermrest registry
    - **Documentation**: Links to ML notebook instructions and Deriva-ML docs

    **Display Settings**:
    - Underscores in table/column names displayed as spaces
    - System columns (RID) shown in compact and entry views
    - Default table set to Dataset
    - Faceting and record deletion enabled
    - Export configurations available to all users

    **Bulk Upload Configuration**:
    Configures upload patterns for asset tables, enabling drag-and-drop file uploads
    through the Chaise interface.

    Call this after creating the domain schema and all tables to initialize the catalog's
    web interface. The navigation menus are dynamically built based on the current schema
    structure, automatically organizing tables into appropriate categories.

    Args:
        navbar_brand_text: Text displayed in the navigation bar brand area.
        head_title: Title displayed in the browser tab.

    Example:
        >>> ml = DerivaML('deriva.example.org', 'my_catalog')
        >>> # After creating domain schema and tables...
        >>> ml.apply_catalog_annotations()
        >>> # Or with custom branding:
        >>> ml.apply_catalog_annotations("My Project Browser", "My ML Project")
    """
    catalog_id = self.model.catalog.catalog_id
    ml_schema = self.ml_schema

    # Build domain schema menu items (one menu per domain schema)
    domain_schema_menus = []
    for domain_schema in sorted(self.domain_schemas):
        if domain_schema not in self.model.schemas:
            continue
        domain_schema_menus.append({
            "name": domain_schema,
            "children": [
                {
                    "name": tname,
                    "url": f"/chaise/recordset/#{catalog_id}/{domain_schema}:{tname}",
                }
                for tname in self.model.schemas[domain_schema].tables
                # Don't include controlled vocabularies, association tables, or feature tables.
                if not (
                    self.model.is_vocabulary(tname)
                    or self.model.is_association(tname, pure=False, max_arity=3)
                )
            ],
        })

    # Build vocabulary menu items (ML schema + all domain schemas)
    vocab_children = [{"name": f"{ml_schema} Vocabularies", "header": True}]
    vocab_children.extend([
        {
            "url": f"/chaise/recordset/#{catalog_id}/{ml_schema}:{tname}",
            "name": tname,
        }
        for tname in self.model.schemas[ml_schema].tables
        if self.model.is_vocabulary(tname)
    ])
    for domain_schema in sorted(self.domain_schemas):
        if domain_schema not in self.model.schemas:
            continue
        vocab_children.append({"name": f"{domain_schema} Vocabularies", "header": True})
        vocab_children.extend([
            {
                "url": f"/chaise/recordset/#{catalog_id}/{domain_schema}:{tname}",
                "name": tname,
            }
            for tname in self.model.schemas[domain_schema].tables
            if self.model.is_vocabulary(tname)
        ])

    # Build asset menu items (ML schema + all domain schemas)
    asset_children = [
        {
            "url": f"/chaise/recordset/#{catalog_id}/{ml_schema}:{tname}",
            "name": tname,
        }
        for tname in self.model.schemas[ml_schema].tables
        if self.model.is_asset(tname)
    ]
    for domain_schema in sorted(self.domain_schemas):
        if domain_schema not in self.model.schemas:
            continue
        asset_children.extend([
            {
                "url": f"/chaise/recordset/#{catalog_id}/{domain_schema}:{tname}",
                "name": tname,
            }
            for tname in self.model.schemas[domain_schema].tables
            if self.model.is_asset(tname)
        ])

    catalog_annotation = {
        deriva_tags.display: {"name_style": {"underline_space": True}},
        deriva_tags.chaise_config: {
            "headTitle": head_title,
            "navbarBrandText": navbar_brand_text,
            "systemColumnsDisplayEntry": ["RID"],
            "systemColumnsDisplayCompact": ["RID"],
            "defaultTable": {"table": "Dataset", "schema": "deriva-ml"},
            "deleteRecord": True,
            "showFaceting": True,
            "shareCiteAcls": True,
            "exportConfigsSubmenu": {"acls": {"show": ["*"], "enable": ["*"]}},
            "resolverImplicitCatalog": False,
            "navbarMenu": {
                "newTab": False,
                "children": [
                    {
                        "name": "User Info",
                        "children": [
                            {
                                "url": f"/chaise/recordset/#{catalog_id}/public:ERMrest_Client",
                                "name": "Users",
                            },
                            {
                                "url": f"/chaise/recordset/#{catalog_id}/public:ERMrest_Group",
                                "name": "Groups",
                            },
                            {
                                "url": f"/chaise/recordset/#{catalog_id}/public:ERMrest_RID_Lease",
                                "name": "ERMrest RID Lease",
                            },
                        ],
                    },
                    {  # All the primary tables in deriva-ml schema.
                        "name": "Deriva-ML",
                        "children": [
                            {
                                "url": f"/chaise/recordset/#{catalog_id}/{ml_schema}:Workflow",
                                "name": "Workflow",
                            },
                            {
                                "url": f"/chaise/recordset/#{catalog_id}/{ml_schema}:Execution",
                                "name": "Execution",
                            },
                            {
                                "url": f"/chaise/recordset/#{catalog_id}/{ml_schema}:Execution_Metadata",
                                "name": "Execution Metadata",
                            },
                            {
                                "url": f"/chaise/recordset/#{catalog_id}/{ml_schema}:Execution_Asset",
                                "name": "Execution Asset",
                            },
                            {
                                "url": f"/chaise/recordset/#{catalog_id}/{ml_schema}:Dataset",
                                "name": "Dataset",
                            },
                            {
                                "url": f"/chaise/recordset/#{catalog_id}/{ml_schema}:Dataset_Version",
                                "name": "Dataset Version",
                            },
                        ],
                    },
                    {  # WWW schema tables.
                        "name": "WWW",
                        "children": [
                            {
                                "url": f"/chaise/recordset/#{catalog_id}/WWW:Page",
                                "name": "Page",
                            },
                            {
                                "url": f"/chaise/recordset/#{catalog_id}/WWW:File",
                                "name": "File",
                            },
                        ],
                    },
                    *domain_schema_menus,  # One menu per domain schema
                    {  # Vocabulary menu with all controlled vocabularies.
                        "name": "Vocabulary",
                        "children": vocab_children,
                    },
                    {  # List of all asset tables.
                        "name": "Assets",
                        "children": asset_children,
                    },
                    {  # List of all feature tables in the catalog.
                        "name": "Features",
                        "children": [
                            {
                                "url": f"/chaise/recordset/#{catalog_id}/{f.feature_table.schema.name}:{f.feature_table.name}",
                                "name": f"{f.target_table.name}:{f.feature_name}",
                            }
                            for f in self.model.find_features()
                        ],
                    },
                    {
                        "url": "/chaise/recordset/#0/ermrest:registry@sort(RID)",
                        "name": "Catalog Registry",
                    },
                    {
                        "name": "Documentation",
                        "children": [
                            {
                                "url": "https://github.com/informatics-isi-edu/deriva-ml/blob/main/docs/ml_workflow_instruction.md",
                                "name": "ML Notebook Instruction",
                            },
                            {
                                "url": "https://informatics-isi-edu.github.io/deriva-ml/",
                                "name": "Deriva-ML Documentation",
                            },
                        ],
                    },
                ],
            },
        },
        deriva_tags.bulk_upload: bulk_upload_configuration(model=self.model),
    }
    self.model.annotations.update(catalog_annotation)
    self.model.apply()

asset_record_class

asset_record_class(
    asset_table_name: str,
) -> type

Create a dynamically generated Pydantic model for an asset table's metadata.

The returned class is a subclass of AssetRecord with fields derived from the asset table's metadata columns (non-system, non-standard-asset columns). Fields are typed according to their database column type, and nullable columns are Optional.

Follows the same pattern as Feature.feature_record_class().

Parameters:

- `asset_table_name` (str, required): Name of the asset table (e.g., "Image", "Model").

Returns:

- `type`: An AssetRecord subclass with validated fields matching the table's metadata.

Example:

    ImageAsset = ml.asset_record_class("Image")
    record = ImageAsset(Subject="2-DEF", Acquisition_Date="2026-01-15")
    path = exe.asset_file_path("Image", "scan.jpg", metadata=record)

Source code in src/deriva_ml/core/mixins/asset.py
def asset_record_class(self, asset_table_name: str) -> type:
    """Create a dynamically generated Pydantic model for an asset table's metadata.

    The returned class is a subclass of AssetRecord with fields derived from
    the asset table's metadata columns (non-system, non-standard-asset columns).
    Fields are typed according to their database column type, and nullable columns
    are Optional.

    Follows the same pattern as ``Feature.feature_record_class()``.

    Args:
        asset_table_name: Name of the asset table (e.g., "Image", "Model").

    Returns:
        An AssetRecord subclass with validated fields matching the table's metadata.

    Example:
        >>> ImageAsset = ml.asset_record_class("Image")
        >>> record = ImageAsset(Subject="2-DEF", Acquisition_Date="2026-01-15")
        >>> path = exe.asset_file_path("Image", "scan.jpg", metadata=record)
    """
    from deriva_ml.asset.asset_record import asset_record_class
    return asset_record_class(self.model, asset_table_name)

bag_info

bag_info(
    dataset: "DatasetSpec",
) -> dict[str, Any]

Get comprehensive info about a dataset bag: size, contents, and cache status.

Combines the size estimate with local cache status. Use this to decide whether to prefetch a bag before running an experiment.

Parameters:

- `dataset` (DatasetSpec, required): Specification of the dataset, including version and optional exclude_tables.

Returns:

- `dict[str, Any]` with keys:
    - tables: dict mapping table name to {row_count, is_asset, asset_bytes}
    - total_rows, total_asset_bytes, total_asset_size
    - cache_status: one of "not_cached", "cached_metadata_only", "cached_materialized", "cached_incomplete"
    - cache_path: local path to the cached bag (if cached), else None
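For context, a small sketch of how the returned dict might drive a prefetch decision. The dict literal below is illustrative only (not real catalog output); its keys follow the documented shape, and `should_prefetch` is a hypothetical helper, not part of DerivaML:

```python
# Illustrative sketch: the dict mimics the documented bag_info() return
# shape; with a live catalog you would call ml.bag_info(DatasetSpec(...)).

def should_prefetch(info: dict) -> bool:
    """Prefetch unless the bag is already fully materialized locally."""
    return info["cache_status"] != "cached_materialized"

info = {
    "tables": {"Image": {"row_count": 120, "is_asset": True, "asset_bytes": 5_000_000}},
    "total_rows": 120,
    "total_asset_bytes": 5_000_000,
    "total_asset_size": "5.0 MB",
    "cache_status": "not_cached",
    "cache_path": None,
}
print(should_prefetch(info))  # True: bag is not cached yet
```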

Source code in src/deriva_ml/core/mixins/dataset.py
def bag_info(
    self,
    dataset: "DatasetSpec",
) -> dict[str, Any]:
    """Get comprehensive info about a dataset bag: size, contents, and cache status.

    Combines the size estimate with local cache status. Use this to decide
    whether to prefetch a bag before running an experiment.

    Args:
        dataset: Specification of the dataset, including version and
            optional exclude_tables.

    Returns:
        dict with keys:
            - tables: dict mapping table name to {row_count, is_asset, asset_bytes}
            - total_rows, total_asset_bytes, total_asset_size
            - cache_status: one of "not_cached", "cached_metadata_only",
              "cached_materialized", "cached_incomplete"
            - cache_path: local path to cached bag (if cached), else None
    """
    if not self.model.is_dataset_rid(dataset.rid):
        raise DerivaMLTableTypeError("Dataset", dataset.rid)
    ds = self.lookup_dataset(dataset)
    return ds.bag_info(
        version=dataset.version,
        exclude_tables=dataset.exclude_tables,
    )

cache_dataset

cache_dataset(
    dataset: "DatasetSpec",
    materialize: bool = True,
) -> dict[str, Any]

Download a dataset bag into the local cache without creating an execution.

Use this to warm the cache before running experiments. No execution or provenance records are created.

Parameters:

- `dataset` (DatasetSpec, required): Specification of the dataset, including version and optional exclude_tables.
- `materialize` (bool, default True): If True, download all asset files; if False, download only table metadata.

Returns:

- `dict[str, Any]`: bag_info results after caching.
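One plausible cache-warming pattern, sketched under assumptions: check free disk space before materializing asset files. `can_materialize` is a hypothetical helper, and the commented calls assume a connected DerivaML instance with a made-up dataset RID:

```python
import shutil

def can_materialize(required_bytes: int, path: str = ".") -> bool:
    """Hypothetical guard: materialize only if enough free disk space."""
    return shutil.disk_usage(path).free >= required_bytes

# With a live catalog (RID "4HM" is illustrative):
# spec = DatasetSpec(rid="4HM")
# size = ml.bag_info(spec)["total_asset_bytes"]
# ml.cache_dataset(spec, materialize=can_materialize(size))
print(can_materialize(0))  # zero bytes required, so True
```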

Source code in src/deriva_ml/core/mixins/dataset.py
def cache_dataset(
    self,
    dataset: "DatasetSpec",
    materialize: bool = True,
) -> dict[str, Any]:
    """Download a dataset bag into the local cache without creating an execution.

    Use this to warm the cache before running experiments. No execution or
    provenance records are created.

    Args:
        dataset: Specification of the dataset, including version and
            optional exclude_tables.
        materialize: If True (default), download all asset files. If False,
            download only table metadata.

    Returns:
        dict with bag_info results after caching.
    """
    if not self.model.is_dataset_rid(dataset.rid):
        raise DerivaMLTableTypeError("Dataset", dataset.rid)
    ds = self.lookup_dataset(dataset)
    return ds.cache(
        version=dataset.version,
        materialize=materialize,
        exclude_tables=dataset.exclude_tables,
        timeout=dataset.timeout,
        fetch_concurrency=dataset.fetch_concurrency,
    )

cache_features

cache_features(
    table_name: str,
    feature_name: str,
    force: bool = False,
    **kwargs,
) -> "pd.DataFrame"

Fetch feature values from the catalog and cache locally.

On first call, fetches all feature values and stores in the working data cache. Subsequent calls return cached data.

Parameters:

- `table_name` (str, required): Table the feature is attached to (e.g., "Image").
- `feature_name` (str, required): Name of the feature (e.g., "Classification").
- `force` (bool, default False): If True, re-fetch even if already cached.
- `**kwargs`: Additional arguments passed to fetch_table_features (e.g., selector, workflow, execution).

Returns:

- `pd.DataFrame`: DataFrame with feature value records.

Example:

    labels = ml.cache_features("Image", "Classification")
    print(labels["Diagnosis_Type"].value_counts())
Source code in src/deriva_ml/core/base.py
def cache_features(
    self,
    table_name: str,
    feature_name: str,
    force: bool = False,
    **kwargs,
) -> "pd.DataFrame":
    """Fetch feature values from the catalog and cache locally.

    On first call, fetches all feature values and stores in the working
    data cache. Subsequent calls return cached data.

    Args:
        table_name: Table the feature is attached to (e.g., "Image").
        feature_name: Name of the feature (e.g., "Classification").
        force: If True, re-fetch even if already cached.
        **kwargs: Additional arguments passed to ``fetch_table_features``
            (e.g., ``selector``, ``workflow``, ``execution``).

    Returns:
        DataFrame with feature value records.

    Example::

        labels = ml.cache_features("Image", "Classification")
        print(labels["Diagnosis_Type"].value_counts())
    """
    import pandas as pd

    cache_key = f"features_{table_name}_{feature_name}"
    if not force and self.working_data.has_table(cache_key):
        return self.working_data.read_table(cache_key)

    features = self.fetch_table_features(
        table_name, feature_name=feature_name, **kwargs
    )
    records = [
        r.model_dump(mode="json") for r in features.get(feature_name, [])
    ]
    df = pd.DataFrame(records)
    self.working_data.cache_table(cache_key, df)
    return df

cache_table

cache_table(
    table_name: str, force: bool = False
) -> "pd.DataFrame"

Fetch a table from the catalog and cache locally as SQLite.

On first call, fetches all rows from the catalog and stores in the working data cache. Subsequent calls return the cached data without contacting the catalog. Use force=True to re-fetch.

Parameters:

- `table_name` (str, required): Name of the table to fetch (e.g., "Subject", "Image").
- `force` (bool, default False): If True, re-fetch even if already cached.

Returns:

- `pd.DataFrame`: DataFrame with the table contents.

Example:

    subjects = ml.cache_table("Subject")
    print(f"{len(subjects)} subjects")

    # Second call returns cached data instantly
    subjects = ml.cache_table("Subject")
Source code in src/deriva_ml/core/base.py
def cache_table(self, table_name: str, force: bool = False) -> "pd.DataFrame":
    """Fetch a table from the catalog and cache locally as SQLite.

    On first call, fetches all rows from the catalog and stores in the
    working data cache. Subsequent calls return the cached data without
    contacting the catalog. Use ``force=True`` to re-fetch.

    Args:
        table_name: Name of the table to fetch (e.g., "Subject", "Image").
        force: If True, re-fetch even if already cached.

    Returns:
        DataFrame with the table contents.

    Example::

        subjects = ml.cache_table("Subject")
        print(f"{len(subjects)} subjects")

        # Second call returns cached data instantly
        subjects = ml.cache_table("Subject")
    """
    import pandas as pd

    if not force and self.working_data.has_table(table_name):
        return self.working_data.read_table(table_name)

    df = self.get_table_as_dataframe(table_name)
    self.working_data.cache_table(table_name, df)
    return df

catalog_snapshot

catalog_snapshot(
    version_snapshot: str,
) -> Self

Return a new DerivaML instance connected to a specific catalog snapshot.

Catalog snapshots provide a read-only, point-in-time view of the catalog. The snapshot identifier is typically obtained from a dataset version record.

Parameters:

- `version_snapshot` (str, required): Snapshot identifier string (e.g., "2T-SXEH-JH4A"), usually the snapshot field from a DatasetHistory entry.

Returns:

- `Self`: A new DerivaML instance connected to the specified catalog snapshot.
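As a hedged sketch of the intended workflow: pin an analysis to the snapshot recorded with a dataset version. The history entry below is a made-up example (field names other than `snapshot` are assumptions), and the actual call requires a live catalog, so it is commented out:

```python
# Illustrative history entry; "snapshot" uses the documented example id.
history_entry = {"dataset_version": "1.2.0", "snapshot": "2T-SXEH-JH4A"}

# With a connected DerivaML instance:
# ml_pinned = ml.catalog_snapshot(history_entry["snapshot"])
# ml_pinned gives a read-only view of the catalog at that point in time,
# so queries (e.g., ml_pinned.find_datasets()) reflect that snapshot.
print(history_entry["snapshot"])
```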

Source code in src/deriva_ml/core/base.py
def catalog_snapshot(self, version_snapshot: str) -> Self:
    """Return a new DerivaML instance connected to a specific catalog snapshot.

    Catalog snapshots provide a read-only, point-in-time view of the catalog.
    The snapshot identifier is typically obtained from a dataset version record.

    Args:
        version_snapshot: Snapshot identifier string (e.g., ``"2T-SXEH-JH4A"``),
            usually the ``snapshot`` field from a :class:`DatasetHistory` entry.

    Returns:
        A new DerivaML instance connected to the specified catalog snapshot.
    """
    return DerivaML(
        self.host_name,
        version_snapshot,
        logging_level=self._logging_level,
        deriva_logging_level=self._deriva_logging_level,
    )

chaise_url

chaise_url(
    table: RID | Table | str,
) -> str

Generates Chaise web interface URL.

Chaise is Deriva's web interface for data exploration. This method creates a URL that directly links to the specified table or record.

Parameters:

- `table` (RID | Table | str, required): Table to generate a URL for (name, Table object, or RID).

Returns:

- `str`: URL in the format https://{host}/chaise/recordset/#{catalog}/{schema}:{table}

Raises:

- `DerivaMLException`: If the table or RID cannot be found.

Examples:

    >>> ml.chaise_url("experiment_table")
    'https://deriva.org/chaise/recordset/#1/schema:experiment_table'

    >>> ml.chaise_url("1-abc123")

Source code in src/deriva_ml/core/base.py
def chaise_url(self, table: RID | Table | str) -> str:
    """Generates Chaise web interface URL.

    Chaise is Deriva's web interface for data exploration. This method creates a URL that directly links to
    the specified table or record.

    Args:
        table: Table to generate URL for (name, Table object, or RID).

    Returns:
        str: URL in format: https://{host}/chaise/recordset/#{catalog}/{schema}:{table}

    Raises:
        DerivaMLException: If table or RID cannot be found.

    Examples:
        Using table name:
            >>> ml.chaise_url("experiment_table")
            'https://deriva.org/chaise/recordset/#1/schema:experiment_table'

        Using RID:
            >>> ml.chaise_url("1-abc123")
    """
    # Get the table object and build base URI
    table_obj = self.model.name_to_table(table)
    try:
        uri = self.catalog.get_server_uri().replace("ermrest/catalog/", "chaise/recordset/#")
    except DerivaMLException:
        # Handle RID case
        uri = self.cite(cast(str, table))
    return f"{uri}/{urlquote(table_obj.schema.name)}:{urlquote(table_obj.name)}"

cite

cite(
    entity: Dict[str, Any] | str,
    current: bool = False,
) -> str

Generates citation URL for an entity.

Creates a URL that can be used to reference a specific entity in the catalog. By default, includes the catalog snapshot time to ensure version stability (permanent citation). With current=True, returns a URL to the current state.

Parameters:

- `entity` (Dict[str, Any] | str, required): Either a RID string or a dictionary containing entity data with a 'RID' key.
- `current` (bool, default False): If True, return a URL to the current catalog state (no snapshot). If False, return a permanent citation URL with the snapshot time.

Returns:

- `str`: Citation URL. The format depends on the current parameter:
    - current=False: https://{host}/id/{catalog}/{rid}@{snapshot_time}
    - current=True: https://{host}/id/{catalog}/{rid}

Raises:

- `DerivaMLException`: If the entity doesn't exist or lacks a RID.

Examples:

    >>> url = ml.cite("1-abc123")                 # permanent citation (default)
    >>> print(url)
    'https://deriva.org/id/1/1-abc123@2024-01-01T12:00:00'

    >>> url = ml.cite("1-abc123", current=True)   # current catalog URL
    >>> print(url)
    'https://deriva.org/id/1/1-abc123'

    >>> url = ml.cite({"RID": "1-abc123"})        # using a dictionary

Source code in src/deriva_ml/core/base.py
def cite(self, entity: Dict[str, Any] | str, current: bool = False) -> str:
    """Generates citation URL for an entity.

    Creates a URL that can be used to reference a specific entity in the catalog.
    By default, includes the catalog snapshot time to ensure version stability
    (permanent citation). With current=True, returns a URL to the current state.

    Args:
        entity: Either a RID string or a dictionary containing entity data with a 'RID' key.
        current: If True, return URL to current catalog state (no snapshot).
                 If False (default), return permanent citation URL with snapshot time.

    Returns:
        str: Citation URL. Format depends on `current` parameter:
            - current=False: https://{host}/id/{catalog}/{rid}@{snapshot_time}
            - current=True: https://{host}/id/{catalog}/{rid}

    Raises:
        DerivaMLException: If an entity doesn't exist or lacks a RID.

    Examples:
        Permanent citation (default):
            >>> url = ml.cite("1-abc123")
            >>> print(url)
            'https://deriva.org/id/1/1-abc123@2024-01-01T12:00:00'

        Current catalog URL:
            >>> url = ml.cite("1-abc123", current=True)
            >>> print(url)
            'https://deriva.org/id/1/1-abc123'

        Using a dictionary:
            >>> url = ml.cite({"RID": "1-abc123"})
    """
    # Return if already a citation URL
    if isinstance(entity, str) and entity.startswith(f"https://{self.host_name}/id/{self.catalog_id}/"):
        return entity

    try:
        # Resolve RID and create citation URL
        self.resolve_rid(rid := entity if isinstance(entity, str) else entity["RID"])
        base_url = f"https://{self.host_name}/id/{self.catalog_id}/{rid}"
        if current:
            return base_url
        return f"{base_url}@{self.catalog.latest_snapshot().snaptime}"
    except KeyError as e:
        raise DerivaMLException(f"Entity {e} does not have RID column")
    except DerivaMLException as _e:
        raise DerivaMLException("Entity RID does not exist")

clean_execution_dirs

clean_execution_dirs(
    older_than_days: int | None = None,
    exclude_rids: list[str] | None = None,
) -> dict[str, int]

Clean up execution working directories.

Removes execution output directories from the local working directory. Use this to free up disk space from completed or orphaned executions.

Parameters:

- `older_than_days` (int | None, default None): If provided, only remove directories older than this many days. If None, removes all execution directories (except excluded).
- `exclude_rids` (list[str] | None, default None): List of execution RIDs to preserve (never remove).

Returns:

- `dict[str, int]` with keys:
    - 'dirs_removed': Number of directories removed
    - 'bytes_freed': Total bytes freed
    - 'errors': Number of removal errors

Example:

    ml = DerivaML('deriva.example.org', 'my_catalog')

    # Clean all execution dirs older than 30 days
    result = ml.clean_execution_dirs(older_than_days=30)
    print(f"Freed {result['bytes_freed'] / 1e9:.2f} GB")

    # Clean all except specific executions
    result = ml.clean_execution_dirs(exclude_rids=['1-ABC', '1-DEF'])

Source code in src/deriva_ml/core/base.py
def clean_execution_dirs(
    self,
    older_than_days: int | None = None,
    exclude_rids: list[str] | None = None,
) -> dict[str, int]:
    """Clean up execution working directories.

    Removes execution output directories from the local working directory.
    Use this to free up disk space from completed or orphaned executions.

    Args:
        older_than_days: If provided, only remove directories older than this
            many days. If None, removes all execution directories (except excluded).
        exclude_rids: List of execution RIDs to preserve (never remove).

    Returns:
        dict with keys:
            - 'dirs_removed': Number of directories removed
            - 'bytes_freed': Total bytes freed
            - 'errors': Number of removal errors

    Example:
        >>> ml = DerivaML('deriva.example.org', 'my_catalog')
        >>> # Clean all execution dirs older than 30 days
        >>> result = ml.clean_execution_dirs(older_than_days=30)
        >>> print(f"Freed {result['bytes_freed'] / 1e9:.2f} GB")
        >>>
        >>> # Clean all except specific executions
        >>> result = ml.clean_execution_dirs(exclude_rids=['1-ABC', '1-DEF'])
    """
    import shutil
    import time

    from deriva_ml.dataset.upload import upload_root

    stats = {'dirs_removed': 0, 'bytes_freed': 0, 'errors': 0}
    exclude_rids = set(exclude_rids or [])

    exec_root = upload_root(self.working_dir) / "execution"
    if not exec_root.exists():
        return stats

    cutoff_time = None
    if older_than_days is not None:
        cutoff_time = time.time() - (older_than_days * 24 * 60 * 60)

    for entry in exec_root.iterdir():
        if not entry.is_dir():
            continue

        # Skip excluded RIDs
        if entry.name in exclude_rids:
            continue

        try:
            # Check age if filtering
            if cutoff_time is not None:
                entry_mtime = entry.stat().st_mtime
                if entry_mtime > cutoff_time:
                    continue

            # Calculate size before removal
            entry_size = sum(f.stat().st_size for f in entry.rglob('*') if f.is_file())
            shutil.rmtree(entry)
            stats['dirs_removed'] += 1
            stats['bytes_freed'] += entry_size

        except (OSError, PermissionError) as e:
            self._logger.warning(f"Failed to remove execution dir {entry}: {e}")
            stats['errors'] += 1

    return stats

clear_cache

clear_cache(
    older_than_days: int | None = None,
) -> dict[str, int]

Clear the dataset cache directory.

Removes cached dataset bags from the cache directory. Can optionally filter by age to only remove old cache entries.

Parameters:

- `older_than_days` (int | None, default None): If provided, only remove cache entries older than this many days. If None, removes all cache entries.

Returns:

- `dict[str, int]` with keys:
    - 'files_removed': Number of files removed
    - 'dirs_removed': Number of directories removed
    - 'bytes_freed': Total bytes freed
    - 'errors': Number of removal errors

Example:

    ml = DerivaML('deriva.example.org', 'my_catalog')

    # Clear all cache
    result = ml.clear_cache()
    print(f"Freed {result['bytes_freed'] / 1e6:.1f} MB")

    # Clear cache older than 7 days
    result = ml.clear_cache(older_than_days=7)

Source code in src/deriva_ml/core/base.py
def clear_cache(self, older_than_days: int | None = None) -> dict[str, int]:
    """Clear the dataset cache directory.

    Removes cached dataset bags from the cache directory. Can optionally filter
    by age to only remove old cache entries.

    Args:
        older_than_days: If provided, only remove cache entries older than this
            many days. If None, removes all cache entries.

    Returns:
        dict with keys:
            - 'files_removed': Number of files removed
            - 'dirs_removed': Number of directories removed
            - 'bytes_freed': Total bytes freed
            - 'errors': Number of removal errors

    Example:
        >>> ml = DerivaML('deriva.example.org', 'my_catalog')
        >>> # Clear all cache
        >>> result = ml.clear_cache()
        >>> print(f"Freed {result['bytes_freed'] / 1e6:.1f} MB")
        >>>
        >>> # Clear cache older than 7 days
        >>> result = ml.clear_cache(older_than_days=7)
    """
    import shutil
    import time

    stats = {'files_removed': 0, 'dirs_removed': 0, 'bytes_freed': 0, 'errors': 0}

    if not self.cache_dir.exists():
        return stats

    cutoff_time = None
    if older_than_days is not None:
        cutoff_time = time.time() - (older_than_days * 24 * 60 * 60)

    try:
        for entry in self.cache_dir.iterdir():
            try:
                # Check age if filtering
                if cutoff_time is not None:
                    entry_mtime = entry.stat().st_mtime
                    if entry_mtime > cutoff_time:
                        continue  # Skip recent entries

                # Calculate size before removal
                if entry.is_dir():
                    entry_size = sum(f.stat().st_size for f in entry.rglob('*') if f.is_file())
                    shutil.rmtree(entry)
                    stats['dirs_removed'] += 1
                else:
                    entry_size = entry.stat().st_size
                    entry.unlink()
                    stats['files_removed'] += 1

                stats['bytes_freed'] += entry_size
            except (OSError, PermissionError) as e:
                self._logger.warning(f"Failed to remove cache entry {entry}: {e}")
                stats['errors'] += 1

    except OSError as e:
        self._logger.error(f"Failed to iterate cache directory: {e}")
        stats['errors'] += 1

    return stats

clear_vocabulary_cache

clear_vocabulary_cache(
    table: str | Table | None = None,
) -> None

Clear the vocabulary term cache.

Parameters:

- `table` (str | Table | None, default None): If provided, only clear the cache for this specific vocabulary table. If None, clear the entire cache.
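As the source shows, cache entries are keyed by (schema name, table name). A minimal stand-alone sketch of the two clearing modes, using hypothetical schema and table names:

```python
# Stand-in for the internal vocabulary cache, keyed by (schema, table).
vocab_cache = {("my_schema", "Diagnosis_Type"): ["Benign", "Malignant"]}

# Equivalent of ml.clear_vocabulary_cache("Diagnosis_Type"):
vocab_cache.pop(("my_schema", "Diagnosis_Type"), None)

# Equivalent of ml.clear_vocabulary_cache() with no argument:
vocab_cache.clear()
print(len(vocab_cache))  # 0
```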
Source code in src/deriva_ml/core/mixins/vocabulary.py
def clear_vocabulary_cache(self, table: str | Table | None = None) -> None:
    """Clear the vocabulary term cache.

    Args:
        table: If provided, only clear cache for this specific vocabulary table.
               If None, clear the entire cache.
    """
    cache = self._get_vocab_cache()
    if table is None:
        cache.clear()
    else:
        vocab_table = self.model.name_to_table(table)
        cache_key = (vocab_table.schema.name, vocab_table.name)
        cache.pop(cache_key, None)

create_asset

create_asset(
    asset_name: str,
    column_defs: Iterable[ColumnDefinition] | None = None,
    fkey_defs: Iterable[ColumnDefinition] | None = None,
    referenced_tables: Iterable[Table] | None = None,
    comment: str = "",
    schema: str | None = None,
    update_navbar: bool = True,
) -> Table

Creates an asset table.

Parameters:

- `asset_name` (str, required): Name of the asset table.
- `column_defs` (Iterable[ColumnDefinition] | None, default None): Iterable of ColumnDefinition objects providing additional metadata for the asset.
- `fkey_defs` (Iterable[ColumnDefinition] | None, default None): Iterable of ForeignKeyDefinition objects providing additional metadata for the asset.
- `referenced_tables` (Iterable[Table] | None, default None): Iterable of Table objects to which the asset should provide foreign-key references.
- `comment` (str, default ''): Description of the asset table.
- `schema` (str | None, default None): Schema in which to create the asset table. Defaults to domain_schema.
- `update_navbar` (bool, default True): If True, automatically updates the navigation bar to include the new asset table. Set to False during batch asset creation to avoid redundant updates, then call apply_catalog_annotations() once at the end.

Returns:

- `Table`: Table object for the asset table.
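The update_navbar parameter suggests a batch-creation pattern; a hedged sketch follows (the asset names are hypothetical, and the calls assume a connected DerivaML instance, so they are commented out):

```python
asset_names = ["Image", "Segmentation_Mask", "Model_Weights"]  # hypothetical

# Create each asset table without touching the navbar, then refresh once:
# for name in asset_names:
#     ml.create_asset(name, comment=f"{name} files", update_navbar=False)
# ml.apply_catalog_annotations()  # single navbar update at the end
print(len(asset_names))  # 3
```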

Source code in src/deriva_ml/core/mixins/asset.py
def create_asset(
    self,
    asset_name: str,
    column_defs: Iterable[ColumnDefinition] | None = None,
    fkey_defs: Iterable[ColumnDefinition] | None = None,
    referenced_tables: Iterable[Table] | None = None,
    comment: str = "",
    schema: str | None = None,
    update_navbar: bool = True,
) -> Table:
    """Creates an asset table.

    Args:
        asset_name: Name of the asset table.
        column_defs: Iterable of ColumnDefinition objects to provide additional metadata for asset.
        fkey_defs: Iterable of ForeignKeyDefinition objects to provide additional metadata for asset.
        referenced_tables: Iterable of Table objects to which asset should provide foreign-key references to.
        comment: Description of the asset table. (Default value = '')
        schema: Schema in which to create the asset table.  Defaults to domain_schema.
        update_navbar: If True (default), automatically updates the navigation bar to include
            the new asset table. Set to False during batch asset creation to avoid redundant
            updates, then call apply_catalog_annotations() once at the end.

    Returns:
        Table object for the asset table.
    """
    # Initialize empty collections if None provided
    column_defs = column_defs or []
    fkey_defs = fkey_defs or []
    referenced_tables = referenced_tables or []
    schema = schema or self.model._require_default_schema()

    # Add an asset type to vocabulary
    self.add_term(MLVocab.asset_type, asset_name, description=f"A {asset_name} asset")

    # Create the main asset table
    # Note: column_defs and fkey_defs should be ColumnDef/ForeignKeyDef objects
    asset_table = self.model.schemas[schema].create_table(
        AssetTableDef(
            schema_name=schema,
            name=asset_name,
            columns=list(column_defs),
            foreign_keys=list(fkey_defs),
            comment=comment,
        )
    )

    # Create an association table between asset and asset type
    self.model.create_table(
        self.model._define_association(
            associates=[
                (asset_table.name, asset_table),
                ("Asset_Type", self.model.name_to_table("Asset_Type")),
            ],
        ),
        schema=schema,
    )

    # Create references to other tables if specified
    for t in referenced_tables:
        asset_table.create_reference(self.model.name_to_table(t))

    # Create an association table for tracking execution
    atable = self.model.create_table(
        self.model._define_association(
            associates=[
                (asset_name, asset_table),
                (
                    "Execution",
                    self.model.schemas[self.ml_schema].tables["Execution"],
                ),
            ],
        ),
        schema=schema,
    )
    atable.create_reference(self.model.name_to_table("Asset_Role"))

    # Add asset annotations
    asset_annotation(asset_table)

    # Update navbar to include the new asset table
    if update_navbar:
        self.apply_catalog_annotations()

    return asset_table

create_execution

create_execution(
    configuration: ExecutionConfiguration,
    workflow: "Workflow | RID | None" = None,
    dry_run: bool = False,
) -> "Execution"

Create an execution environment.

Initializes a local compute environment for executing an ML or analytic routine. This has several side effects:

  1. Downloads datasets specified in the configuration to the cache directory. If no version is specified, creates a new minor version for the dataset.
  2. Downloads any execution assets to the working directory.
  3. Creates an execution record in the catalog (unless dry_run=True).

Parameters:

Name Type Description Default
configuration ExecutionConfiguration

ExecutionConfiguration specifying execution parameters.

required
workflow 'Workflow | RID | None'

Optional Workflow object or RID if not present in configuration.

None
dry_run bool

If True, skip creating catalog records and uploading results.

False

Returns:

Name Type Description
Execution 'Execution'

An execution object for managing the execution lifecycle.

Example

>>> config = ExecutionConfiguration(
...     workflow=workflow,
...     description="Process samples",
...     datasets=[DatasetSpec(rid="4HM")],
... )
>>> with ml.create_execution(config) as execution:
...     # Run analysis
...     pass
>>> execution.upload_execution_outputs()

Source code in src/deriva_ml/core/mixins/execution.py
def create_execution(
    self, configuration: ExecutionConfiguration, workflow: "Workflow | RID | None" = None, dry_run: bool = False
) -> "Execution":
    """Create an execution environment.

    Initializes a local compute environment for executing an ML or analytic routine.
    This has several side effects:

    1. Downloads datasets specified in the configuration to the cache directory.
       If no version is specified, creates a new minor version for the dataset.
    2. Downloads any execution assets to the working directory.
    3. Creates an execution record in the catalog (unless dry_run=True).

    Args:
        configuration: ExecutionConfiguration specifying execution parameters.
        workflow: Optional Workflow object or RID if not present in configuration.
        dry_run: If True, skip creating catalog records and uploading results.

    Returns:
        Execution: An execution object for managing the execution lifecycle.

    Example:
        >>> config = ExecutionConfiguration(
        ...     workflow=workflow,
        ...     description="Process samples",
        ...     datasets=[DatasetSpec(rid="4HM")],
        ... )
        >>> with ml.create_execution(config) as execution:
        ...     # Run analysis
        ...     pass
        >>> execution.upload_execution_outputs()
    """
    # Import here to avoid circular dependency
    from deriva_ml.execution.execution import Execution

    # Create and store an execution instance
    self._execution = Execution(configuration, self, workflow=workflow, dry_run=dry_run)  # type: ignore[arg-type]
    return self._execution
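The three side effects listed above happen in a fixed order: dataset downloads, asset downloads, then catalog record creation (skipped for dry runs). As a rough, hypothetical sketch of that ordering — none of these helper names are part of the DerivaML API, and the real logic lives in the Execution class:

```python
def prepare_execution(configuration: dict, dry_run: bool = False) -> list[tuple]:
    """Hypothetical sketch of the create_execution side-effect order."""
    steps = []
    for spec in configuration.get("datasets", []):
        # 1. Datasets go to the cache; an unpinned dataset gets a new minor version.
        version = spec.get("version") or "new-minor-version"
        steps.append(("download_dataset", spec["rid"], version))
    for asset in configuration.get("assets", []):
        # 2. Execution assets go to the working directory.
        steps.append(("download_asset", asset))
    if not dry_run:
        # 3. The catalog execution record is only created for real runs.
        steps.append(("create_execution_record",))
    return steps

plan = prepare_execution({"datasets": [{"rid": "4HM"}], "assets": []}, dry_run=True)
```

With `dry_run=True` the plan contains only the dataset download step, which is why dry runs leave the catalog untouched.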

create_feature

create_feature(
    target_table: Table | str,
    feature_name: str,
    terms: list[Table | str]
    | None = None,
    assets: list[Table | str]
    | None = None,
    metadata: list[
        ColumnDefinition
        | Table
        | Key
        | str
    ]
    | None = None,
    optional: list[str] | None = None,
    comment: str = "",
    update_navbar: bool = True,
) -> type[FeatureRecord]

Creates a new feature definition.

A feature represents a measurable property or characteristic that can be associated with records in the target table. Features can include vocabulary terms, asset references, and additional metadata.

Side Effects: This method dynamically creates:

  1. A new association table in the domain schema to store feature values
  2. A Pydantic model class (subclass of FeatureRecord) for creating validated feature instances

The returned Pydantic model class provides type-safe construction of feature records with automatic validation of values against the feature's definition (vocabulary terms, asset references, etc.). Use this class to create feature instances that can be inserted into the catalog.

Parameters:

Name Type Description Default
target_table Table | str

Table to associate the feature with (name or Table object).

required
feature_name str

Unique name for the feature within the target table.

required
terms list[Table | str] | None

Optional vocabulary tables/names whose terms can be used as feature values.

None
assets list[Table | str] | None

Optional asset tables/names that can be referenced by this feature.

None
metadata list[ColumnDefinition | Table | Key | str] | None

Optional columns, tables, or keys to include in a feature definition.

None
optional list[str] | None

Column names that are not required when creating feature instances.

None
comment str

Description of the feature's purpose and usage.

''
update_navbar bool

If True (default), automatically updates the navigation bar to include the new feature table. Set to False during batch feature creation to avoid redundant updates, then call apply_catalog_annotations() once at the end.

True

Returns:

Type Description
type[FeatureRecord]

type[FeatureRecord]: A dynamically generated Pydantic model class for creating validated feature instances. The class has fields corresponding to the feature's terms, assets, and metadata columns.

Raises:

Type Description
DerivaMLException

If a feature definition is invalid or conflicts with existing features.

Examples:

Create a feature with confidence score:

>>> DiagnosisFeature = ml.create_feature(
...     target_table="Image",
...     feature_name="Diagnosis",
...     terms=["Diagnosis_Type"],
...     metadata=[ColumnDefinition(name="confidence", type=BuiltinTypes.float4)],
...     comment="Clinical diagnosis label"
... )
>>> # Use the returned class to create validated feature instances
>>> record = DiagnosisFeature(
...     Image="1-ABC",  # Target record RID
...     Diagnosis_Type="Normal",  # Vocabulary term
...     confidence=0.95,
...     Execution="2-XYZ"  # Execution that produced this value
... )

Source code in src/deriva_ml/core/mixins/feature.py
def create_feature(
    self,
    target_table: Table | str,
    feature_name: str,
    terms: list[Table | str] | None = None,
    assets: list[Table | str] | None = None,
    metadata: list[ColumnDefinition | Table | Key | str] | None = None,
    optional: list[str] | None = None,
    comment: str = "",
    update_navbar: bool = True,
) -> type[FeatureRecord]:
    """Creates a new feature definition.

    A feature represents a measurable property or characteristic that can be associated with records in the target
    table. Features can include vocabulary terms, asset references, and additional metadata.

    **Side Effects**:
    This method dynamically creates:
    1. A new association table in the domain schema to store feature values
    2. A Pydantic model class (subclass of FeatureRecord) for creating validated feature instances

    The returned Pydantic model class provides type-safe construction of feature records with
    automatic validation of values against the feature's definition (vocabulary terms, asset
    references, etc.). Use this class to create feature instances that can be inserted into
    the catalog.

    Args:
        target_table: Table to associate the feature with (name or Table object).
        feature_name: Unique name for the feature within the target table.
        terms: Optional vocabulary tables/names whose terms can be used as feature values.
        assets: Optional asset tables/names that can be referenced by this feature.
        metadata: Optional columns, tables, or keys to include in a feature definition.
        optional: Column names that are not required when creating feature instances.
        comment: Description of the feature's purpose and usage.
        update_navbar: If True (default), automatically updates the navigation bar to include
            the new feature table. Set to False during batch feature creation to avoid
            redundant updates, then call apply_catalog_annotations() once at the end.

    Returns:
        type[FeatureRecord]: A dynamically generated Pydantic model class for creating
            validated feature instances. The class has fields corresponding to the feature's
            terms, assets, and metadata columns.

    Raises:
        DerivaMLException: If a feature definition is invalid or conflicts with existing features.

    Examples:
        Create a feature with confidence score:
            >>> DiagnosisFeature = ml.create_feature(
            ...     target_table="Image",
            ...     feature_name="Diagnosis",
            ...     terms=["Diagnosis_Type"],
            ...     metadata=[ColumnDefinition(name="confidence", type=BuiltinTypes.float4)],
            ...     comment="Clinical diagnosis label"
            ... )
            >>> # Use the returned class to create validated feature instances
            >>> record = DiagnosisFeature(
            ...     Image="1-ABC",  # Target record RID
            ...     Diagnosis_Type="Normal",  # Vocabulary term
            ...     confidence=0.95,
            ...     Execution="2-XYZ"  # Execution that produced this value
            ... )
    """
    # Initialize empty collections if None provided
    terms = terms or []
    assets = assets or []
    metadata = metadata or []
    optional = optional or []

    def normalize_metadata(m: Key | Table | ColumnDefinition | str | dict) -> Key | Table | dict:
        """Helper function to normalize metadata references.

        Handles:
        - str: Table name, converted to Table object
        - ColumnDefinition: Dataclass with to_dict() method
        - dict: Already in dict format (from Column.define())
        - Key/Table: Passed through unchanged
        """
        if isinstance(m, str):
            return self.model.name_to_table(m)
        elif isinstance(m, dict):
            # Already a dict (e.g., from Column.define())
            return m
        elif hasattr(m, 'to_dict'):
            # ColumnDefinition or similar dataclass
            return m.to_dict()
        else:
            return m

    # Validate asset and term tables
    if not all(map(self.model.is_asset, assets)):
        raise DerivaMLException("Invalid create_feature asset table.")
    if not all(map(self.model.is_vocabulary, terms)):
        raise DerivaMLException("Invalid create_feature term table.")

    # Get references to required tables
    target_table = self.model.name_to_table(target_table)
    execution = self.model.schemas[self.ml_schema].tables["Execution"]
    feature_name_table = self.model.schemas[self.ml_schema].tables["Feature_Name"]

    # Add feature name to vocabulary
    feature_name_term = self.add_term("Feature_Name", feature_name, description=comment)
    atable_name = f"Execution_{target_table.name}_{feature_name_term.name}"
    # Create an association table implementing the feature
    atable = self.model.create_table(
        self.model._define_association(
            table_name=atable_name,
            associates=[execution, target_table, feature_name_table],
            metadata=[normalize_metadata(m) for m in chain(assets, terms, metadata)],
            comment=comment,
        )
    )
    # Configure optional columns and default feature name
    for c in optional:
        atable.columns[c].alter(nullok=True)
    atable.columns["Feature_Name"].alter(default=feature_name_term.name)

    # Update navbar to include the new feature table
    if update_navbar:
        self.apply_catalog_annotations()

    # Return feature record class for creating instances
    return self.feature_record_class(target_table, feature_name)
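The `normalize_metadata` helper above dispatches purely on the shape of each metadata item. The same dispatch can be sketched standalone — here `ToDict` is a stand-in for a dataclass such as `ColumnDefinition`, and the `name_to_table` callable is a placeholder for the model lookup:

```python
class ToDict:
    """Stand-in for a dataclass (e.g. ColumnDefinition) exposing to_dict()."""
    def __init__(self, **fields):
        self.fields = fields
    def to_dict(self):
        return dict(self.fields)

def normalize_metadata(m, name_to_table=lambda name: ("table", name)):
    # str -> resolved table; dict -> unchanged; to_dict() -> dict; else pass through.
    if isinstance(m, str):
        return name_to_table(m)
    if isinstance(m, dict):
        return m
    if hasattr(m, "to_dict"):
        return m.to_dict()
    return m
```

Note the order matters: the `dict` check must precede the `to_dict` duck-type check, so plain dicts from `Column.define()` pass through untouched.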

create_table

create_table(
    table: TableDefinition,
    schema: str | None = None,
    update_navbar: bool = True,
) -> Table

Creates a new table in the domain schema.

Creates a table using the provided TableDefinition object, which specifies the table structure including columns, keys, and foreign key relationships. The table is created in the domain schema associated with this DerivaML instance.

Required Classes: Import the following classes from deriva_ml to define tables:

  • TableDefinition: Defines the complete table structure
  • ColumnDefinition: Defines individual columns with types and constraints
  • KeyDefinition: Defines unique key constraints (optional)
  • ForeignKeyDefinition: Defines foreign key relationships to other tables (optional)
  • BuiltinTypes: Enum of available column data types

Available Column Types (BuiltinTypes enum): text, int2, int4, int8, float4, float8, boolean, date, timestamp, timestamptz, json, jsonb, markdown, ermrest_uri, ermrest_rid, ermrest_rcb, ermrest_rmb, ermrest_rct, ermrest_rmt

Parameters:

Name Type Description Default
table TableDefinition

A TableDefinition object containing the complete specification of the table to create.

required
schema str | None

Schema name to create the table in. If None, uses the default domain schema.

None
update_navbar bool

If True (default), automatically updates the navigation bar to include the new table. Set to False during batch table creation to avoid redundant updates, then call apply_catalog_annotations() once at the end.

True

Returns:

Name Type Description
Table Table

The newly created ERMRest table object.

Raises:

Type Description
DerivaMLException

If table creation fails or the definition is invalid.

Examples:

Simple table with basic columns:

>>> from deriva_ml import TableDefinition, ColumnDefinition, BuiltinTypes
>>>
>>> table_def = TableDefinition(
...     name="Experiment",
...     column_defs=[
...         ColumnDefinition(name="Name", type=BuiltinTypes.text, nullok=False),
...         ColumnDefinition(name="Date", type=BuiltinTypes.date),
...         ColumnDefinition(name="Description", type=BuiltinTypes.markdown),
...         ColumnDefinition(name="Score", type=BuiltinTypes.float4),
...     ],
...     comment="Records of experimental runs"
... )
>>> experiment_table = ml.create_table(table_def)

Table with foreign key to another table:

>>> from deriva_ml import (
...     TableDefinition, ColumnDefinition, ForeignKeyDefinition, BuiltinTypes
... )
>>>
>>> # Create a Sample table that references Subject
>>> sample_def = TableDefinition(
...     name="Sample",
...     column_defs=[
...         ColumnDefinition(name="Name", type=BuiltinTypes.text, nullok=False),
...         ColumnDefinition(name="Subject", type=BuiltinTypes.text, nullok=False),
...         ColumnDefinition(name="Collection_Date", type=BuiltinTypes.date),
...     ],
...     fkey_defs=[
...         ForeignKeyDefinition(
...             colnames=["Subject"],
...             pk_sname=ml.default_schema,  # Schema of referenced table
...             pk_tname="Subject",          # Name of referenced table
...             pk_colnames=["RID"],         # Column(s) in referenced table
...             on_delete="CASCADE",         # Delete samples when subject deleted
...         )
...     ],
...     comment="Biological samples collected from subjects"
... )
>>> sample_table = ml.create_table(sample_def)

Table with unique key constraint:

>>> from deriva_ml import (
...     TableDefinition, ColumnDefinition, KeyDefinition, BuiltinTypes
... )
>>>
>>> protocol_def = TableDefinition(
...     name="Protocol",
...     column_defs=[
...         ColumnDefinition(name="Name", type=BuiltinTypes.text, nullok=False),
...         ColumnDefinition(name="Version", type=BuiltinTypes.text, nullok=False),
...         ColumnDefinition(name="Description", type=BuiltinTypes.markdown),
...     ],
...     key_defs=[
...         KeyDefinition(
...             colnames=["Name", "Version"],
...             constraint_names=[["myschema", "Protocol_Name_Version_key"]],
...             comment="Each protocol name+version must be unique"
...         )
...     ],
...     comment="Experimental protocols with versioning"
... )
>>> protocol_table = ml.create_table(protocol_def)

Batch creation without navbar updates:

>>> ml.create_table(table1_def, update_navbar=False)
>>> ml.create_table(table2_def, update_navbar=False)
>>> ml.create_table(table3_def, update_navbar=False)
>>> ml.apply_catalog_annotations()  # Update navbar once at the end
Source code in src/deriva_ml/core/base.py
def create_table(self, table: TableDefinition, schema: str | None = None, update_navbar: bool = True) -> Table:
    """Creates a new table in the domain schema.

    Creates a table using the provided TableDefinition object, which specifies the table structure
    including columns, keys, and foreign key relationships. The table is created in the domain
    schema associated with this DerivaML instance.

    **Required Classes**:
    Import the following classes from deriva_ml to define tables:

    - ``TableDefinition``: Defines the complete table structure
    - ``ColumnDefinition``: Defines individual columns with types and constraints
    - ``KeyDefinition``: Defines unique key constraints (optional)
    - ``ForeignKeyDefinition``: Defines foreign key relationships to other tables (optional)
    - ``BuiltinTypes``: Enum of available column data types

    **Available Column Types** (BuiltinTypes enum):
    ``text``, ``int2``, ``int4``, ``int8``, ``float4``, ``float8``, ``boolean``,
    ``date``, ``timestamp``, ``timestamptz``, ``json``, ``jsonb``, ``markdown``,
    ``ermrest_uri``, ``ermrest_rid``, ``ermrest_rcb``, ``ermrest_rmb``,
    ``ermrest_rct``, ``ermrest_rmt``

    Args:
        table: A TableDefinition object containing the complete specification of the table to create.
        schema: Schema name to create the table in. If None, uses the default domain schema.
        update_navbar: If True (default), automatically updates the navigation bar to include
            the new table. Set to False during batch table creation to avoid redundant updates,
            then call apply_catalog_annotations() once at the end.

    Returns:
        Table: The newly created ERMRest table object.

    Raises:
        DerivaMLException: If table creation fails or the definition is invalid.

    Examples:
        **Simple table with basic columns**:

            >>> from deriva_ml import TableDefinition, ColumnDefinition, BuiltinTypes
            >>>
            >>> table_def = TableDefinition(
            ...     name="Experiment",
            ...     column_defs=[
            ...         ColumnDefinition(name="Name", type=BuiltinTypes.text, nullok=False),
            ...         ColumnDefinition(name="Date", type=BuiltinTypes.date),
            ...         ColumnDefinition(name="Description", type=BuiltinTypes.markdown),
            ...         ColumnDefinition(name="Score", type=BuiltinTypes.float4),
            ...     ],
            ...     comment="Records of experimental runs"
            ... )
            >>> experiment_table = ml.create_table(table_def)

        **Table with foreign key to another table**:

            >>> from deriva_ml import (
            ...     TableDefinition, ColumnDefinition, ForeignKeyDefinition, BuiltinTypes
            ... )
            >>>
            >>> # Create a Sample table that references Subject
            >>> sample_def = TableDefinition(
            ...     name="Sample",
            ...     column_defs=[
            ...         ColumnDefinition(name="Name", type=BuiltinTypes.text, nullok=False),
            ...         ColumnDefinition(name="Subject", type=BuiltinTypes.text, nullok=False),
            ...         ColumnDefinition(name="Collection_Date", type=BuiltinTypes.date),
            ...     ],
            ...     fkey_defs=[
            ...         ForeignKeyDefinition(
            ...             colnames=["Subject"],
            ...             pk_sname=ml.default_schema,  # Schema of referenced table
            ...             pk_tname="Subject",          # Name of referenced table
            ...             pk_colnames=["RID"],         # Column(s) in referenced table
            ...             on_delete="CASCADE",         # Delete samples when subject deleted
            ...         )
            ...     ],
            ...     comment="Biological samples collected from subjects"
            ... )
            >>> sample_table = ml.create_table(sample_def)

        **Table with unique key constraint**:

            >>> from deriva_ml import (
            ...     TableDefinition, ColumnDefinition, KeyDefinition, BuiltinTypes
            ... )
            >>>
            >>> protocol_def = TableDefinition(
            ...     name="Protocol",
            ...     column_defs=[
            ...         ColumnDefinition(name="Name", type=BuiltinTypes.text, nullok=False),
            ...         ColumnDefinition(name="Version", type=BuiltinTypes.text, nullok=False),
            ...         ColumnDefinition(name="Description", type=BuiltinTypes.markdown),
            ...     ],
            ...     key_defs=[
            ...         KeyDefinition(
            ...             colnames=["Name", "Version"],
            ...             constraint_names=[["myschema", "Protocol_Name_Version_key"]],
            ...             comment="Each protocol name+version must be unique"
            ...         )
            ...     ],
            ...     comment="Experimental protocols with versioning"
            ... )
            >>> protocol_table = ml.create_table(protocol_def)

        **Batch creation without navbar updates**:

            >>> ml.create_table(table1_def, update_navbar=False)
            >>> ml.create_table(table2_def, update_navbar=False)
            >>> ml.create_table(table3_def, update_navbar=False)
            >>> ml.apply_catalog_annotations()  # Update navbar once at the end
    """
    # Use default schema if none specified
    schema = schema or self.model._require_default_schema()

    # Create table in domain schema using provided definition
    # Handle both TableDefinition (dataclass with to_dict) and plain dicts
    table_dict = table.to_dict() if hasattr(table, 'to_dict') else table
    new_table = self.model.schemas[schema].create_table(table_dict)

    # Update navbar to include the new table
    if update_navbar:
        self.apply_catalog_annotations()

    return new_table
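As the source shows, `create_table` accepts either a TableDefinition-style object or a plain dict (such as one produced by `define_association`). The duck-typing is small enough to sketch on its own — `FakeDef` here is a hypothetical stand-in, not a DerivaML class:

```python
def as_table_dict(table):
    # TableDefinition-style objects expose to_dict(); plain dicts pass through.
    return table.to_dict() if hasattr(table, "to_dict") else table

class FakeDef:
    """Hypothetical stand-in for a TableDefinition dataclass."""
    def to_dict(self):
        return {"table_name": "Experiment"}
```

This is why the association-table examples can pass the dict from `define_association()` straight into `create_table()`.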

create_vocabulary

create_vocabulary(
    vocab_name: str,
    comment: str = "",
    schema: str | None = None,
    update_navbar: bool = True,
) -> Table

Creates a controlled vocabulary table.

A controlled vocabulary table maintains a list of standardized terms and their definitions. Each term can have synonyms and descriptions to ensure consistent terminology usage across the dataset.

Parameters:

Name Type Description Default
vocab_name str

Name for the new vocabulary table. Must be a valid SQL identifier.

required
comment str

Description of the vocabulary's purpose and usage. Defaults to empty string.

''
schema str | None

Schema name to create the table in. If None, uses domain_schema.

None
update_navbar bool

If True (default), automatically updates the navigation bar to include the new vocabulary table. Set to False during batch table creation to avoid redundant updates, then call apply_catalog_annotations() once at the end.

True

Returns:

Name Type Description
Table Table

ERMRest table object representing the newly created vocabulary table.

Raises:

Type Description
DerivaMLException

If vocab_name is invalid or already exists.

Examples:

Create a vocabulary for tissue types:

>>> table = ml.create_vocabulary(
...     vocab_name="tissue_types",
...     comment="Standard tissue classifications",
...     schema="bio_schema"
... )

Create multiple vocabularies without updating navbar until the end:

>>> ml.create_vocabulary("Species", update_navbar=False)
>>> ml.create_vocabulary("Tissue_Type", update_navbar=False)
>>> ml.apply_catalog_annotations()  # Update navbar once
Source code in src/deriva_ml/core/base.py
def create_vocabulary(
    self, vocab_name: str, comment: str = "", schema: str | None = None, update_navbar: bool = True
) -> Table:
    """Creates a controlled vocabulary table.

    A controlled vocabulary table maintains a list of standardized terms and their definitions. Each term can have
    synonyms and descriptions to ensure consistent terminology usage across the dataset.

    Args:
        vocab_name: Name for the new vocabulary table. Must be a valid SQL identifier.
        comment: Description of the vocabulary's purpose and usage. Defaults to empty string.
        schema: Schema name to create the table in. If None, uses domain_schema.
        update_navbar: If True (default), automatically updates the navigation bar to include
            the new vocabulary table. Set to False during batch table creation to avoid
            redundant updates, then call apply_catalog_annotations() once at the end.

    Returns:
        Table: ERMRest table object representing the newly created vocabulary table.

    Raises:
        DerivaMLException: If vocab_name is invalid or already exists.

    Examples:
        Create a vocabulary for tissue types:

            >>> table = ml.create_vocabulary(
            ...     vocab_name="tissue_types",
            ...     comment="Standard tissue classifications",
            ...     schema="bio_schema"
            ... )

        Create multiple vocabularies without updating navbar until the end:

            >>> ml.create_vocabulary("Species", update_navbar=False)
            >>> ml.create_vocabulary("Tissue_Type", update_navbar=False)
            >>> ml.apply_catalog_annotations()  # Update navbar once
    """
    # Use default schema if none specified
    schema = schema or self.model._require_default_schema()

    # Create and return vocabulary table with RID-based URI pattern
    try:
        vocab_table = self.model.schemas[schema].create_table(
            VocabularyTableDef(
                name=vocab_name,
                curie_template=f"{self.project_name}:{{RID}}",
                comment=comment,
            )
        )
    except ValueError:
        raise DerivaMLException(f"Table {vocab_name} already exists")

    # Update navbar to include the new vocabulary table
    if update_navbar:
        self.apply_catalog_annotations()

    return vocab_table

create_workflow

create_workflow(
    name: str,
    workflow_type: str | list[str],
    description: str = "",
) -> Workflow

Creates a new workflow definition.

Creates a Workflow object that represents a computational process or analysis pipeline. The workflow type(s) must be terms from the controlled vocabulary. This method is typically used to define new analysis workflows before execution.

Parameters:

Name Type Description Default
name str

Name of the workflow.

required
workflow_type str | list[str]

Type(s) of workflow (must exist in workflow_type vocabulary). Can be a single string or a list of strings.

required
description str

Description of what the workflow does.

''

Returns:

Name Type Description
Workflow Workflow

New workflow object ready for registration.

Raises:

Type Description
DerivaMLException

If any workflow_type is not in the vocabulary.

Examples:

>>> workflow = ml.create_workflow(
...     name="RNA Analysis",
...     workflow_type="python_notebook",
...     description="RNA sequence analysis pipeline"
... )
>>> rid = ml.add_workflow(workflow)

Multiple types:

>>> workflow = ml.create_workflow(
...     name="Training Pipeline",
...     workflow_type=["Training", "Embedding"],
...     description="Combined training and embedding pipeline"
... )
Source code in src/deriva_ml/core/mixins/workflow.py
def create_workflow(self, name: str, workflow_type: str | list[str], description: str = "") -> Workflow:
    """Creates a new workflow definition.

    Creates a Workflow object that represents a computational process or analysis pipeline. The workflow type(s)
    must be terms from the controlled vocabulary. This method is typically used to define new analysis
    workflows before execution.

    Args:
        name: Name of the workflow.
        workflow_type: Type(s) of workflow (must exist in workflow_type vocabulary).
            Can be a single string or a list of strings.
        description: Description of what the workflow does.

    Returns:
        Workflow: New workflow object ready for registration.

    Raises:
        DerivaMLException: If any workflow_type is not in the vocabulary.

    Examples:
        >>> workflow = ml.create_workflow(
        ...     name="RNA Analysis",
        ...     workflow_type="python_notebook",
        ...     description="RNA sequence analysis pipeline"
        ... )
        >>> rid = ml.add_workflow(workflow)

        Multiple types::

            >>> workflow = ml.create_workflow(
            ...     name="Training Pipeline",
            ...     workflow_type=["Training", "Embedding"],
            ...     description="Combined training and embedding pipeline"
            ... )
    """
    # Normalize to list and validate each type exists in vocabulary
    types = [workflow_type] if isinstance(workflow_type, str) else workflow_type
    for wt in types:
        self.lookup_term(MLVocab.workflow_type, wt)

    # Create and return a new workflow object
    return Workflow(name=name, workflow_type=workflow_type, description=description)

define_association

define_association(
    associates: list,
    metadata: list | None = None,
    table_name: str | None = None,
    comment: str | None = None,
    **kwargs,
) -> dict

Build an association table definition with vocab-aware key selection.

Creates a table definition that links two or more tables via an association (many-to-many) table. Non-vocabulary tables automatically use RID as the foreign key target, while vocabulary tables use their Name key.

Use with create_table() to create the association table in the catalog.

Parameters:

Name Type Description Default
associates list

Tables to associate. Each item can be:

  • A Table object
  • A (name, Table) tuple to customize the column name
  • A (name, nullok, Table) tuple for nullable references
  • A Key object for explicit key selection

required
metadata list | None

Additional metadata columns or reference targets.

None
table_name str | None

Name for the association table. Auto-generated if omitted.

None
comment str | None

Comment for the association table.

None
**kwargs

Additional arguments passed to Table.define_association.

{}

Returns:

Type Description
dict

Table definition dict suitable for create_table().

Example:

# Associate Image with Subject (many-to-many)
image_table = ml.model.name_to_table("Image")
subject_table = ml.model.name_to_table("Subject")
assoc_def = ml.define_association(
    associates=[image_table, subject_table],
    comment="Links images to subjects",
)
ml.create_table(assoc_def)
Source code in src/deriva_ml/core/base.py
def define_association(
    self,
    associates: list,
    metadata: list | None = None,
    table_name: str | None = None,
    comment: str | None = None,
    **kwargs,
) -> dict:
    """Build an association table definition with vocab-aware key selection.

    Creates a table definition that links two or more tables via an association
    (many-to-many) table. Non-vocabulary tables automatically use RID as the
    foreign key target, while vocabulary tables use their Name key.

    Use with ``create_table()`` to create the association table in the catalog.

    Args:
        associates: Tables to associate. Each item can be:
            - A Table object
            - A (name, Table) tuple to customize the column name
            - A (name, nullok, Table) tuple for nullable references
            - A Key object for explicit key selection
        metadata: Additional metadata columns or reference targets.
        table_name: Name for the association table. Auto-generated if omitted.
        comment: Comment for the association table.
        **kwargs: Additional arguments passed to Table.define_association.

    Returns:
        Table definition dict suitable for ``create_table()``.

    Example::

        # Associate Image with Subject (many-to-many)
        image_table = ml.model.name_to_table("Image")
        subject_table = ml.model.name_to_table("Subject")
        assoc_def = ml.define_association(
            associates=[image_table, subject_table],
            comment="Links images to subjects",
        )
        ml.create_table(assoc_def)
    """
    return self.model._define_association(
        associates=associates,
        metadata=metadata,
        table_name=table_name,
        comment=comment,
        **kwargs,
    )

delete_dataset

delete_dataset(
    dataset: "Dataset",
    recurse: bool = False,
) -> None

Soft-delete a dataset by marking it as deleted in the catalog.

Sets the Deleted flag on the dataset record. The dataset's data is preserved but it will no longer appear in normal queries (e.g., find_datasets()). The dataset cannot be deleted if it is currently nested inside a parent dataset.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `dataset` | `Dataset` | The dataset to delete. | *required* |
| `recurse` | `bool` | If True, also soft-delete all nested child datasets. If False (default), only this dataset is marked as deleted. | `False` |

Raises:

| Type | Description |
| --- | --- |
| `DerivaMLException` | If the dataset RID is not a valid dataset, or if the dataset is nested inside a parent dataset. |

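To make the `recurse` semantics concrete, here is a minimal sketch (a hypothetical helper, not part of the API) of which RIDs end up with the `Deleted` flag set — the dataset itself, plus its children only when `recurse=True`:

```python
# Hypothetical helper mirroring delete_dataset()'s RID selection.
# RIDs below are made up for illustration.
def rids_to_soft_delete(dataset_rid: str, child_rids: list[str], recurse: bool = False) -> list[str]:
    """Compute which RIDs would be marked Deleted."""
    return [dataset_rid] + (list(child_rids) if recurse else [])

print(rids_to_soft_delete("1-A1B2", ["1-C3D4", "1-E5F6"]))                # parent only
print(rids_to_soft_delete("1-A1B2", ["1-C3D4", "1-E5F6"], recurse=True))  # parent + children
```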
Source code in src/deriva_ml/core/mixins/dataset.py
def delete_dataset(self, dataset: "Dataset", recurse: bool = False) -> None:
    """Soft-delete a dataset by marking it as deleted in the catalog.

    Sets the ``Deleted`` flag on the dataset record. The dataset's data is
    preserved but it will no longer appear in normal queries (e.g.,
    ``find_datasets()``). The dataset cannot be deleted if it is currently
    nested inside a parent dataset.

    Args:
        dataset (Dataset): The dataset to delete.
        recurse (bool): If True, also soft-delete all nested child datasets.
            If False (default), only this dataset is marked as deleted.

    Raises:
        DerivaMLException: If the dataset RID is not a valid dataset, or if the
            dataset is nested inside a parent dataset.
    """
    # Get association table entries for this dataset_table
    # Delete association table entries
    dataset_rid = dataset.dataset_rid
    if not self.model.is_dataset_rid(dataset.dataset_rid):
        raise DerivaMLException("Dataset_rid is not a dataset.")

    if parents := dataset.list_dataset_parents():
        raise DerivaMLException(f'Dataset "{dataset}" is in a nested dataset: {parents}.')

    pb = self.pathBuilder()
    dataset_path = pb.schemas[self._dataset_table.schema.name].tables[self._dataset_table.name]

    # list_dataset_children returns Dataset objects, so extract their RIDs
    child_rids = [ds.dataset_rid for ds in dataset.list_dataset_children()] if recurse else []
    rid_list = [dataset_rid] + child_rids
    dataset_path.update([{"RID": r, "Deleted": True} for r in rid_list])

delete_feature

delete_feature(
    table: Table | str,
    feature_name: str,
) -> bool

Removes a feature definition and its data.

Deletes the feature and its implementation table from the catalog. This operation cannot be undone and will remove all feature values associated with this feature.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `table` | `Table \| str` | The table containing the feature, either as name or Table object. | *required* |
| `feature_name` | `str` | Name of the feature to delete. | *required* |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `bool` | `bool` | True if the feature was successfully deleted, False if it didn't exist. |

Raises:

| Type | Description |
| --- | --- |
| `DerivaMLException` | If deletion fails due to constraints or permissions. |

Example:

    success = ml.delete_feature("samples", "obsolete_feature")
    print("Deleted" if success else "Not found")

Source code in src/deriva_ml/core/mixins/feature.py
def delete_feature(self, table: Table | str, feature_name: str) -> bool:
    """Removes a feature definition and its data.

    Deletes the feature and its implementation table from the catalog. This operation cannot be undone and
    will remove all feature values associated with this feature.

    Args:
        table: The table containing the feature, either as name or Table object.
        feature_name: Name of the feature to delete.

    Returns:
        bool: True if the feature was successfully deleted, False if it didn't exist.

    Raises:
        DerivaMLException: If deletion fails due to constraints or permissions.

    Example:
        >>> success = ml.delete_feature("samples", "obsolete_feature")
        >>> print("Deleted" if success else "Not found")
    """
    # Get table reference and find feature
    table = self.model.name_to_table(table)
    try:
        # Find and delete the feature's implementation table
        feature = next(f for f in self.model.find_features(table) if f.feature_name == feature_name)
        feature.feature_table.drop()
        return True
    except StopIteration:
        return False

delete_term

delete_term(
    table: str | Table, term_name: str
) -> None

Delete a term from a vocabulary table.

Removes a term from the vocabulary. The term must not be in use by any records in the catalog (e.g., no datasets using this dataset type, no assets using this asset type).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `table` | `str \| Table` | Vocabulary table containing the term (name or Table object). | *required* |
| `term_name` | `str` | Primary name of the term to delete. | *required* |

Raises:

| Type | Description |
| --- | --- |
| `DerivaMLInvalidTerm` | If the term doesn't exist in the vocabulary. |
| `DerivaMLException` | If the term is currently in use by other records. |

Example

ml.delete_term("Dataset_Type", "Obsolete_Type")

Source code in src/deriva_ml/core/mixins/vocabulary.py
@validate_call(config=ConfigDict(arbitrary_types_allowed=True))
def delete_term(self, table: str | Table, term_name: str) -> None:
    """Delete a term from a vocabulary table.

    Removes a term from the vocabulary. The term must not be in use by any
    records in the catalog (e.g., no datasets using this dataset type, no
    assets using this asset type).

    Args:
        table: Vocabulary table containing the term (name or Table object).
        term_name: Primary name of the term to delete.

    Raises:
        DerivaMLInvalidTerm: If the term doesn't exist in the vocabulary.
        DerivaMLException: If the term is currently in use by other records.

    Example:
        >>> ml.delete_term("Dataset_Type", "Obsolete_Type")
    """
    # Look up the term (validates table and term existence)
    term = self.lookup_term(table, term_name)
    vocab_table = self.model.name_to_table(table)

    # Check if the term is in use by examining association tables
    associations = list(vocab_table.find_associations())
    pb = self.pathBuilder()

    for assoc in associations:
        assoc_path = pb.schemas[assoc.schema.name].tables[assoc.name]
        # Check if any rows reference this term
        count = len(list(assoc_path.filter(getattr(assoc_path, vocab_table.name) == term.name).entities().fetch()))
        if count > 0:
            raise DerivaMLException(
                f"Cannot delete term '{term_name}' from {vocab_table.name}: "
                f"it is referenced by {count} record(s) in {assoc.name}"
            )

    # No references found - safe to delete
    table_path = pb.schemas[vocab_table.schema.name].tables[vocab_table.name]
    table_path.filter(table_path.RID == term.rid).delete()

    # Invalidate cache
    self.clear_vocabulary_cache(table)

domain_path

domain_path(
    schema: str | None = None,
) -> datapath.DataPath

Returns path builder for a domain schema.

Provides a convenient way to access tables and construct queries within a domain-specific schema.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `schema` | `str \| None` | Schema name to get the path builder for. If None, uses default_schema. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `datapath.DataPath` | Path builder object scoped to the specified domain schema. |

Raises:

| Type | Description |
| --- | --- |
| `DerivaMLException` | If no schema is specified and default_schema is not set. |

Example:

    domain = ml.domain_path()  # Uses default schema
    results = domain.my_table.entities().fetch()
    # Or with explicit schema:
    domain = ml.domain_path("my_schema")

Source code in src/deriva_ml/core/mixins/path_builder.py
def domain_path(self, schema: str | None = None) -> datapath.DataPath:
    """Returns path builder for a domain schema.

    Provides a convenient way to access tables and construct queries within a domain-specific schema.

    Args:
        schema: Schema name to get path builder for. If None, uses default_schema.

    Returns:
        datapath._CatalogWrapper: Path builder object scoped to the specified domain schema.

    Raises:
        DerivaMLException: If no schema specified and default_schema is not set.

    Example:
        >>> domain = ml.domain_path()  # Uses default schema
        >>> results = domain.my_table.entities().fetch()
        >>> # Or with explicit schema:
        >>> domain = ml.domain_path("my_schema")
    """
    schema = schema or self.model._require_default_schema()
    return self.pathBuilder().schemas[schema]

download_dataset_bag

download_dataset_bag(
    dataset: DatasetSpec,
) -> "DatasetBag"

Downloads a dataset to the local filesystem.

Downloads a dataset specified by DatasetSpec to the local filesystem. If the catalog has s3_bucket configured and use_minid is enabled, the bag will be uploaded to S3 and registered with the MINID service.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `dataset` | `DatasetSpec` | Specification of the dataset to download, including version and materialization options. | *required* |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `DatasetBag` | `DatasetBag` | Object containing: `path`: local filesystem path to the downloaded dataset; `rid`: the dataset's Resource Identifier; `minid`: the dataset's Minimal Viable Identifier (if MINID is enabled). |

Note

MINID support requires s3_bucket to be configured when creating the DerivaML instance. The catalog's use_minid setting controls whether MINIDs are created.

Examples:

Download with default options:

    >>> spec = DatasetSpec(rid="1-abc123")
    >>> bag = ml.download_dataset_bag(dataset=spec)
    >>> print(f"Downloaded to {bag.path}")

Source code in src/deriva_ml/core/mixins/dataset.py
def download_dataset_bag(
    self,
    dataset: DatasetSpec,
) -> "DatasetBag":
    """Downloads a dataset to the local filesystem.

    Downloads a dataset specified by DatasetSpec to the local filesystem. If the catalog
    has s3_bucket configured and use_minid is enabled, the bag will be uploaded to S3
    and registered with the MINID service.

    Args:
        dataset: Specification of the dataset to download, including version and materialization options.

    Returns:
        DatasetBag: Object containing:
            - path: Local filesystem path to downloaded dataset
            - rid: Dataset's Resource Identifier
            - minid: Dataset's Minimal Viable Identifier (if MINID enabled)

    Note:
        MINID support requires s3_bucket to be configured when creating the DerivaML instance.
        The catalog's use_minid setting controls whether MINIDs are created.

    Examples:
        Download with default options:
            >>> spec = DatasetSpec(rid="1-abc123")
            >>> bag = ml.download_dataset_bag(dataset=spec)
            >>> print(f"Downloaded to {bag.path}")
    """
    if not self.model.is_dataset_rid(dataset.rid):
        raise DerivaMLTableTypeError("Dataset", dataset.rid)
    ds = self.lookup_dataset(dataset)
    return ds.download_dataset_bag(
        version=dataset.version,
        materialize=dataset.materialize,
        use_minid=self.use_minid,
        exclude_tables=dataset.exclude_tables,
        timeout=dataset.timeout,
        fetch_concurrency=dataset.fetch_concurrency,
    )

download_dir

download_dir(
    cached: bool = False,
) -> Path

Returns the appropriate download directory.

Provides the appropriate directory path for storing downloaded files, either in the cache or working directory.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `cached` | `bool` | If True, returns the cache directory path. If False, returns the working directory path. | `False` |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `Path` | `Path` | Directory path where downloaded files should be stored. |

Example:

    cache_dir = ml.download_dir(cached=True)
    work_dir = ml.download_dir(cached=False)

Source code in src/deriva_ml/core/base.py
def download_dir(self, cached: bool = False) -> Path:
    """Returns the appropriate download directory.

    Provides the appropriate directory path for storing downloaded files, either in the cache or working directory.

    Args:
        cached: If True, returns the cache directory path. If False, returns the working directory path.

    Returns:
        Path: Directory path where downloaded files should be stored.

    Example:
        >>> cache_dir = ml.download_dir(cached=True)
        >>> work_dir = ml.download_dir(cached=False)
    """
    # Return cache directory if cached=True, otherwise working directory
    return self.cache_dir if cached else self.working_dir

estimate_bag_size

estimate_bag_size(
    dataset: "DatasetSpec",
) -> dict[str, Any]

Estimate the size of a dataset bag before downloading.

Generates the same download specification used by download_dataset_bag, then runs COUNT and SUM(Length) queries against the snapshot catalog to preview what a download will contain and how large it will be.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `dataset` | `DatasetSpec` | Specification of the dataset to estimate, including version and optional exclude_tables. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `dict[str, Any]` | Dict with keys: `tables`: dict mapping table name to `{row_count, is_asset, asset_bytes}`; `total_rows`: total row count across all tables; `total_asset_bytes`: total size of asset files in bytes; `total_asset_size`: human-readable size string (e.g., "1.2 GB"). |

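The returned dict can be used to gate a download before it starts. A minimal sketch, assuming only the documented return shape; the threshold, helper name, and sample values are made up:

```python
# Hedged sketch: decide whether to proceed based on an estimate_bag_size() result.
SIZE_LIMIT_BYTES = 10 * 1024**3  # 10 GB local budget (illustrative threshold)

def should_download(estimate: dict) -> bool:
    """True when the estimated asset payload fits the local budget."""
    return estimate["total_asset_bytes"] <= SIZE_LIMIT_BYTES

# Example estimate mirroring the documented keys (values are made up):
sample = {
    "tables": {"Image": {"row_count": 1200, "is_asset": True, "asset_bytes": 2_500_000_000}},
    "total_rows": 1200,
    "total_asset_bytes": 2_500_000_000,
    "total_asset_size": "2.5 GB",
}
print(should_download(sample))  # a 2.5 GB bag fits a 10 GB budget
```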
Source code in src/deriva_ml/core/mixins/dataset.py
def estimate_bag_size(
    self,
    dataset: "DatasetSpec",
) -> dict[str, Any]:
    """Estimate the size of a dataset bag before downloading.

    Generates the same download specification used by download_dataset_bag,
    then runs COUNT and SUM(Length) queries against the snapshot catalog
    to preview what a download will contain and how large it will be.

    Args:
        dataset: Specification of the dataset to estimate, including version
            and optional exclude_tables.

    Returns:
        dict with keys:
            - tables: dict mapping table name to {row_count, is_asset, asset_bytes}
            - total_rows: total row count across all tables
            - total_asset_bytes: total size of asset files in bytes
            - total_asset_size: human-readable size string (e.g., "1.2 GB")
    """
    if not self.model.is_dataset_rid(dataset.rid):
        raise DerivaMLTableTypeError("Dataset", dataset.rid)
    ds = self.lookup_dataset(dataset)
    return ds.estimate_bag_size(
        version=dataset.version,
        exclude_tables=dataset.exclude_tables,
    )

feature_record_class

feature_record_class(
    table: str | Table,
    feature_name: str,
) -> type[FeatureRecord]

Returns a dynamically generated Pydantic model class for creating feature records.

Each feature has a unique set of columns based on its definition (terms, assets, metadata). This method returns a Pydantic class with fields corresponding to those columns, providing:

- **Type validation**: Values are validated against expected types (str, int, float, Path)
- **Required field checking**: Non-nullable columns must be provided
- **Default values**: Feature_Name is pre-filled with the feature's name

Field types in the generated class:

- `{TargetTable}` (str): Required. RID of the target record (e.g., Image RID)
- `Execution` (str, optional): RID of the execution for provenance tracking
- `Feature_Name` (str): Pre-filled with the feature name
- Term columns (str): Accept vocabulary term names
- Asset columns (str | Path): Accept asset RIDs or file paths
- Value columns: Accept values matching the column type (int, float, str)

Use lookup_feature() to inspect the feature's structure and see what columns are available.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `table` | `str \| Table` | The table containing the feature, either as name or Table object. | *required* |
| `feature_name` | `str` | Name of the feature to create a record class for. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `type[FeatureRecord]` | A Pydantic model class for creating validated feature records. The class name follows the pattern `{TargetTable}Feature{FeatureName}`. |

Raises:

| Type | Description |
| --- | --- |
| `DerivaMLException` | If the feature doesn't exist or the table is invalid. |

Example:

    # Get the dynamically generated class
    DiagnosisFeature = ml.feature_record_class("Image", "Diagnosis")

    # Create a validated feature record
    record = DiagnosisFeature(
        Image="1-ABC",           # Target record RID
        Diagnosis_Type="Normal", # Vocabulary term
        confidence=0.95,         # Metadata column
        Execution="2-XYZ",       # Provenance
    )

    # Convert to dict for insertion
    record.model_dump()
    # {'Image': '1-ABC', 'Diagnosis_Type': 'Normal', 'confidence': 0.95, ...}

Source code in src/deriva_ml/core/mixins/feature.py
def feature_record_class(self, table: str | Table, feature_name: str) -> type[FeatureRecord]:
    """Returns a dynamically generated Pydantic model class for creating feature records.

    Each feature has a unique set of columns based on its definition (terms, assets, metadata).
    This method returns a Pydantic class with fields corresponding to those columns, providing:

    - **Type validation**: Values are validated against expected types (str, int, float, Path)
    - **Required field checking**: Non-nullable columns must be provided
    - **Default values**: Feature_Name is pre-filled with the feature's name

    **Field types in the generated class:**
    - `{TargetTable}` (str): Required. RID of the target record (e.g., Image RID)
    - `Execution` (str, optional): RID of the execution for provenance tracking
    - `Feature_Name` (str): Pre-filled with the feature name
    - Term columns (str): Accept vocabulary term names
    - Asset columns (str | Path): Accept asset RIDs or file paths
    - Value columns: Accept values matching the column type (int, float, str)

    Use `lookup_feature()` to inspect the feature's structure and see what columns
    are available.

    Args:
        table: The table containing the feature, either as name or Table object.
        feature_name: Name of the feature to create a record class for.

    Returns:
        type[FeatureRecord]: A Pydantic model class for creating validated feature records.
            The class name follows the pattern `{TargetTable}Feature{FeatureName}`.

    Raises:
        DerivaMLException: If the feature doesn't exist or the table is invalid.

    Example:
        >>> # Get the dynamically generated class
        >>> DiagnosisFeature = ml.feature_record_class("Image", "Diagnosis")
        >>>
        >>> # Create a validated feature record
        >>> record = DiagnosisFeature(
        ...     Image="1-ABC",           # Target record RID
        ...     Diagnosis_Type="Normal", # Vocabulary term
        ...     confidence=0.95,         # Metadata column
        ...     Execution="2-XYZ"        # Provenance
        ... )
        >>>
        >>> # Convert to dict for insertion
        >>> record.model_dump()
        {'Image': '1-ABC', 'Diagnosis_Type': 'Normal', 'confidence': 0.95, ...}
    """
    # Look up a feature and return its record class
    return self.lookup_feature(table, feature_name).feature_record_class()

fetch_table_features

fetch_table_features(
    table: Table | str,
    feature_name: str | None = None,
    selector: Callable[
        [list[FeatureRecord]],
        FeatureRecord,
    ]
    | None = None,
) -> dict[str, list[FeatureRecord]]

Fetch all feature values for a table, grouped by feature name.

Returns a dictionary mapping feature names to lists of FeatureRecord instances. This is useful for retrieving all annotations on a table in a single call — for example, getting all classification labels, quality scores, and bounding boxes for a set of images at once.

Selector for resolving multiple values:

An asset may have multiple values for the same feature — for example, labels from different annotators, or predictions from successive model runs. When a selector is provided, records are grouped by target RID and the selector is called once per group to pick a single value. Groups with only one record are passed through unchanged.

A selector is any callable with signature (list[FeatureRecord]) -> FeatureRecord. Built-in selectors:

- `FeatureRecord.select_newest` — picks the record with the most recent RCT (Row Creation Time).

Custom selector example::

def select_highest_confidence(records):
    return max(records, key=lambda r: getattr(r, "Confidence", 0))

For workflow-aware selection, see select_by_workflow().

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `table` | `Table \| str` | The table to fetch features for (name or Table object). | *required* |
| `feature_name` | `str \| None` | If provided, only fetch values for this specific feature. If None, fetch all features on the table. | `None` |
| `selector` | `Callable[[list[FeatureRecord]], FeatureRecord] \| None` | Optional function to select among multiple feature values for the same target object. Receives a list of FeatureRecord instances (all for the same target RID) and returns the selected one. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `dict[str, list[FeatureRecord]]` | Keys are feature names, values are lists of FeatureRecord instances. When a selector is provided, each target object appears at most once per feature. |

Raises:

| Type | Description |
| --- | --- |
| `DerivaMLException` | If a specified `feature_name` doesn't exist on the table. |

Examples:

Fetch all features for a table::

>>> features = ml.fetch_table_features("Image")
>>> for name, records in features.items():
...     print(f"{name}: {len(records)} values")

Fetch a single feature with newest-value selection::

>>> features = ml.fetch_table_features(
...     "Image",
...     feature_name="Classification",
...     selector=FeatureRecord.select_newest,
... )

Convert results to a DataFrame::

>>> features = ml.fetch_table_features("Image", feature_name="Quality")
>>> import pandas as pd
>>> df = pd.DataFrame([r.model_dump() for r in features["Quality"]])
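The selector protocol can be exercised without a catalog connection. A minimal runnable sketch using a stand-in record class (a real FeatureRecord requires a live feature definition; the class and values below are purely illustrative):

```python
from dataclasses import dataclass

@dataclass
class StubRecord:  # stand-in for FeatureRecord, illustration only
    Image: str
    Confidence: float

def select_highest_confidence(records):
    # Same logic as the custom selector example in the docstring above.
    return max(records, key=lambda r: getattr(r, "Confidence", 0))

# Two competing values for the same target RID; the selector keeps one.
group = [StubRecord("1-ABC", 0.42), StubRecord("1-ABC", 0.97)]
best = select_highest_confidence(group)
print(best.Confidence)  # → 0.97
```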
Source code in src/deriva_ml/core/mixins/feature.py
def fetch_table_features(
    self,
    table: Table | str,
    feature_name: str | None = None,
    selector: Callable[[list[FeatureRecord]], FeatureRecord] | None = None,
) -> dict[str, list[FeatureRecord]]:
    """Fetch all feature values for a table, grouped by feature name.

    Returns a dictionary mapping feature names to lists of FeatureRecord
    instances. This is useful for retrieving all annotations on a table
    in a single call — for example, getting all classification labels,
    quality scores, and bounding boxes for a set of images at once.

    **Selector for resolving multiple values:**

    An asset may have multiple values for the same feature — for example,
    labels from different annotators, or predictions from successive model
    runs. When a ``selector`` is provided, records are grouped by target
    RID and the selector is called once per group to pick a single value.
    Groups with only one record are passed through unchanged.

    A selector is any callable with signature
    ``(list[FeatureRecord]) -> FeatureRecord``. Built-in selectors:

    - ``FeatureRecord.select_newest`` — picks the record with the most
      recent ``RCT`` (Row Creation Time).

    Custom selector example::

        def select_highest_confidence(records):
            return max(records, key=lambda r: getattr(r, "Confidence", 0))

    For workflow-aware selection, see ``select_by_workflow()``.

    Args:
        table: The table to fetch features for (name or Table object).
        feature_name: If provided, only fetch values for this specific
            feature. If ``None``, fetch all features on the table.
        selector: Optional function to select among multiple feature values
            for the same target object. Receives a list of FeatureRecord
            instances (all for the same target RID) and returns the selected
            one.

    Returns:
        dict[str, list[FeatureRecord]]: Keys are feature names, values are
        lists of FeatureRecord instances. When a selector is provided, each
        target object appears at most once per feature.

    Raises:
        DerivaMLException: If a specified ``feature_name`` doesn't exist
            on the table.

    Examples:
        Fetch all features for a table::

            >>> features = ml.fetch_table_features("Image")
            >>> for name, records in features.items():
            ...     print(f"{name}: {len(records)} values")

        Fetch a single feature with newest-value selection::

            >>> features = ml.fetch_table_features(
            ...     "Image",
            ...     feature_name="Classification",
            ...     selector=FeatureRecord.select_newest,
            ... )

        Convert results to a DataFrame::

            >>> features = ml.fetch_table_features("Image", feature_name="Quality")
            >>> import pandas as pd
            >>> df = pd.DataFrame([r.model_dump() for r in features["Quality"]])
    """
    table = self.model.name_to_table(table)
    features = self.find_features(table)
    if feature_name is not None:
        features = [f for f in features if f.feature_name == feature_name]
        if not features:
            raise DerivaMLException(
                f"Feature '{feature_name}' not found on table '{table.name}'."
            )

    result: dict[str, list[FeatureRecord]] = {}

    for feat in features:
        record_class = feat.feature_record_class()
        field_names = set(record_class.model_fields.keys())
        target_col = feat.target_table.name

        # Query all feature values
        pb = self.pathBuilder()
        raw_values = (
            pb.schemas[feat.feature_table.schema.name]
            .tables[feat.feature_table.name]
            .entities()
            .fetch()
        )

        records: list[FeatureRecord] = []
        for raw_value in raw_values:
            filtered_data = {k: v for k, v in raw_value.items() if k in field_names}
            records.append(record_class(**filtered_data))

        if selector and records:
            # Group by target RID and apply selector
            grouped: dict[str, list[FeatureRecord]] = defaultdict(list)
            for rec in records:
                target_rid = getattr(rec, target_col, None)
                if target_rid is not None:
                    grouped[target_rid].append(rec)
            records = [
                selector(group) if len(group) > 1 else group[0]
                for group in grouped.values()
            ]

        result[feat.feature_name] = records

    return result

find_assets

find_assets(
    asset_table: Table
    | str
    | None = None,
    asset_type: str | None = None,
) -> Iterable["Asset"]

Find assets in the catalog.

Returns an iterable of Asset objects matching the specified criteria. If no criteria are specified, returns all assets from all asset tables.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `asset_table` | `Table \| str \| None` | Optional table or table name to search. If None, searches all asset tables. | `None` |
| `asset_type` | `str \| None` | Optional asset type to filter by. Only returns assets with this type. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `Iterable[Asset]` | Iterable of Asset objects matching the criteria. |

Example:

    # Find all assets in the Model table
    models = list(ml.find_assets(asset_table="Model"))

    # Find all assets with type "Training_Data"
    training = list(ml.find_assets(asset_type="Training_Data"))

    # Find all assets across all tables
    all_assets = list(ml.find_assets())

Source code in src/deriva_ml/core/mixins/asset.py
def find_assets(
    self,
    asset_table: Table | str | None = None,
    asset_type: str | None = None,
) -> Iterable["Asset"]:
    """Find assets in the catalog.

    Returns an iterable of Asset objects matching the specified criteria.
    If no criteria are specified, returns all assets from all asset tables.

    Args:
        asset_table: Optional table or table name to search. If None, searches
            all asset tables.
        asset_type: Optional asset type to filter by. Only returns assets
            with this type.

    Returns:
        Iterable of Asset objects matching the criteria.

    Example:
        >>> # Find all assets in the Model table
        >>> models = list(ml.find_assets(asset_table="Model"))

        >>> # Find all assets with type "Training_Data"
        >>> training = list(ml.find_assets(asset_type="Training_Data"))

        >>> # Find all assets across all tables
        >>> all_assets = list(ml.find_assets())
    """
    # Determine which tables to search
    if asset_table is not None:
        tables = [self.model.name_to_table(asset_table)]
    else:
        tables = self.list_asset_tables()

    for table in tables:
        # Get all assets from this table (now returns Asset objects)
        for asset in self.list_assets(table):
            # Filter by asset type if specified
            if asset_type is not None:
                if asset_type not in asset.asset_types:
                    continue
            yield asset

find_datasets

find_datasets(
    deleted: bool = False,
) -> Iterable["Dataset"]

List all datasets in the catalog.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `deleted` | `bool` | If True, include datasets that have been marked as deleted. | `False` |

Returns:

| Type | Description |
|------|-------------|
| `Iterable[Dataset]` | Iterable of Dataset objects. |

Example

```python
datasets = list(ml.find_datasets())
for ds in datasets:
    print(f"{ds.dataset_rid}: {ds.description}")
```

Source code in src/deriva_ml/core/mixins/dataset.py
def find_datasets(self, deleted: bool = False) -> Iterable["Dataset"]:
    """List all datasets in the catalog.

    Args:
        deleted: If True, include datasets that have been marked as deleted.

    Returns:
        Iterable of Dataset objects.

    Example:
        >>> datasets = list(ml.find_datasets())
        >>> for ds in datasets:
        ...     print(f"{ds.dataset_rid}: {ds.description}")
    """
    # Import here to avoid circular imports
    from deriva_ml.dataset.dataset import Dataset

    # Get datapath to the Dataset table
    pb = self.pathBuilder()
    dataset_path = pb.schemas[self._dataset_table.schema.name].tables[self._dataset_table.name]

    if deleted:
        filtered_path = dataset_path
    else:
        filtered_path = dataset_path.filter(
            (dataset_path.Deleted == False) | (dataset_path.Deleted == None)  # noqa: E711, E712
        )

    # Create Dataset objects - dataset_types is now a property that fetches from catalog
    datasets = []
    for dataset in filtered_path.entities().fetch():
        datasets.append(
            Dataset(
                self,  # type: ignore[arg-type]
                dataset_rid=dataset["RID"],
                description=dataset["Description"],
            )
        )
    return datasets

find_experiments

find_experiments(
    workflow_rid: RID | None = None,
    status: Status | None = None,
) -> Iterable["Experiment"]

List all experiments (executions with Hydra configuration) in the catalog.

Creates Experiment objects for analyzing completed ML model runs. Only returns executions that have Hydra configuration metadata (i.e., an Execution_Metadata asset whose filename ends in `-config.yaml`).

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `workflow_rid` | `RID \| None` | Optional workflow RID to filter by. | `None` |
| `status` | `Status \| None` | Optional status to filter by (e.g., `Status.Completed`). | `None` |

Returns:

| Type | Description |
|------|-------------|
| `Iterable[Experiment]` | Iterable of Experiment objects for executions with Hydra config. |

Example

```python
experiments = list(ml.find_experiments(status=Status.Completed))
for exp in experiments:
    print(f"{exp.name}: {exp.config_choices}")
```
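Hydra-config detection is filename-based: the source matches any Execution_Metadata filename ending in `-config.yaml`. A standalone check of that pattern (the example filenames are hypothetical):

```python
import re

# Pattern used by find_experiments to detect Hydra config metadata files
config_pattern = re.compile(r".*-config\.yaml$")

print(bool(config_pattern.match("exp42-config.yaml")))  # True
print(bool(config_pattern.match("settings.yaml")))      # False
```

Note that `re.match` anchors at the start of the string, so it is the trailing `$` that restricts the match to the filename suffix.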

Source code in src/deriva_ml/core/mixins/execution.py
def find_experiments(
    self,
    workflow_rid: RID | None = None,
    status: Status | None = None,
) -> Iterable["Experiment"]:
    """List all experiments (executions with Hydra configuration) in the catalog.

    Creates Experiment objects for analyzing completed ML model runs.
    Only returns executions that have Hydra configuration metadata
    (i.e., an Execution_Metadata asset whose filename ends in ``-config.yaml``).

    Args:
        workflow_rid: Optional workflow RID to filter by.
        status: Optional status to filter by (e.g., Status.Completed).

    Returns:
        Iterable of Experiment objects for executions with Hydra config.

    Example:
        >>> experiments = list(ml.find_experiments(status=Status.Completed))
        >>> for exp in experiments:
        ...     print(f"{exp.name}: {exp.config_choices}")
    """
    import re

    from deriva_ml.experiment import Experiment

    # Get datapath to tables
    pb = self.pathBuilder()
    execution_path = pb.schemas[self.ml_schema].Execution
    metadata_path = pb.schemas[self.ml_schema].Execution_Metadata
    meta_exec_path = pb.schemas[self.ml_schema].Execution_Metadata_Execution

    # Find executions that have metadata assets with config.yaml files
    # Query the association table to find executions with hydra config metadata
    exec_rids_with_config = set()

    # Get all metadata records and filter for config.yaml files in Python
    # (ERMrest regex support varies by deployment)
    config_pattern = re.compile(r".*-config\.yaml$")
    config_metadata_rids = set()
    for meta in metadata_path.entities().fetch():
        filename = meta.get("Filename", "")
        if filename and config_pattern.match(filename):
            config_metadata_rids.add(meta["RID"])

    if config_metadata_rids:
        # Query the association table to find which executions have these metadata
        for assoc_record in meta_exec_path.entities().fetch():
            if assoc_record.get("Execution_Metadata") in config_metadata_rids:
                exec_rids_with_config.add(assoc_record["Execution"])

    # Apply additional filters and yield Experiment objects
    filtered_path = execution_path
    if workflow_rid:
        filtered_path = filtered_path.filter(execution_path.Workflow == workflow_rid)
    if status:
        filtered_path = filtered_path.filter(execution_path.Status == status.value)

    for exec_record in filtered_path.entities().fetch():
        if exec_record["RID"] in exec_rids_with_config:
            yield Experiment(self, exec_record["RID"])  # type: ignore[arg-type]

find_features

find_features(
    table: str | Table | None = None,
) -> list[Feature]

Find feature definitions in the schema.

Discovers features by inspecting the catalog schema for association tables that have Feature_Name and Execution columns. Returns Feature objects describing each feature's structure (target table, term/asset/value columns), not the feature values themselves.

Use fetch_table_features or list_feature_values to retrieve actual feature values.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `table` | `str \| Table \| None` | Optional table to find features for. If None, returns all feature definitions across all tables. | `None` |

Returns:

| Type | Description |
|------|-------------|
| `list[Feature]` | A list of Feature instances describing the feature definitions. |

Examples:

Find all feature definitions:

```python
all_features = ml.find_features()
for f in all_features:
    print(f"{f.target_table.name}.{f.feature_name}")
```

Find features defined on a specific table:

```python
image_features = ml.find_features("Image")
print([f.feature_name for f in image_features])
```

Source code in src/deriva_ml/core/mixins/feature.py
def find_features(self, table: str | Table | None = None) -> list[Feature]:
    """Find feature definitions in the schema.

    Discovers features by inspecting the catalog schema for association tables
    that have ``Feature_Name`` and ``Execution`` columns. Returns Feature objects
    describing each feature's structure (target table, term/asset/value columns),
    not the feature values themselves.

    Use ``fetch_table_features`` or ``list_feature_values`` to retrieve actual
    feature values.

    Args:
        table: Optional table to find features for. If None, returns all feature
            definitions across all tables.

    Returns:
        A list of Feature instances describing the feature definitions.

    Examples:
        Find all feature definitions:
            >>> all_features = ml.find_features()
            >>> for f in all_features:
            ...     print(f"{f.target_table.name}.{f.feature_name}")

        Find features defined on a specific table:
            >>> image_features = ml.find_features("Image")
            >>> print([f.feature_name for f in image_features])
    """
    return list(self.model.find_features(table))

find_workflows

find_workflows() -> list[Workflow]

Find all workflows in the catalog.

Catalog-level operation to find all workflow definitions, including their names, URLs, types, versions, and descriptions. Each returned Workflow is bound to the catalog, allowing its description to be updated.

Returns:

list[Workflow]: List of workflow objects, each containing:

- `name`: Workflow name
- `url`: Source code URL
- `workflow_type`: Type(s) of workflow
- `version`: Version identifier
- `description`: Workflow description
- `rid`: Resource identifier
- `checksum`: Source code checksum

Examples:

List all workflows and their descriptions::

>>> workflows = ml.find_workflows()
>>> for w in workflows:
...     print(f"{w.name} (v{w.version}): {w.description}")
...     print(f"  Source: {w.url}")

Update a workflow's description (workflows are catalog-bound)::

>>> workflows = ml.find_workflows()
>>> workflows[0].description = "Updated description"
Source code in src/deriva_ml/core/mixins/workflow.py
def find_workflows(self) -> list[Workflow]:
    """Find all workflows in the catalog.

    Catalog-level operation to find all workflow definitions, including their
    names, URLs, types, versions, and descriptions. Each returned Workflow
    is bound to the catalog, allowing its description to be updated.

    Returns:
        list[Workflow]: List of workflow objects, each containing:
            - name: Workflow name
            - url: Source code URL
            - workflow_type: Type(s) of workflow
            - version: Version identifier
            - description: Workflow description
            - rid: Resource identifier
            - checksum: Source code checksum

    Examples:
        List all workflows and their descriptions::

            >>> workflows = ml.find_workflows()
            >>> for w in workflows:
            ...     print(f"{w.name} (v{w.version}): {w.description}")
            ...     print(f"  Source: {w.url}")

        Update a workflow's description (workflows are catalog-bound)::

            >>> workflows = ml.find_workflows()
            >>> workflows[0].description = "Updated description"
    """
    # Get a workflow table path and fetch all workflows
    workflow_path = self.pathBuilder().schemas[self.ml_schema].Workflow
    workflows = []
    for w in workflow_path.entities().fetch():
        workflow_types = self._get_workflow_types_for_rid(w["RID"])
        workflow = Workflow(
            name=w["Name"],
            url=w["URL"],
            workflow_type=workflow_types,
            version=w["Version"],
            description=w["Description"],
            rid=w["RID"],
            checksum=w["Checksum"],
        )
        # Bind the workflow to this catalog instance
        workflow._ml_instance = self  # type: ignore[assignment]
        workflows.append(workflow)
    return workflows

from_context classmethod

from_context(
    path: Path | str | None = None,
) -> Self

Create a DerivaML instance from a .deriva-context.json file.

Searches for .deriva-context.json starting from path (default: cwd), walking up parent directories. This enables scripts generated by Claude to connect to the same catalog without hardcoding connection details.

The context file is written by the MCP server's connect_catalog tool and contains hostname, catalog_id, and default_schema.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `path` | `Path \| str \| None` | Starting directory to search for the context file. Defaults to the current working directory. | `None` |

Returns:

| Type | Description |
|------|-------------|
| `Self` | A new DerivaML instance configured from the context file. |

Raises:

| Type | Description |
|------|-------------|
| `FileNotFoundError` | If no .deriva-context.json is found. |

Example:

```python
# In a script generated by Claude:
from deriva_ml import DerivaML
ml = DerivaML.from_context()
subjects = ml.cache_table("Subject")
```
Source code in src/deriva_ml/core/base.py
@classmethod
def from_context(cls, path: Path | str | None = None) -> Self:
    """Create a DerivaML instance from a .deriva-context.json file.

    Searches for .deriva-context.json starting from ``path`` (default: cwd),
    walking up parent directories. This enables scripts generated by Claude
    to connect to the same catalog without hardcoding connection details.

    The context file is written by the MCP server's ``connect_catalog`` tool
    and contains hostname, catalog_id, and default_schema.

    Args:
        path: Starting directory to search for the context file.
            Defaults to the current working directory.

    Returns:
        A new DerivaML instance configured from the context file.

    Raises:
        FileNotFoundError: If no .deriva-context.json is found.

    Example::

        # In a script generated by Claude:
        from deriva_ml import DerivaML
        ml = DerivaML.from_context()
        subjects = ml.cache_table("Subject")
    """
    import json

    start = Path(path) if path else Path.cwd()
    context_file = _find_context_file(start)
    with open(context_file) as f:
        ctx = json.load(f)

    kwargs: dict[str, Any] = {
        "hostname": ctx["hostname"],
        "catalog_id": ctx["catalog_id"],
    }
    if ctx.get("default_schema"):
        kwargs["default_schema"] = ctx["default_schema"]
    if ctx.get("working_dir"):
        kwargs["working_dir"] = ctx["working_dir"]

    return cls(**kwargs)

get_cache_size

get_cache_size() -> dict[str, int | float]

Get the current size of the cache directory.

Returns:

dict with keys:

- `total_bytes`: Total size in bytes
- `total_mb`: Total size in megabytes
- `file_count`: Number of files
- `dir_count`: Number of directories

Example

```python
ml = DerivaML('deriva.example.org', 'my_catalog')
size = ml.get_cache_size()
print(f"Cache size: {size['total_mb']:.1f} MB ({size['file_count']} files)")
```

Source code in src/deriva_ml/core/base.py
def get_cache_size(self) -> dict[str, int | float]:
    """Get the current size of the cache directory.

    Returns:
        dict with keys:
            - 'total_bytes': Total size in bytes
            - 'total_mb': Total size in megabytes
            - 'file_count': Number of files
            - 'dir_count': Number of directories

    Example:
        >>> ml = DerivaML('deriva.example.org', 'my_catalog')
        >>> size = ml.get_cache_size()
        >>> print(f"Cache size: {size['total_mb']:.1f} MB ({size['file_count']} files)")
    """
    stats = {'total_bytes': 0, 'total_mb': 0.0, 'file_count': 0, 'dir_count': 0}

    if not self.cache_dir.exists():
        return stats

    for entry in self.cache_dir.rglob('*'):
        if entry.is_file():
            stats['total_bytes'] += entry.stat().st_size
            stats['file_count'] += 1
        elif entry.is_dir():
            stats['dir_count'] += 1

    stats['total_mb'] = stats['total_bytes'] / (1024 * 1024)
    return stats

get_column_annotations

get_column_annotations(
    table: str | Table, column_name: str
) -> dict[str, Any]

Get all display-related annotations for a column.

Returns the current values of display and column-display annotations for the specified column.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `table` | `str \| Table` | Table name or Table object containing the column. | required |
| `column_name` | `str` | Name of the column. | required |

Returns:

| Type | Description |
|------|-------------|
| `dict[str, Any]` | Dictionary with keys: table, column, display, column_display. Missing annotations are None. |

Example

```python
annotations = ml.get_column_annotations("Image", "Filename")
print(annotations["display"])
```

Source code in src/deriva_ml/core/mixins/annotation.py
@validate_call(config=ConfigDict(arbitrary_types_allowed=True))
def get_column_annotations(self, table: str | Table, column_name: str) -> dict[str, Any]:
    """Get all display-related annotations for a column.

    Returns the current values of display and column-display annotations
    for the specified column.

    Args:
        table: Table name or Table object containing the column.
        column_name: Name of the column.

    Returns:
        Dictionary with keys: table, column, display, column_display.
        Missing annotations are None.

    Example:
        >>> annotations = ml.get_column_annotations("Image", "Filename")
        >>> print(annotations["display"])
    """
    table_obj = self.model.name_to_table(table)
    column = table_obj.columns[column_name]
    return {
        "table": table_obj.name,
        "column": column.name,
        "display": column.annotations.get(DISPLAY_TAG),
        "column_display": column.annotations.get(COLUMN_DISPLAY_TAG),
    }

get_handlebars_template_variables

get_handlebars_template_variables(
    table: str | Table,
) -> dict[str, Any]

Get all available template variables for a table.

Returns the columns, foreign keys, and special variables that can be used in Handlebars templates (row_markdown_pattern, markdown_pattern, etc.) for the specified table.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `table` | `str \| Table` | Table name or Table object. | required |

Returns:

| Type | Description |
|------|-------------|
| `dict[str, Any]` | Dictionary with columns, foreign_keys, special_variables, and helper_examples. |

Example

```python
vars = ml.get_handlebars_template_variables("Image")
for col in vars["columns"]:
    print(f"{col['name']}: {col['template']}")
```

Source code in src/deriva_ml/core/mixins/annotation.py
@validate_call(config=ConfigDict(arbitrary_types_allowed=True))
def get_handlebars_template_variables(self, table: str | Table) -> dict[str, Any]:
    """Get all available template variables for a table.

    Returns the columns, foreign keys, and special variables that can be
    used in Handlebars templates (row_markdown_pattern, markdown_pattern, etc.)
    for the specified table.

    Args:
        table: Table name or Table object.

    Returns:
        Dictionary with columns, foreign_keys, special_variables, and helper_examples.

    Example:
        >>> vars = ml.get_handlebars_template_variables("Image")
        >>> for col in vars["columns"]:
        ...     print(f"{col['name']}: {col['template']}")
    """
    table_obj = self.model.name_to_table(table)

    # Get columns
    columns = []
    for col in table_obj.columns:
        columns.append({
            "name": col.name,
            "type": str(col.type.typename),
            "template": "{{{" + col.name + "}}}",
            "row_template": "{{{_row." + col.name + "}}}",
        })

    # Get foreign keys (outbound)
    foreign_keys = []
    for fkey in table_obj.foreign_keys:
        schema_name = fkey.constraint_schema.name
        constraint_name = fkey.constraint_name
        fk_path = f"$fkeys.{schema_name}.{constraint_name}"

        # Get columns from referenced table
        ref_columns = [col.name for col in fkey.pk_table.columns]

        foreign_keys.append({
            "constraint": [schema_name, constraint_name],
            "from_columns": [col.name for col in fkey.columns],
            "to_table": fkey.pk_table.name,
            "to_columns": ref_columns,
            "values_template": "{{{" + fk_path + ".values.COLUMN}}}",
            "row_name_template": "{{{" + fk_path + ".rowName}}}",
            "example_column_templates": [
                "{{{" + fk_path + ".values." + c + "}}}"
                for c in ref_columns[:3]  # Show first 3 as examples
            ]
        })

    return {
        "table": table_obj.name,
        "columns": columns,
        "foreign_keys": foreign_keys,
        "special_variables": {
            "_value": {
                "description": "Current column value (in column_display)",
                "template": "{{{_value}}}"
            },
            "_row": {
                "description": "Object with all row columns",
                "template": "{{{_row.column_name}}}"
            },
            "$catalog.id": {
                "description": "Catalog ID",
                "template": "{{{$catalog.id}}}"
            },
            "$catalog.snapshot": {
                "description": "Current snapshot ID",
                "template": "{{{$catalog.snapshot}}}"
            },
        },
        "helper_examples": {
            "conditional": "{{#if column}}...{{else}}...{{/if}}",
            "iteration": "{{#each array}}{{{this}}}{{/each}}",
            "comparison": "{{#ifCond val1 '==' val2}}...{{/ifCond}}",
            "date_format": "{{formatDate RCT 'YYYY-MM-DD'}}",
            "json_output": "{{toJSON object}}"
        }
    }

get_storage_summary

get_storage_summary() -> dict[str, Any]

Get a summary of local storage usage.

Returns:

dict with keys:

- `working_dir`: Path to working directory
- `cache_dir`: Path to cache directory
- `cache_size_mb`: Cache size in MB
- `cache_file_count`: Number of files in cache
- `execution_dir_count`: Number of execution directories
- `execution_size_mb`: Total size of execution directories in MB
- `total_size_mb`: Combined size in MB

Example

```python
ml = DerivaML('deriva.example.org', 'my_catalog')
summary = ml.get_storage_summary()
print(f"Total storage: {summary['total_size_mb']:.1f} MB")
print(f"  Cache: {summary['cache_size_mb']:.1f} MB")
print(f"  Executions: {summary['execution_size_mb']:.1f} MB")
```

Source code in src/deriva_ml/core/base.py
def get_storage_summary(self) -> dict[str, Any]:
    """Get a summary of local storage usage.

    Returns:
        dict with keys:
            - 'working_dir': Path to working directory
            - 'cache_dir': Path to cache directory
            - 'cache_size_mb': Cache size in MB
            - 'cache_file_count': Number of files in cache
            - 'execution_dir_count': Number of execution directories
            - 'execution_size_mb': Total size of execution directories in MB
            - 'total_size_mb': Combined size in MB

    Example:
        >>> ml = DerivaML('deriva.example.org', 'my_catalog')
        >>> summary = ml.get_storage_summary()
        >>> print(f"Total storage: {summary['total_size_mb']:.1f} MB")
        >>> print(f"  Cache: {summary['cache_size_mb']:.1f} MB")
        >>> print(f"  Executions: {summary['execution_size_mb']:.1f} MB")
    """
    cache_stats = self.get_cache_size()
    exec_dirs = self.list_execution_dirs()

    exec_size_mb = sum(d['size_mb'] for d in exec_dirs)

    return {
        'working_dir': str(self.working_dir),
        'cache_dir': str(self.cache_dir),
        'cache_size_mb': cache_stats['total_mb'],
        'cache_file_count': cache_stats['file_count'],
        'execution_dir_count': len(exec_dirs),
        'execution_size_mb': exec_size_mb,
        'total_size_mb': cache_stats['total_mb'] + exec_size_mb,
    }

get_table_annotations

get_table_annotations(
    table: str | Table,
) -> dict[str, Any]

Get all display-related annotations for a table.

Returns the current values of display, visible-columns, visible-foreign-keys, and table-display annotations for the specified table.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `table` | `str \| Table` | Table name or Table object. | required |

Returns:

| Type | Description |
|------|-------------|
| `dict[str, Any]` | Dictionary with keys: table, schema, display, visible_columns, visible_foreign_keys, table_display. Missing annotations are None. |

Example

```python
annotations = ml.get_table_annotations("Image")
print(annotations["visible_columns"])
```

Source code in src/deriva_ml/core/mixins/annotation.py
@validate_call(config=ConfigDict(arbitrary_types_allowed=True))
def get_table_annotations(self, table: str | Table) -> dict[str, Any]:
    """Get all display-related annotations for a table.

    Returns the current values of display, visible-columns, visible-foreign-keys,
    and table-display annotations for the specified table.

    Args:
        table: Table name or Table object.

    Returns:
        Dictionary with keys: table, schema, display, visible_columns,
        visible_foreign_keys, table_display. Missing annotations are None.

    Example:
        >>> annotations = ml.get_table_annotations("Image")
        >>> print(annotations["visible_columns"])
    """
    table_obj = self.model.name_to_table(table)
    return {
        "table": table_obj.name,
        "schema": table_obj.schema.name,
        "display": table_obj.annotations.get(DISPLAY_TAG),
        "visible_columns": table_obj.annotations.get(VISIBLE_COLUMNS_TAG),
        "visible_foreign_keys": table_obj.annotations.get(VISIBLE_FOREIGN_KEYS_TAG),
        "table_display": table_obj.annotations.get(TABLE_DISPLAY_TAG),
    }

get_table_as_dataframe

get_table_as_dataframe(
    table: str,
) -> pd.DataFrame

Get table contents as a pandas DataFrame.

Retrieves all contents of a table from the catalog.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `table` | `str` | Name of the table to retrieve. | required |

Returns:

| Type | Description |
|------|-------------|
| `DataFrame` | DataFrame containing all table contents. |
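Under the hood this simply materializes the row dictionaries from `get_table_as_dict` into a pandas DataFrame. A minimal sketch of that transformation, with hypothetical row dicts standing in for a live catalog fetch:

```python
import pandas as pd

# Hypothetical rows, shaped like the dicts get_table_as_dict yields
rows = [
    {"RID": "1-A1", "Filename": "img001.png"},
    {"RID": "1-A2", "Filename": "img002.png"},
]

# Equivalent in spirit to: pd.DataFrame(list(ml.get_table_as_dict("Image")))
df = pd.DataFrame(rows)
print(df.shape)  # (2, 2)
```

Because the whole table is fetched and materialized, prefer this for tables that comfortably fit in memory.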

Source code in src/deriva_ml/core/mixins/path_builder.py
def get_table_as_dataframe(self, table: str) -> pd.DataFrame:
    """Get table contents as a pandas DataFrame.

    Retrieves all contents of a table from the catalog.

    Args:
        table: Name of the table to retrieve.

    Returns:
        DataFrame containing all table contents.
    """
    return pd.DataFrame(list(self.get_table_as_dict(table)))

get_table_as_dict

get_table_as_dict(
    table: str,
) -> Iterable[dict[str, Any]]

Get table contents as dictionaries.

Retrieves all contents of a table from the catalog.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `table` | `str` | Name of the table to retrieve. | required |

Returns:

| Type | Description |
|------|-------------|
| `Iterable[dict[str, Any]]` | Iterable yielding dictionaries for each row. |
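Because rows are yielded one at a time, large tables can be aggregated without materializing every row at once. A sketch with a stand-in generator in place of the catalog fetch (`fake_table_rows` is hypothetical):

```python
from typing import Any, Iterator

def fake_table_rows() -> Iterator[dict[str, Any]]:
    # Stand-in for ml.get_table_as_dict("Measurement"); yields row dicts lazily
    for i in range(3):
        yield {"RID": f"1-{i}", "Value": i * 10}

# Aggregate without holding the whole table in memory
total = sum(row["Value"] for row in fake_table_rows())
print(total)  # 30
```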

Source code in src/deriva_ml/core/mixins/path_builder.py
def get_table_as_dict(self, table: str) -> Iterable[dict[str, Any]]:
    """Get table contents as dictionaries.

    Retrieves all contents of a table from the catalog.

    Args:
        table: Name of the table to retrieve.

    Returns:
        Iterable yielding dictionaries for each row.
    """
    table_obj = self.model.name_to_table(table)
    pb = self.pathBuilder()
    yield from pb.schemas[table_obj.schema.name].tables[table_obj.name].entities().fetch()

globus_login staticmethod

globus_login(host: str) -> None

Authenticate with Globus to obtain credentials for a Deriva server.

Initiates a Globus Native Login flow to obtain OAuth2 tokens required by the Deriva server. The flow uses a device-code grant (no browser or local server), and stores refresh tokens so that subsequent calls can re-authenticate silently. The BDBag keychain is also updated so that bag downloads can use the same credentials.

If the user is already logged in for the given host, a message is printed and no further action is taken.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `host` | `str` | Hostname of the Deriva server to authenticate with (e.g., "www.eye-ai.org"). | required |

Example

```python
DerivaML.globus_login('www.eye-ai.org')
# prints "Login Successful" on first login
```

Source code in src/deriva_ml/core/base.py
@staticmethod
def globus_login(host: str) -> None:
    """Authenticate with Globus to obtain credentials for a Deriva server.

    Initiates a Globus Native Login flow to obtain OAuth2 tokens required
    by the Deriva server.  The flow uses a device-code grant (no browser
    or local server), and stores refresh tokens so that subsequent calls
    can re-authenticate silently.  The BDBag keychain is also updated so
    that bag downloads can use the same credentials.

    If the user is already logged in for the given host, a message is
    printed and no further action is taken.

    Args:
        host: Hostname of the Deriva server to authenticate with
            (e.g., ``"www.eye-ai.org"``).

    Example:
        >>> DerivaML.globus_login('www.eye-ai.org')
        Login Successful
    """
    gnl = GlobusNativeLogin(host=host)
    if gnl.is_logged_in([host]):
        print("You are already logged in.")
    else:
        gnl.login(
            [host],
            no_local_server=True,
            no_browser=True,
            refresh_tokens=True,
            update_bdbag_keychain=True,
        )
        print("Login Successful")

instantiate classmethod

instantiate(
    config: DerivaMLConfig,
) -> Self

Create a DerivaML instance from a configuration object.

This method is the preferred way to instantiate DerivaML when using hydra-zen for configuration management. It accepts a DerivaMLConfig (Pydantic model) and unpacks it to create the instance.

This pattern allows hydra-zen's instantiate() to work with DerivaML:

Example with hydra-zen

```python
from hydra_zen import builds, instantiate
from deriva_ml import DerivaML
from deriva_ml.core.config import DerivaMLConfig

# Create a structured config using hydra-zen
DerivaMLConf = builds(DerivaMLConfig, populate_full_signature=True)

# Configure for your environment
conf = DerivaMLConf(
    hostname='deriva.example.org',
    catalog_id='42',
    domain_schema='my_domain',
)

# Instantiate the config to get a DerivaMLConfig object
config = instantiate(conf)

# Create the DerivaML instance
ml = DerivaML.instantiate(config)
```

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `config` | `DerivaMLConfig` | A DerivaMLConfig object containing all configuration parameters. | required |

Returns:

| Type | Description |
|------|-------------|
| `Self` | A new DerivaML instance configured according to the config object. |

Note

The DerivaMLConfig class integrates with Hydra's configuration system and registers custom resolvers for computing working directories. See deriva_ml.core.config for details on configuration options.

Source code in src/deriva_ml/core/base.py
@classmethod
def instantiate(cls, config: DerivaMLConfig) -> Self:
    """Create a DerivaML instance from a configuration object.

    This method is the preferred way to instantiate DerivaML when using hydra-zen
    for configuration management. It accepts a DerivaMLConfig (Pydantic model) and
    unpacks it to create the instance.

    This pattern allows hydra-zen's `instantiate()` to work with DerivaML:

    Example with hydra-zen:
        >>> from hydra_zen import builds, instantiate
        >>> from deriva_ml import DerivaML
        >>> from deriva_ml.core.config import DerivaMLConfig
        >>>
        >>> # Create a structured config using hydra-zen
        >>> DerivaMLConf = builds(DerivaMLConfig, populate_full_signature=True)
        >>>
        >>> # Configure for your environment
        >>> conf = DerivaMLConf(
        ...     hostname='deriva.example.org',
        ...     catalog_id='42',
        ...     domain_schema='my_domain',
        ... )
        >>>
        >>> # Instantiate the config to get a DerivaMLConfig object
        >>> config = instantiate(conf)
        >>>
        >>> # Create the DerivaML instance
        >>> ml = DerivaML.instantiate(config)

    Args:
        config: A DerivaMLConfig object containing all configuration parameters.

    Returns:
        A new DerivaML instance configured according to the config object.

    Note:
        The DerivaMLConfig class integrates with Hydra's configuration system
        and registers custom resolvers for computing working directories.
        See `deriva_ml.core.config` for details on configuration options.
    """
    return cls(**config.model_dump())
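Stripped to its essentials, `instantiate()` is keyword unpacking of a config object into the constructor. The sketch below uses toy stand-in classes (not the real DerivaML or DerivaMLConfig API) to show the mechanics: every config field becomes a keyword argument, so the config and the constructor signature must stay in sync.

```python
from dataclasses import dataclass, asdict

# Toy stand-ins (NOT the real DerivaML classes) illustrating the
# unpacking pattern behind DerivaML.instantiate().
@dataclass
class ToyConfig:
    hostname: str
    catalog_id: str

class ToyClient:
    def __init__(self, hostname: str, catalog_id: str):
        self.hostname = hostname
        self.catalog_id = catalog_id

    @classmethod
    def instantiate(cls, config: ToyConfig) -> "ToyClient":
        # Mirrors `cls(**config.model_dump())` for a Pydantic model;
        # asdict() plays the role of model_dump() for a dataclass.
        return cls(**asdict(config))

client = ToyClient.instantiate(ToyConfig("deriva.example.org", "42"))
print(client.hostname, client.catalog_id)  # → deriva.example.org 42
```

Because the unpacking is positional-by-name, adding a field to the config without a matching constructor parameter raises a TypeError at instantiation time, which is why the real DerivaMLConfig mirrors the DerivaML signature.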

is_snapshot

is_snapshot() -> bool

Check whether this DerivaML instance is connected to a catalog snapshot.

Returns:

| Type | Description |
|------|-------------|
| bool | True if the underlying catalog has a snapshot timestamp, False otherwise. |

Source code in src/deriva_ml/core/base.py
def is_snapshot(self) -> bool:
    """Check whether this DerivaML instance is connected to a catalog snapshot.

    Returns:
        True if the underlying catalog has a snapshot timestamp, False otherwise.
    """
    return hasattr(self.catalog, "_snaptime")

list_asset_executions

list_asset_executions(
    asset_rid: str,
    asset_role: str | None = None,
) -> list["ExecutionRecord"]

List all executions associated with an asset.

Given an asset RID, returns a list of executions that created or used the asset, along with the role (Input/Output) in each execution.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| asset_rid | str | The RID of the asset to look up. | required |
| asset_role | str \| None | Optional filter for asset role ('Input' or 'Output'). If None, returns all associations. | None |

Returns:

| Type | Description |
|------|-------------|
| list[ExecutionRecord] | List of ExecutionRecord objects for the executions associated with this asset. |

Raises:

| Type | Description |
|------|-------------|
| DerivaMLException | If the asset RID is not found or not an asset. |

Example:

    >>> # Find all executions that created this asset
    >>> executions = ml.list_asset_executions("1-abc123", asset_role="Output")
    >>> for exe in executions:
    ...     print(f"Created by execution {exe.execution_rid}")

    >>> # Find all executions that used this asset as input
    >>> executions = ml.list_asset_executions("1-abc123", asset_role="Input")

Source code in src/deriva_ml/core/mixins/asset.py
def list_asset_executions(
    self, asset_rid: str, asset_role: str | None = None
) -> list["ExecutionRecord"]:
    """List all executions associated with an asset.

    Given an asset RID, returns a list of executions that created or used
    the asset, along with the role (Input/Output) in each execution.

    Args:
        asset_rid: The RID of the asset to look up.
        asset_role: Optional filter for asset role ('Input' or 'Output').
            If None, returns all associations.

    Returns:
        list[ExecutionRecord]: List of ExecutionRecord objects for the
            executions associated with this asset.

    Raises:
        DerivaMLException: If the asset RID is not found or not an asset.

    Example:
        >>> # Find all executions that created this asset
        >>> executions = ml.list_asset_executions("1-abc123", asset_role="Output")
        >>> for exe in executions:
        ...     print(f"Created by execution {exe.execution_rid}")

        >>> # Find all executions that used this asset as input
        >>> executions = ml.list_asset_executions("1-abc123", asset_role="Input")
    """
    # Resolve the RID to find which asset table it belongs to
    rid_info = self.resolve_rid(asset_rid)  # type: ignore[attr-defined]
    asset_table = rid_info.table

    if not self.model.is_asset(asset_table):
        raise DerivaMLException(f"RID {asset_rid} is not an asset (table: {asset_table.name})")

    # Find the association table between this asset table and Execution
    asset_exe_table, asset_fk, execution_fk = self.model.find_association(asset_table, "Execution")

    # Build the query
    pb = self.pathBuilder()
    asset_exe_path = pb.schemas[asset_exe_table.schema.name].tables[asset_exe_table.name]

    # Filter by asset RID
    query = asset_exe_path.filter(asset_exe_path.columns[asset_fk] == asset_rid)

    # Optionally filter by asset role
    if asset_role:
        query = query.filter(asset_exe_path.Asset_Role == asset_role)

    # Convert to ExecutionRecord objects
    records = list(query.entities().fetch())
    return [self.lookup_execution(record["Execution"]) for record in records]  # type: ignore[attr-defined]

list_asset_tables

list_asset_tables() -> list[Table]

List all asset tables in the catalog.

Returns:

| Type | Description |
|------|-------------|
| list[Table] | List of Table objects that are asset tables. |

Example:

    >>> for table in ml.list_asset_tables():
    ...     print(f"Asset table: {table.name}")

Source code in src/deriva_ml/core/mixins/asset.py
def list_asset_tables(self) -> list[Table]:
    """List all asset tables in the catalog.

    Returns:
        List of Table objects that are asset tables.

    Example:
        >>> for table in ml.list_asset_tables():
        ...     print(f"Asset table: {table.name}")
    """
    tables = []
    # Include asset tables from all domain schemas
    for domain_schema in self.domain_schemas:
        if domain_schema in self.model.schemas:
            tables.extend([
                t for t in self.model.schemas[domain_schema].tables.values()
                if self.model.is_asset(t)
            ])
    # Also include ML schema asset tables (like Execution_Asset)
    tables.extend([
        t for t in self.model.schemas[self.ml_schema].tables.values()
        if self.model.is_asset(t)
    ])
    return tables

list_assets

list_assets(
    asset_table: Table | str,
) -> list["Asset"]

Lists contents of an asset table.

Returns a list of Asset objects for the specified asset table.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| asset_table | Table \| str | Table or name of the asset table to list assets for. | required |

Returns:

| Type | Description |
|------|-------------|
| list[Asset] | List of Asset objects for the assets in the table. |

Raises:

| Type | Description |
|------|-------------|
| DerivaMLException | If the table is not an asset table or doesn't exist. |

Example:

    >>> assets = ml.list_assets("Image")
    >>> for asset in assets:
    ...     print(f"{asset.asset_rid}: {asset.filename}")

Source code in src/deriva_ml/core/mixins/asset.py
def list_assets(self, asset_table: Table | str) -> list["Asset"]:
    """Lists contents of an asset table.

    Returns a list of Asset objects for the specified asset table.

    Args:
        asset_table: Table or name of the asset table to list assets for.

    Returns:
        list[Asset]: List of Asset objects for the assets in the table.

    Raises:
        DerivaMLException: If the table is not an asset table or doesn't exist.

    Example:
        >>> assets = ml.list_assets("Image")
        >>> for asset in assets:
        ...     print(f"{asset.asset_rid}: {asset.filename}")
    """
    from deriva_ml.asset.asset import Asset

    # Validate and get asset table reference
    asset_table_obj = self.model.name_to_table(asset_table)
    if not self.model.is_asset(asset_table_obj):
        raise DerivaMLException(f"Table {asset_table_obj.name} is not an asset")

    # Get path builders for asset and type tables
    pb = self.pathBuilder()
    asset_path = pb.schemas[asset_table_obj.schema.name].tables[asset_table_obj.name]
    (
        asset_type_table,
        _,
        _,
    ) = self.model.find_association(asset_table_obj, MLVocab.asset_type)
    type_path = pb.schemas[asset_type_table.schema.name].tables[asset_type_table.name]

    # Build a list of Asset objects
    assets = []
    for asset_record in asset_path.entities().fetch():
        # Get associated asset types for each asset
        asset_types = (
            type_path.filter(type_path.columns[asset_table_obj.name] == asset_record["RID"])
            .attributes(type_path.Asset_Type)
            .fetch()
        )
        asset_type_list = [asset_type[MLVocab.asset_type.value] for asset_type in asset_types]

        assets.append(Asset(
            catalog=self,  # type: ignore[arg-type]
            asset_rid=asset_record["RID"],
            asset_table=asset_table_obj.name,
            filename=asset_record.get("Filename", ""),
            url=asset_record.get("URL", ""),
            length=asset_record.get("Length", 0),
            md5=asset_record.get("MD5", ""),
            description=asset_record.get("Description", ""),
            asset_types=asset_type_list,
        ))
    return assets
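Since each returned Asset carries its list of asset types, filtering the results client-side is a one-liner. The sketch below runs on hypothetical records shaped like `list_assets()` output (SimpleNamespace stand-ins with only the fields used here, not the real Asset class):

```python
from types import SimpleNamespace

# Hypothetical records shaped like list_assets() output (only the
# fields used here; not the real Asset class).
assets = [
    SimpleNamespace(asset_rid="1-a", filename="scan1.png", asset_types=["image", "png"]),
    SimpleNamespace(asset_rid="1-b", filename="notes.txt", asset_types=["text"]),
]

def filter_by_type(assets, wanted: str):
    """Keep only assets tagged with the given asset type."""
    return [a for a in assets if wanted in a.asset_types]

images = filter_by_type(assets, "image")
print([a.filename for a in images])  # → ['scan1.png']
```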

list_dataset_element_types

list_dataset_element_types() -> Iterable[Table]

List the types of entities that can be added to a dataset.

Returns:

| Type | Description |
|------|-------------|
| Iterable[Table] | An iterable of Table objects that can be included as an element of a dataset. |

Source code in src/deriva_ml/core/mixins/dataset.py
def list_dataset_element_types(self) -> Iterable[Table]:
    """List the types of entities that can be added to a dataset.

    Returns:
        An iterable of Table objects that can be included as an element of a dataset.
    """

    def is_domain_or_dataset_table(table: Table) -> bool:
        return self.model.is_domain_schema(table.schema.name) or table.name == self._dataset_table.name

    return [t for a in self._dataset_table.find_associations() if is_domain_or_dataset_table(t := a.other_fkeys.pop().pk_table)]

list_execution_dirs

list_execution_dirs() -> list[dict[str, any]]

List execution working directories.

Returns information about each execution directory in the working directory, useful for identifying orphaned or incomplete execution outputs.

Returns:

| Type | Description |
|------|-------------|
| list[dict[str, any]] | List of dicts, each containing: 'execution_rid' (the execution RID, i.e. the directory name), 'path' (full path to the directory), 'size_bytes' (total size in bytes), 'size_mb' (total size in megabytes), 'modified' (last modification time, datetime), and 'file_count' (number of files). |

Example:

    >>> ml = DerivaML('deriva.example.org', 'my_catalog')
    >>> dirs = ml.list_execution_dirs()
    >>> for d in dirs:
    ...     print(f"{d['execution_rid']}: {d['size_mb']:.1f} MB")

Source code in src/deriva_ml/core/base.py
def list_execution_dirs(self) -> list[dict[str, any]]:
    """List execution working directories.

    Returns information about each execution directory in the working directory,
    useful for identifying orphaned or incomplete execution outputs.

    Returns:
        List of dicts, each containing:
            - 'execution_rid': The execution RID (directory name)
            - 'path': Full path to the directory
            - 'size_bytes': Total size in bytes
            - 'size_mb': Total size in megabytes
            - 'modified': Last modification time (datetime)
            - 'file_count': Number of files

    Example:
        >>> ml = DerivaML('deriva.example.org', 'my_catalog')
        >>> dirs = ml.list_execution_dirs()
        >>> for d in dirs:
        ...     print(f"{d['execution_rid']}: {d['size_mb']:.1f} MB")
    """
    from datetime import datetime

    from deriva_ml.dataset.upload import upload_root

    results = []
    exec_root = upload_root(self.working_dir) / "execution"

    if not exec_root.exists():
        return results

    for entry in exec_root.iterdir():
        if entry.is_dir():
            size_bytes = sum(f.stat().st_size for f in entry.rglob('*') if f.is_file())
            file_count = sum(1 for f in entry.rglob('*') if f.is_file())
            mtime = datetime.fromtimestamp(entry.stat().st_mtime)

            results.append({
                'execution_rid': entry.name,
                'path': str(entry),
                'size_bytes': size_bytes,
                'size_mb': size_bytes / (1024 * 1024),
                'modified': mtime,
                'file_count': file_count,
            })

    return sorted(results, key=lambda x: x['modified'], reverse=True)
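A common follow-up is flagging stale execution directories for cleanup. This sketch operates on sample dicts in the documented return shape (real entries also carry 'path', 'size_bytes', and 'file_count'); the 30-day cutoff is an arbitrary example threshold:

```python
from datetime import datetime, timedelta

# Sample records shaped like list_execution_dirs() output.
now = datetime(2024, 6, 1)
dirs = [
    {"execution_rid": "1-x", "size_mb": 120.0, "modified": now - timedelta(days=2)},
    {"execution_rid": "1-y", "size_mb": 4.5, "modified": now - timedelta(days=40)},
]

# Flag directories not touched in the last 30 days as pruning candidates,
# and report the total footprint.
cutoff = now - timedelta(days=30)
stale = [d["execution_rid"] for d in dirs if d["modified"] < cutoff]
total_mb = sum(d["size_mb"] for d in dirs)
print(stale, f"{total_mb:.1f} MB")  # → ['1-y'] 124.5 MB
```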

list_feature_values

list_feature_values(
    table: Table | str,
    feature_name: str,
    selector: Callable[
        [list[FeatureRecord]],
        FeatureRecord,
    ]
    | None = None,
) -> Iterable[FeatureRecord]

Retrieve all values for a single feature as typed FeatureRecord instances.

Convenience wrapper around fetch_table_features() for the common case of querying a single feature by name. Returns a flat list of FeatureRecord objects — one per feature value (or one per target object when a selector is provided).

Each returned record is a dynamically-generated Pydantic model with typed fields matching the feature's definition. For example, an Image_Classification feature might produce records with fields Image (str), Image_Class (str), Execution (str), RCT (str), and Feature_Name (str).

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| table | Table \| str | The table the feature is defined on (name or Table object). | required |
| feature_name | str | Name of the feature to retrieve values for. | required |
| selector | Callable[[list[FeatureRecord]], FeatureRecord] \| None | Optional function to resolve multiple values per target. See fetch_table_features for details on how selectors work. Use FeatureRecord.select_newest to pick the most recently created value. | None |

Returns:

Iterable[FeatureRecord]: FeatureRecord instances with:

- Execution: RID of the execution that created this value
- Feature_Name: Name of the feature
- RCT: Row Creation Time (ISO 8601 timestamp)
- Feature-specific columns as typed attributes (vocabulary terms, asset references, or value columns depending on the feature)
- model_dump(): Convert to a dictionary

Raises:

| Type | Description |
|------|-------------|
| DerivaMLException | If the feature doesn't exist on the table. |

Examples:

Get typed feature records::

>>> for record in ml.list_feature_values("Image", "Quality"):
...     print(f"Image {record.Image}: {record.ImageQuality}")
...     print(f"Created by execution: {record.Execution}")

Select newest when multiple values exist::

>>> records = list(ml.list_feature_values(
...     "Image", "Quality",
...     selector=FeatureRecord.select_newest,
... ))

Convert to a list of dicts::

>>> dicts = [r.model_dump() for r in
...          ml.list_feature_values("Image", "Classification")]
Source code in src/deriva_ml/core/mixins/feature.py
@validate_call(config=ConfigDict(arbitrary_types_allowed=True))
def list_feature_values(
    self,
    table: Table | str,
    feature_name: str,
    selector: Callable[[list[FeatureRecord]], FeatureRecord] | None = None,
) -> Iterable[FeatureRecord]:
    """Retrieve all values for a single feature as typed FeatureRecord instances.

    Convenience wrapper around ``fetch_table_features()`` for the common
    case of querying a single feature by name. Returns a flat list of
    FeatureRecord objects — one per feature value (or one per target object
    when a ``selector`` is provided).

    Each returned record is a dynamically-generated Pydantic model with
    typed fields matching the feature's definition. For example, an
    ``Image_Classification`` feature might produce records with fields
    ``Image`` (str), ``Image_Class`` (str), ``Execution`` (str),
    ``RCT`` (str), and ``Feature_Name`` (str).

    Args:
        table: The table the feature is defined on (name or Table object).
        feature_name: Name of the feature to retrieve values for.
        selector: Optional function to resolve multiple values per target.
            See ``fetch_table_features`` for details on how selectors work.
            Use ``FeatureRecord.select_newest`` to pick the most recently
            created value.

    Returns:
        Iterable[FeatureRecord]: FeatureRecord instances with:

        - ``Execution``: RID of the execution that created this value
        - ``Feature_Name``: Name of the feature
        - ``RCT``: Row Creation Time (ISO 8601 timestamp)
        - Feature-specific columns as typed attributes (vocabulary terms,
          asset references, or value columns depending on the feature)
        - ``model_dump()``: Convert to a dictionary

    Raises:
        DerivaMLException: If the feature doesn't exist on the table.

    Examples:
        Get typed feature records::

            >>> for record in ml.list_feature_values("Image", "Quality"):
            ...     print(f"Image {record.Image}: {record.ImageQuality}")
            ...     print(f"Created by execution: {record.Execution}")

        Select newest when multiple values exist::

            >>> records = list(ml.list_feature_values(
            ...     "Image", "Quality",
            ...     selector=FeatureRecord.select_newest,
            ... ))

        Convert to a list of dicts::

            >>> dicts = [r.model_dump() for r in
            ...          ml.list_feature_values("Image", "Classification")]
    """
    result = self.fetch_table_features(table, feature_name=feature_name, selector=selector)
    return result.get(feature_name, [])

list_files

list_files(
    file_types: list[str] | None = None,
) -> list[dict[str, Any]]

Lists files in the catalog with their metadata.

Returns a list of files with their metadata including URL, MD5 hash, length, description, and associated file types. Files can be optionally filtered by type.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| file_types | list[str] \| None | Filter results to only include these file types. | None |

Returns:

| Type | Description |
|------|-------------|
| list[dict[str, Any]] | List of file records, each containing: RID (resource identifier), URL (file location), MD5 (file hash), Length (file size), Description (file description), and File_Types (list of associated file types). |

Examples:

    List all files:

        >>> files = ml.list_files()
        >>> for f in files:
        ...     print(f"{f['RID']}: {f['URL']}")

    Filter by file type:

        >>> image_files = ml.list_files(["image", "png"])

Source code in src/deriva_ml/core/mixins/file.py
def list_files(self, file_types: list[str] | None = None) -> list[dict[str, Any]]:
    """Lists files in the catalog with their metadata.

    Returns a list of files with their metadata including URL, MD5 hash, length, description,
    and associated file types. Files can be optionally filtered by type.

    Args:
        file_types: Filter results to only include these file types.

    Returns:
        list[dict[str, Any]]: List of file records, each containing:
            - RID: Resource identifier
            - URL: File location
            - MD5: File hash
            - Length: File size
            - Description: File description
            - File_Types: List of associated file types

    Examples:
        List all files:
            >>> files = ml.list_files()
            >>> for f in files:
            ...     print(f"{f['RID']}: {f['URL']}")

        Filter by file type:
            >>> image_files = ml.list_files(["image", "png"])
    """
    asset_type_atable, file_fk, asset_type_fk = self.model.find_association("File", "Asset_Type")
    ml_path = self.pathBuilder().schemas[self.ml_schema]
    file = ml_path.File
    asset_type = ml_path.tables[asset_type_atable.name]

    path = file.path
    path = path.link(asset_type.alias("AT"), on=file.RID == asset_type.columns[file_fk], join_type="left")
    if file_types:
        path = path.filter(asset_type.columns[asset_type_fk] == datapath.Any(*file_types))
    path = path.attributes(
        path.File.RID,
        path.File.URL,
        path.File.MD5,
        path.File.Length,
        path.File.Description,
        path.AT.columns[asset_type_fk],
    )

    file_map = {}
    for f in path.fetch():
        entry = file_map.setdefault(f["RID"], {**f, "File_Types": []})
        if ft := f.get("Asset_Type"):  # assign-and-test in one go
            entry["File_Types"].append(ft)

    # Drop the per-row Asset_Type key; its values were collected into File_Types
    return [(f, f.pop("Asset_Type"))[0] for f in file_map.values()]
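Because File_Types is a list, one file can appear under several types; building an inverted index over the result is a natural next step. This sketch runs on sample records in the documented return shape, not a live catalog query:

```python
from collections import defaultdict

# Sample records shaped like list_files() output.
files = [
    {"RID": "1-a", "URL": "/hatrac/a.png", "File_Types": ["image", "png"]},
    {"RID": "1-b", "URL": "/hatrac/b.csv", "File_Types": ["table"]},
]

# Invert the records: file type -> list of RIDs tagged with that type.
by_type: dict[str, list[str]] = defaultdict(list)
for f in files:
    for ft in f["File_Types"]:
        by_type[ft].append(f["RID"])

print(dict(by_type))  # → {'image': ['1-a'], 'png': ['1-a'], 'table': ['1-b']}
```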

list_foreign_keys

list_foreign_keys(
    table: str | Table,
) -> dict[str, Any]

List all foreign keys related to a table.

Returns both outbound foreign keys (from this table to others) and inbound foreign keys (from other tables to this one). Useful for determining valid constraint names for visible-columns and visible-foreign-keys annotations.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| table | str \| Table | Table name or Table object. | required |

Returns:

Dictionary with:

- table: Table name
- outbound: List of outbound foreign keys
- inbound: List of inbound foreign keys

Each foreign key contains constraint_name, from_table, from_columns, to_table, to_columns.

Example:

    >>> fkeys = ml.list_foreign_keys("Image")
    >>> for fk in fkeys["outbound"]:
    ...     print(f"{fk['constraint_name']} -> {fk['to_table']}")

Source code in src/deriva_ml/core/mixins/annotation.py
@validate_call(config=ConfigDict(arbitrary_types_allowed=True))
def list_foreign_keys(self, table: str | Table) -> dict[str, Any]:
    """List all foreign keys related to a table.

    Returns both outbound foreign keys (from this table to others) and
    inbound foreign keys (from other tables to this one). Useful for
    determining valid constraint names for visible-columns and
    visible-foreign-keys annotations.

    Args:
        table: Table name or Table object.

    Returns:
        Dictionary with:
        - table: Table name
        - outbound: List of outbound foreign keys
        - inbound: List of inbound foreign keys
        Each foreign key contains constraint_name, from_table, from_columns,
        to_table, to_columns.

    Example:
        >>> fkeys = ml.list_foreign_keys("Image")
        >>> for fk in fkeys["outbound"]:
        ...     print(f"{fk['constraint_name']} -> {fk['to_table']}")
    """
    table_obj = self.model.name_to_table(table)

    outbound = []
    for fkey in table_obj.foreign_keys:
        outbound.append({
            "constraint_name": [fkey.constraint_schema.name, fkey.constraint_name],
            "from_table": table_obj.name,
            "from_columns": [col.name for col in fkey.columns],
            "to_table": fkey.pk_table.name,
            "to_columns": [col.name for col in fkey.referenced_columns],
        })

    inbound = []
    for fkey in table_obj.referenced_by:
        inbound.append({
            "constraint_name": [fkey.constraint_schema.name, fkey.constraint_name],
            "from_table": fkey.table.name,
            "from_columns": [col.name for col in fkey.columns],
            "to_table": table_obj.name,
            "to_columns": [col.name for col in fkey.referenced_columns],
        })

    return {
        "table": table_obj.name,
        "outbound": outbound,
        "inbound": inbound,
    }
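The returned dictionary is easy to render into a one-line-per-key report. The sketch below formats a hand-built sample in the documented shape (the constraint and table names are hypothetical):

```python
# Hand-built sample in the shape documented for list_foreign_keys();
# the constraint and table names are hypothetical.
fkeys = {
    "table": "Image",
    "outbound": [{
        "constraint_name": ["domain", "Image_Subject_fkey"],
        "from_columns": ["Subject"],
        "to_table": "Subject",
        "to_columns": ["RID"],
    }],
    "inbound": [],
}

# One line per outbound key: schema/constraint: Table(cols) -> Table(cols)
lines = [
    f"{'/'.join(fk['constraint_name'])}: "
    f"{fkeys['table']}({', '.join(fk['from_columns'])}) -> "
    f"{fk['to_table']}({', '.join(fk['to_columns'])})"
    for fk in fkeys["outbound"]
]
print(lines[0])  # → domain/Image_Subject_fkey: Image(Subject) -> Subject(RID)
```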

list_vocabulary_terms

list_vocabulary_terms(
    table: str | Table,
) -> list[VocabularyTerm]

Lists all terms in a vocabulary table.

Retrieves all terms, their descriptions, and synonyms from a controlled vocabulary table.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| table | str \| Table | Vocabulary table to list terms from (name or Table object). | required |

Returns:

| Type | Description |
|------|-------------|
| list[VocabularyTerm] | List of vocabulary terms with their metadata. |

Raises:

| Type | Description |
|------|-------------|
| DerivaMLException | If table doesn't exist or is not a vocabulary table. |

Examples:

>>> terms = ml.list_vocabulary_terms("tissue_types")
>>> for term in terms:
...     print(f"{term.name}: {term.description}")
...     if term.synonyms:
...         print(f"  Synonyms: {', '.join(term.synonyms)}")
Source code in src/deriva_ml/core/mixins/vocabulary.py
def list_vocabulary_terms(self, table: str | Table) -> list[VocabularyTerm]:
    """Lists all terms in a vocabulary table.

    Retrieves all terms, their descriptions, and synonyms from a controlled vocabulary table.

    Args:
        table: Vocabulary table to list terms from (name or Table object).

    Returns:
        list[VocabularyTerm]: List of vocabulary terms with their metadata.

    Raises:
        DerivaMLException: If table doesn't exist or is not a vocabulary table.

    Examples:
        >>> terms = ml.list_vocabulary_terms("tissue_types")
        >>> for term in terms:
        ...     print(f"{term.name}: {term.description}")
        ...     if term.synonyms:
        ...         print(f"  Synonyms: {', '.join(term.synonyms)}")
    """
    # Get path builder and table reference
    pb = self.pathBuilder()
    table = self.model.name_to_table(table.value if isinstance(table, MLVocab) else table)

    # Validate table is a vocabulary table
    if not (self.model.is_vocabulary(table)):
        raise DerivaMLException(f"The table {table} is not a controlled vocabulary")

    # Fetch and convert all terms to VocabularyTerm objects
    return [VocabularyTerm(**v) for v in pb.schemas[table.schema.name].tables[table.name].entities().fetch()]
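Since each term carries its synonyms, a common client-side step is building a case-insensitive map from every name or synonym back to the canonical term name. This sketch uses SimpleNamespace stand-ins with only the fields used here, not the real VocabularyTerm class:

```python
from types import SimpleNamespace

# Stand-ins shaped like list_vocabulary_terms() results (only the
# fields used here; not the real VocabularyTerm class).
terms = [
    SimpleNamespace(name="Liver", synonyms=["hepatic tissue"]),
    SimpleNamespace(name="Lung", synonyms=[]),
]

# Map every name and synonym (lowercased) to the canonical term name,
# a useful normalization step for user-supplied input.
canonical = {}
for t in terms:
    canonical[t.name.lower()] = t.name
    for s in t.synonyms:
        canonical[s.lower()] = t.name

print(canonical["hepatic tissue"])  # → Liver
```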

lookup_asset

lookup_asset(asset_rid: RID) -> 'Asset'

Look up an asset by its RID.

Returns an Asset object for the specified RID. The asset can be from any asset table in the catalog.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| asset_rid | RID | The RID of the asset to look up. | required |

Returns:

| Type | Description |
|------|-------------|
| Asset | Asset object for the specified RID. |

Raises:

| Type | Description |
|------|-------------|
| DerivaMLException | If the RID is not found or is not an asset. |

Example:

    >>> asset = ml.lookup_asset("3JSE")
    >>> print(f"File: {asset.filename}, Table: {asset.asset_table}")

Source code in src/deriva_ml/core/mixins/asset.py
def lookup_asset(self, asset_rid: RID) -> "Asset":
    """Look up an asset by its RID.

    Returns an Asset object for the specified RID. The asset can be from
    any asset table in the catalog.

    Args:
        asset_rid: The RID of the asset to look up.

    Returns:
        Asset object for the specified RID.

    Raises:
        DerivaMLException: If the RID is not found or is not an asset.

    Example:
        >>> asset = ml.lookup_asset("3JSE")
        >>> print(f"File: {asset.filename}, Table: {asset.asset_table}")
    """
    from deriva_ml.asset.asset import Asset

    # Resolve the RID to find which table it belongs to
    rid_info = self.resolve_rid(asset_rid)  # type: ignore[attr-defined]
    asset_table = rid_info.table

    if not self.model.is_asset(asset_table):
        raise DerivaMLException(f"RID {asset_rid} is not an asset (table: {asset_table.name})")

    # Query the asset table for this record
    pb = self.pathBuilder()
    asset_path = pb.schemas[asset_table.schema.name].tables[asset_table.name]

    records = list(asset_path.filter(asset_path.RID == asset_rid).entities().fetch())
    if not records:
        raise DerivaMLException(f"Asset {asset_rid} not found in table {asset_table.name}")

    record = records[0]

    # Get asset types
    asset_types = []
    try:
        type_assoc_table, asset_fk, _ = self.model.find_association(asset_table, "Asset_Type")
        type_path = pb.schemas[type_assoc_table.schema.name].tables[type_assoc_table.name]
        types = list(
            type_path.filter(type_path.columns[asset_fk] == asset_rid)
            .attributes(type_path.Asset_Type)
            .fetch()
        )
        asset_types = [t["Asset_Type"] for t in types]
    except Exception:
        pass  # No type association for this asset table

    return Asset(
        catalog=self,  # type: ignore[arg-type]
        asset_rid=asset_rid,
        asset_table=asset_table.name,
        filename=record.get("Filename", ""),
        url=record.get("URL", ""),
        length=record.get("Length", 0),
        md5=record.get("MD5", ""),
        description=record.get("Description", ""),
        asset_types=asset_types,
    )

lookup_dataset

lookup_dataset(
    dataset: RID | DatasetSpec,
    deleted: bool = False,
) -> "Dataset"

Look up a dataset by RID or DatasetSpec.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| dataset | RID \| DatasetSpec | Dataset RID or DatasetSpec to look up. | required |
| deleted | bool | If True, include datasets that have been marked as deleted. | False |

Returns:

| Name | Type | Description |
|------|------|-------------|
| Dataset | 'Dataset' | The dataset object for the specified RID. |

Raises:

| Type | Description |
|------|-------------|
| DerivaMLException | If the dataset is not found. |

Example:

    >>> dataset = ml.lookup_dataset("4HM")
    >>> print(f"Version: {dataset.current_version}")

Source code in src/deriva_ml/core/mixins/dataset.py
def lookup_dataset(self, dataset: RID | DatasetSpec, deleted: bool = False) -> "Dataset":
    """Look up a dataset by RID or DatasetSpec.

    Args:
        dataset: Dataset RID or DatasetSpec to look up.
        deleted: If True, include datasets that have been marked as deleted.

    Returns:
        Dataset: The dataset object for the specified RID.

    Raises:
        DerivaMLException: If the dataset is not found.

    Example:
        >>> dataset = ml.lookup_dataset("4HM")
        >>> print(f"Version: {dataset.current_version}")
    """
    if isinstance(dataset, DatasetSpec):
        dataset_rid = dataset.rid
    else:
        dataset_rid = dataset

    try:
        return [ds for ds in self.find_datasets(deleted=deleted) if ds.dataset_rid == dataset_rid][0]
    except IndexError:
        raise DerivaMLException(f"Dataset {dataset_rid} not found.")

lookup_execution

lookup_execution(
    execution_rid: RID,
) -> "ExecutionRecord"

Look up an execution by RID and return an ExecutionRecord.

Creates an ExecutionRecord object for querying and modifying execution metadata. The ExecutionRecord provides access to the catalog record state and allows updating mutable properties like status and description.

For running computations with datasets and assets, use restore_execution() or create_execution() which return full Execution objects.

Parameters:

Name Type Description Default
execution_rid RID

Resource Identifier (RID) of the execution.

required

Returns:

Name Type Description
ExecutionRecord 'ExecutionRecord'

An execution record object bound to the catalog.

Raises:

Type Description
DerivaMLException

If execution_rid is not valid or doesn't refer to an Execution record.

Example

Look up an execution and query its state::

>>> record = ml.lookup_execution("1-abc123")
>>> print(f"Status: {record.status}")
>>> print(f"Description: {record.description}")

Update mutable properties::

>>> record.status = Status.completed
>>> record.description = "Analysis finished"

Query relationships::

>>> children = list(record.list_nested_executions())
>>> parents = list(record.list_parent_executions())
Source code in src/deriva_ml/core/mixins/execution.py
def lookup_execution(self, execution_rid: RID) -> "ExecutionRecord":
    """Look up an execution by RID and return an ExecutionRecord.

    Creates an ExecutionRecord object for querying and modifying execution
    metadata. The ExecutionRecord provides access to the catalog record
    state and allows updating mutable properties like status and description.

    For running computations with datasets and assets, use ``restore_execution()``
    or ``create_execution()`` which return full Execution objects.

    Args:
        execution_rid: Resource Identifier (RID) of the execution.

    Returns:
        ExecutionRecord: An execution record object bound to the catalog.

    Raises:
        DerivaMLException: If execution_rid is not valid or doesn't refer
            to an Execution record.

    Example:
        Look up an execution and query its state::

            >>> record = ml.lookup_execution("1-abc123")
            >>> print(f"Status: {record.status}")
            >>> print(f"Description: {record.description}")

        Update mutable properties::

            >>> record.status = Status.completed
            >>> record.description = "Analysis finished"

        Query relationships::

            >>> children = list(record.list_nested_executions())
            >>> parents = list(record.list_parent_executions())
    """
    # Import here to avoid circular dependency
    from deriva_ml.execution.execution_record import ExecutionRecord

    # Get execution record from catalog and verify it's an Execution
    resolved = self.resolve_rid(execution_rid)
    if resolved.table.name != "Execution":
        raise DerivaMLException(
            f"RID '{execution_rid}' refers to a {resolved.table.name}, not an Execution"
        )

    execution_data = self.retrieve_rid(execution_rid)

    # Parse timestamps if present
    start_time = None
    stop_time = None
    if execution_data.get("Start"):
        from datetime import datetime
        try:
            start_time = datetime.fromisoformat(execution_data["Start"].replace("Z", "+00:00"))
        except (ValueError, AttributeError):
            pass
    if execution_data.get("Stop"):
        from datetime import datetime
        try:
            stop_time = datetime.fromisoformat(execution_data["Stop"].replace("Z", "+00:00"))
        except (ValueError, AttributeError):
            pass

    # Look up the workflow if present
    workflow_rid = execution_data.get("Workflow")
    workflow = self.lookup_workflow(workflow_rid) if workflow_rid else None

    # Create ExecutionRecord bound to this catalog
    record = ExecutionRecord(
        execution_rid=execution_rid,
        workflow=workflow,
        status=Status(execution_data.get("Status", "Created")),
        description=execution_data.get("Description"),
        start_time=start_time,
        stop_time=stop_time,
        duration=execution_data.get("Duration"),
        _ml_instance=self,
        _logger=getattr(self, "_logger", None),
    )

    return record
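The timestamp handling above can be seen in isolation: catalog values may end in a literal `Z`, which `datetime.fromisoformat()` rejects on Python versions before 3.11, so it is rewritten as an explicit `+00:00` offset. `parse_catalog_timestamp` is a hypothetical helper mirroring that logic, not part of the DerivaML API:

```python
from datetime import datetime, timezone

def parse_catalog_timestamp(value):
    # Normalize a trailing "Z" to "+00:00" so fromisoformat() accepts it.
    try:
        return datetime.fromisoformat(value.replace("Z", "+00:00"))
    except (ValueError, AttributeError):
        return None  # mirrors the method's silent fallback when Start/Stop is absent or malformed

start = parse_catalog_timestamp("2023-01-15T10:30:00Z")
```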

lookup_experiment

lookup_experiment(
    execution_rid: RID,
) -> "Experiment"

Look up an experiment by execution RID.

Creates an Experiment object for analyzing completed executions. Provides convenient access to execution metadata, configuration choices, model parameters, inputs, and outputs.

Parameters:

Name Type Description Default
execution_rid RID

Resource Identifier (RID) of the execution.

required

Returns:

Name Type Description
Experiment 'Experiment'

An experiment object for the given execution RID.

Example

>>> exp = ml.lookup_experiment("47BE")
>>> print(exp.name)  # e.g., "cifar10_quick"
>>> print(exp.config_choices)  # Hydra config names used
>>> print(exp.model_config)  # Model hyperparameters

Source code in src/deriva_ml/core/mixins/execution.py
def lookup_experiment(self, execution_rid: RID) -> "Experiment":
    """Look up an experiment by execution RID.

    Creates an Experiment object for analyzing completed executions.
    Provides convenient access to execution metadata, configuration choices,
    model parameters, inputs, and outputs.

    Args:
        execution_rid: Resource Identifier (RID) of the execution.

    Returns:
        Experiment: An experiment object for the given execution RID.

    Example:
        >>> exp = ml.lookup_experiment("47BE")
        >>> print(exp.name)  # e.g., "cifar10_quick"
        >>> print(exp.config_choices)  # Hydra config names used
        >>> print(exp.model_config)  # Model hyperparameters
    """
    from deriva_ml.experiment import Experiment

    return Experiment(self, execution_rid)  # type: ignore[arg-type]

lookup_feature

lookup_feature(
    table: str | Table,
    feature_name: str,
) -> Feature

Look up a feature definition by table and name.

Returns a Feature object that describes the schema structure of a feature, not the feature values themselves. A Feature is a schema-level descriptor derived by inspecting the catalog's association tables. It tells you:

  • What table the feature annotates (target_table), e.g., Image
  • Where values are stored (feature_table): the association table linking targets to values and executions
  • What kind of values it holds, classified by column role:
    • term_columns: columns referencing controlled vocabulary tables (e.g., a Diagnosis_Type column pointing to a vocabulary of diagnosis terms)
    • asset_columns: columns referencing asset tables (e.g., a Segmentation_Mask column)
    • value_columns: columns holding direct values like floats, ints, or text (e.g., a confidence score)

The Feature object also provides feature_record_class(), which returns a dynamically generated Pydantic model for constructing validated feature records to insert into the catalog.

To retrieve actual feature values, use fetch_table_features or list_feature_values instead.

Parameters:

Name Type Description Default
table str | Table

The table the feature is defined on (name or Table object).

required
feature_name str

Name of the feature to look up.

required

Returns:

Type Description
Feature

A Feature schema descriptor.

Raises:

Type Description
DerivaMLException

If the feature doesn't exist on the specified table.

Example

>>> feature = ml.lookup_feature("Image", "Classification")
>>> print(f"Feature: {feature.feature_name}")
>>> print(f"Stored in: {feature.feature_table.name}")
>>> print(f"Term columns: {[c.name for c in feature.term_columns]}")
>>> print(f"Value columns: {[c.name for c in feature.value_columns]}")

Source code in src/deriva_ml/core/mixins/feature.py
def lookup_feature(self, table: str | Table, feature_name: str) -> Feature:
    """Look up a feature definition by table and name.

    Returns a ``Feature`` object that describes the **schema structure**
    of a feature — not the feature values themselves. A Feature is a
    schema-level descriptor derived by inspecting the catalog's
    association tables. It tells you:

    - **What table the feature annotates** (``target_table``) — e.g., Image
    - **Where values are stored** (``feature_table``) — the association
      table linking targets to values and executions
    - **What kind of values it holds**, classified by column role:

      - ``term_columns``: columns referencing controlled vocabulary
        tables (e.g., a ``Diagnosis_Type`` column pointing to a
        vocabulary of diagnosis terms)
      - ``asset_columns``: columns referencing asset tables (e.g., a
        ``Segmentation_Mask`` column)
      - ``value_columns``: columns holding direct values like floats,
        ints, or text (e.g., a ``confidence`` score)

    The Feature object also provides ``feature_record_class()``, which
    returns a dynamically generated Pydantic model for constructing
    validated feature records to insert into the catalog.

    To retrieve actual feature **values**, use ``fetch_table_features``
    or ``list_feature_values`` instead.

    Args:
        table: The table the feature is defined on (name or Table object).
        feature_name: Name of the feature to look up.

    Returns:
        A Feature schema descriptor.

    Raises:
        DerivaMLException: If the feature doesn't exist on the specified
            table.

    Example:
        >>> feature = ml.lookup_feature("Image", "Classification")
        >>> print(f"Feature: {feature.feature_name}")
        >>> print(f"Stored in: {feature.feature_table.name}")
        >>> print(f"Term columns: {[c.name for c in feature.term_columns]}")
        >>> print(f"Value columns: {[c.name for c in feature.value_columns]}")
    """
    return self.model.lookup_feature(table, feature_name)
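The column-role partition described above can be sketched with plain data. Here the kind of each column is declared by hand; the real implementation derives it by inspecting foreign keys on the association table, so `classify` and the hand-written `columns` list are illustrative only:

```python
# Hypothetical sketch: partition feature columns into the three role
# buckets a Feature exposes (term_columns, asset_columns, value_columns).
columns = [
    ("Diagnosis_Type", "vocabulary"),   # references a vocabulary table
    ("Segmentation_Mask", "asset"),     # references an asset table
    ("confidence", "value"),            # holds a direct scalar value
]

ROLE_FOR_KIND = {
    "vocabulary": "term_columns",
    "asset": "asset_columns",
    "value": "value_columns",
}

def classify(cols):
    roles = {"term_columns": [], "asset_columns": [], "value_columns": []}
    for name, kind in cols:
        roles[ROLE_FOR_KIND[kind]].append(name)
    return roles

roles = classify(columns)
```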

lookup_term

lookup_term(
    table: str | Table, term_name: str
) -> VocabularyTermHandle

Finds a term in a vocabulary table.

Searches for a term in the specified vocabulary table, matching either the primary name or any of its synonyms. Results are cached for performance: subsequent lookups in the same vocabulary table are served from the cache.

Parameters:

Name Type Description Default
table str | Table

Vocabulary table to search in (name or Table object).

required
term_name str

Name or synonym of the term to find.

required

Returns:

Name Type Description
VocabularyTermHandle VocabularyTermHandle

The matching vocabulary term, with methods to modify it.

Raises:

Type Description
DerivaMLVocabularyException

If the table is not a vocabulary table, or term is not found.

Examples:

Look up by primary name:

    >>> term = ml.lookup_term("tissue_types", "epithelial")
    >>> print(term.description)

Look up by synonym:

    >>> term = ml.lookup_term("tissue_types", "epithelium")

Modify the term:

    >>> term = ml.lookup_term("tissue_types", "epithelial")
    >>> term.description = "Updated description"
    >>> term.synonyms = ("epithelium", "epithelial_tissue")

Source code in src/deriva_ml/core/mixins/vocabulary.py
@validate_call(config=ConfigDict(arbitrary_types_allowed=True))
def lookup_term(self, table: str | Table, term_name: str) -> VocabularyTermHandle:
    """Finds a term in a vocabulary table.

    Searches for a term in the specified vocabulary table, matching either the primary name
    or any of its synonyms. Results are cached for performance - subsequent lookups in the
    same vocabulary table are served from cache.

    Args:
        table: Vocabulary table to search in (name or Table object).
        term_name: Name or synonym of the term to find.

    Returns:
        VocabularyTermHandle: The matching vocabulary term, with methods to modify it.

    Raises:
        DerivaMLVocabularyException: If the table is not a vocabulary table, or term is not found.

    Examples:
        Look up by primary name:
            >>> term = ml.lookup_term("tissue_types", "epithelial")
            >>> print(term.description)

        Look up by synonym:
            >>> term = ml.lookup_term("tissue_types", "epithelium")

        Modify the term:
            >>> term = ml.lookup_term("tissue_types", "epithelial")
            >>> term.description = "Updated description"
            >>> term.synonyms = ("epithelium", "epithelial_tissue")
    """
    # Get and validate vocabulary table reference
    vocab_table = self.model.name_to_table(table)
    if not self.model.is_vocabulary(vocab_table):
        raise DerivaMLException(f"The table {table} is not a controlled vocabulary")

    # Get schema and table names
    schema_name, table_name = vocab_table.schema.name, vocab_table.name
    cache_key = (schema_name, table_name)

    # Check cache first
    cache = self._get_vocab_cache()
    if cache_key in cache:
        term_lookup = cache[cache_key]
        if term_name in term_lookup:
            return term_lookup[term_name]
        # Term not in cache - might be newly added, try server-side lookup
    else:
        # Vocabulary not cached yet - try server-side lookup first for single term
        term = self._server_lookup_term(schema_name, table_name, term_name)
        if term is not None:
            # Found it - populate the full cache for future lookups
            self._populate_vocab_cache(schema_name, table_name)
            return self._get_vocab_cache()[cache_key][term_name]
        # Not found by name - need to check synonyms, populate cache
        term_lookup = self._populate_vocab_cache(schema_name, table_name)
        if term_name in term_lookup:
            return term_lookup[term_name]
        raise DerivaMLInvalidTerm(table_name, term_name)

    # Term not in cache - try server-side lookup (might be newly added)
    term = self._server_lookup_term(schema_name, table_name, term_name)
    if term is not None:
        # Refresh cache to get the VocabularyTermHandle
        self._populate_vocab_cache(schema_name, table_name)
        return self._get_vocab_cache()[cache_key][term_name]

    # Still not found - refresh cache and try one more time
    term_lookup = self._populate_vocab_cache(schema_name, table_name)
    if term_name in term_lookup:
        return term_lookup[term_name]

    # Term not found
    raise DerivaMLInvalidTerm(table_name, term_name)
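The caching strategy above (check an in-memory cache, fall back to a single-term server lookup, repopulate the full cache so synonyms resolve, then raise) can be reduced to a self-contained sketch. `SERVER`, `fetch_term`, and `fetch_all_terms` are stand-ins for the real catalog calls:

```python
class DerivaMLInvalidTerm(Exception):
    """Stand-in for deriva_ml's vocabulary exception."""

# Fake vocabulary table: one term with one synonym.
SERVER = {"epithelial": {"name": "epithelial", "synonyms": ["epithelium"]}}

def fetch_term(name):
    # Single-term server lookup: matches primary names only.
    return SERVER.get(name)

def fetch_all_terms():
    # Full scan, indexed by primary name *and* every synonym, so that
    # synonym lookups can be served from the cache.
    index = {}
    for term in SERVER.values():
        index[term["name"]] = term
        for syn in term["synonyms"]:
            index[syn] = term
    return index

cache = {}

def lookup_term(name):
    if name in cache:
        return cache[name]
    term = fetch_term(name)
    if term is None:
        # Name miss: the query may be a synonym, so populate the full index.
        cache.update(fetch_all_terms())
        if name in cache:
            return cache[name]
        raise DerivaMLInvalidTerm(name)
    cache.update(fetch_all_terms())  # found: warm the cache for next time
    return cache[name]

by_name = lookup_term("epithelial")
by_synonym = lookup_term("epithelium")  # served via the synonym index
```

The key design point mirrored here is that a failed primary-name lookup is not an error until the synonym index has also been consulted.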

lookup_workflow

lookup_workflow(rid: RID) -> Workflow

Look up a workflow by its Resource Identifier (RID).

Retrieves a workflow from the catalog by its RID and returns a Workflow object bound to the catalog. The returned Workflow can be modified (e.g., updating its description) and changes will be reflected in the catalog.

Parameters:

Name Type Description Default
rid RID

Resource Identifier of the workflow to look up.

required

Returns:

Name Type Description
Workflow Workflow

The workflow object bound to this catalog, allowing properties like description to be updated.

Raises:

Type Description
DerivaMLException

If the RID does not correspond to a workflow in the catalog.

Examples:

Look up a workflow and read its properties::

>>> workflow = ml.lookup_workflow("2-ABC1")
>>> print(f"Name: {workflow.name}")
>>> print(f"Description: {workflow.description}")
>>> print(f"Type: {workflow.workflow_type}")

Update a workflow's description (persisted to catalog)::

>>> workflow = ml.lookup_workflow("2-ABC1")
>>> workflow.description = "Updated analysis pipeline for RNA sequences"
>>> # The change is immediately written to the catalog

Attempting to update on a read-only catalog raises an error::

>>> snapshot = ml.catalog_snapshot("2023-01-15T10:30:00")
>>> workflow = snapshot.lookup_workflow("2-ABC1")
>>> workflow.description = "New description"
DerivaMLException: Cannot update workflow description on a read-only
    catalog snapshot. Use a writable catalog connection instead.
Source code in src/deriva_ml/core/mixins/workflow.py
def lookup_workflow(self, rid: RID) -> Workflow:
    """Look up a workflow by its Resource Identifier (RID).

    Retrieves a workflow from the catalog by its RID and returns a Workflow
    object bound to the catalog. The returned Workflow can be modified (e.g.,
    updating its description) and changes will be reflected in the catalog.

    Args:
        rid: Resource Identifier of the workflow to look up.

    Returns:
        Workflow: The workflow object bound to this catalog, allowing
            properties like ``description`` to be updated.

    Raises:
        DerivaMLException: If the RID does not correspond to a workflow
            in the catalog.

    Examples:
        Look up a workflow and read its properties::

            >>> workflow = ml.lookup_workflow("2-ABC1")
            >>> print(f"Name: {workflow.name}")
            >>> print(f"Description: {workflow.description}")
            >>> print(f"Type: {workflow.workflow_type}")

        Update a workflow's description (persisted to catalog)::

            >>> workflow = ml.lookup_workflow("2-ABC1")
            >>> workflow.description = "Updated analysis pipeline for RNA sequences"
            >>> # The change is immediately written to the catalog

        Attempting to update on a read-only catalog raises an error::

            >>> snapshot = ml.catalog_snapshot("2023-01-15T10:30:00")
            >>> workflow = snapshot.lookup_workflow("2-ABC1")
            >>> workflow.description = "New description"
            DerivaMLException: Cannot update workflow description on a read-only
                catalog snapshot. Use a writable catalog connection instead.
    """
    # Get the workflow table path
    workflow_path = self.pathBuilder().schemas[self.ml_schema].Workflow

    # Filter by RID
    records = list(workflow_path.filter(workflow_path.RID == rid).entities().fetch())

    if not records:
        raise DerivaMLException(f"Workflow with RID '{rid}' not found in the catalog")

    w = records[0]
    workflow_types = self._get_workflow_types_for_rid(w["RID"])
    workflow = Workflow(
        name=w["Name"],
        url=w["URL"],
        workflow_type=workflow_types,
        version=w["Version"],
        description=w["Description"],
        rid=w["RID"],
        checksum=w["Checksum"],
    )
    # Bind the workflow to this catalog instance for write-back support
    workflow._ml_instance = self  # type: ignore[assignment]
    return workflow

lookup_workflow_by_url

lookup_workflow_by_url(
    url_or_checksum: str,
) -> Workflow

Look up a workflow by URL or checksum and return the full Workflow object.

Searches for a workflow in the catalog that matches the given URL or checksum and returns a Workflow object bound to the catalog. This allows you to both identify a workflow by its source code location and modify its properties (e.g., description).

The URL should be a GitHub URL pointing to the specific version of the workflow source code. The format typically includes the commit hash::

https://github.com/org/repo/blob/<commit_hash>/path/to/workflow.py

Alternatively, you can search by the Git object hash (checksum) of the workflow file.

Parameters:

Name Type Description Default
url_or_checksum str

GitHub URL with commit hash, or Git object hash (checksum) of the workflow file.

required

Returns:

Name Type Description
Workflow Workflow

The workflow object bound to this catalog, allowing properties like description to be updated.

Raises:

Type Description
DerivaMLException

If no workflow with the given URL or checksum is found in the catalog.

Examples:

Look up a workflow by its GitHub URL::

>>> url = "https://github.com/org/repo/blob/abc123/analysis.py"
>>> workflow = ml.lookup_workflow_by_url(url)
>>> print(f"Found: {workflow.name}")
>>> print(f"Version: {workflow.version}")

Look up by Git object hash (checksum)::

>>> workflow = ml.lookup_workflow_by_url("abc123def456789...")
>>> print(f"Name: {workflow.name}")
>>> print(f"URL: {workflow.url}")

Update the workflow's description after lookup::

>>> workflow = ml.lookup_workflow_by_url(url)
>>> workflow.description = "Updated analysis pipeline"
>>> # The change is persisted to the catalog

Typical GitHub URL formats supported::

# Full blob URL with commit hash
https://github.com/org/repo/blob/abc123def/src/workflow.py

# The URL is matched exactly, so ensure it matches what was
# recorded when the workflow was registered
Source code in src/deriva_ml/core/mixins/workflow.py
def lookup_workflow_by_url(self, url_or_checksum: str) -> Workflow:
    """Look up a workflow by URL or checksum and return the full Workflow object.

    Searches for a workflow in the catalog that matches the given URL or
    checksum and returns a Workflow object bound to the catalog. This allows
    you to both identify a workflow by its source code location and modify
    its properties (e.g., description).

    The URL should be a GitHub URL pointing to the specific version of the
    workflow source code. The format typically includes the commit hash::

        https://github.com/org/repo/blob/<commit_hash>/path/to/workflow.py

    Alternatively, you can search by the Git object hash (checksum) of the
    workflow file.

    Args:
        url_or_checksum: GitHub URL with commit hash, or Git object hash
            (checksum) of the workflow file.

    Returns:
        Workflow: The workflow object bound to this catalog, allowing
            properties like ``description`` to be updated.

    Raises:
        DerivaMLException: If no workflow with the given URL or checksum
            is found in the catalog.

    Examples:
        Look up a workflow by its GitHub URL::

            >>> url = "https://github.com/org/repo/blob/abc123/analysis.py"
            >>> workflow = ml.lookup_workflow_by_url(url)
            >>> print(f"Found: {workflow.name}")
            >>> print(f"Version: {workflow.version}")

        Look up by Git object hash (checksum)::

            >>> workflow = ml.lookup_workflow_by_url("abc123def456789...")
            >>> print(f"Name: {workflow.name}")
            >>> print(f"URL: {workflow.url}")

        Update the workflow's description after lookup::

            >>> workflow = ml.lookup_workflow_by_url(url)
            >>> workflow.description = "Updated analysis pipeline"
            >>> # The change is persisted to the catalog

        Typical GitHub URL formats supported::

            # Full blob URL with commit hash
            https://github.com/org/repo/blob/abc123def/src/workflow.py

            # The URL is matched exactly, so ensure it matches what was
            # recorded when the workflow was registered
    """
    # Find the RID first
    rid = self._find_workflow_rid_by_url(url_or_checksum)
    if rid is None:
        raise DerivaMLException(
            f"Workflow with URL or checksum '{url_or_checksum}' not found in the catalog"
        )

    # Use lookup_workflow to get the full object with catalog binding
    return self.lookup_workflow(rid)
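The "Git object hash (checksum)" accepted above is the SHA-1 that `git hash-object` computes for a file: a `blob <size>\0` header prepended to the file bytes. A minimal local sketch (`git_blob_hash` is a hypothetical helper, not a DerivaML function):

```python
import hashlib

def git_blob_hash(data: bytes) -> str:
    # Same digest as `git hash-object <file>`: SHA-1 over header + contents.
    header = f"blob {len(data)}\0".encode()
    return hashlib.sha1(header + data).hexdigest()

digest = git_blob_hash(b"hello\n")
```

Computing the hash locally lets you look up the registered workflow for a file you have on disk without knowing its GitHub URL.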

pathBuilder

pathBuilder() -> SchemaWrapper

Returns catalog path builder for queries.

The path builder provides a fluent interface for constructing complex queries against the catalog. This is a core component used by many other methods to interact with the catalog.

Returns:

Type Description
SchemaWrapper

A new instance of the catalog path builder.

Example

>>> path = ml.pathBuilder().schemas['my_schema'].tables['my_table']
>>> results = path.entities().fetch()

Source code in src/deriva_ml/core/mixins/path_builder.py
def pathBuilder(self) -> SchemaWrapper:
    """Returns catalog path builder for queries.

    The path builder provides a fluent interface for constructing complex queries against the catalog.
    This is a core component used by many other methods to interact with the catalog.

    Returns:
        SchemaWrapper: A new instance of the catalog path builder.

    Example:
        >>> path = ml.pathBuilder().schemas['my_schema'].tables['my_table']
        >>> results = path.entities().fetch()
    """
    return self.catalog.getPathBuilder()

prefetch_dataset

prefetch_dataset(
    dataset: "DatasetSpec",
    materialize: bool = True,
) -> dict[str, Any]

Deprecated: Use cache_dataset() instead.

Source code in src/deriva_ml/core/mixins/dataset.py
def prefetch_dataset(self, dataset: "DatasetSpec", materialize: bool = True) -> dict[str, Any]:
    """Deprecated: Use cache_dataset() instead."""
    return self.cache_dataset(dataset, materialize)

remove_visible_column

remove_visible_column(
    table: str | Table,
    context: str,
    column: str | list[str] | int,
) -> list[Any]

Remove a column from the visible-columns list for a specific context.

Convenience method for removing columns without replacing the entire visible-columns annotation. Changes are staged until apply_annotations() is called.

Parameters:

Name Type Description Default
table str | Table

Table name or Table object.

required
context str

The context to modify (e.g., "compact", "detailed").

required
column str | list[str] | int

Column to remove. Can be:

  - String: column name to find and remove
  - List: foreign key reference [schema, constraint] to find and remove
  - Integer: index position to remove (0-indexed)

required

Returns:

Type Description
list[Any]

The updated column list for the context.

Raises:

Type Description
DerivaMLException

If annotation or context doesn't exist, or column not found.

Example

>>> ml.remove_visible_column("Image", "compact", "Description")
>>> ml.remove_visible_column("Image", "compact", 0)  # Remove first column
>>> ml.apply_annotations()

Source code in src/deriva_ml/core/mixins/annotation.py
@validate_call(config=ConfigDict(arbitrary_types_allowed=True))
def remove_visible_column(
    self,
    table: str | Table,
    context: str,
    column: str | list[str] | int,
) -> list[Any]:
    """Remove a column from the visible-columns list for a specific context.

    Convenience method for removing columns without replacing the entire
    visible-columns annotation. Changes are staged until apply_annotations()
    is called.

    Args:
        table: Table name or Table object.
        context: The context to modify (e.g., "compact", "detailed").
        column: Column to remove. Can be:
            - String: column name to find and remove
            - List: foreign key reference [schema, constraint] to find and remove
            - Integer: index position to remove (0-indexed)

    Returns:
        The updated column list for the context.

    Raises:
        DerivaMLException: If annotation or context doesn't exist, or column not found.

    Example:
        >>> ml.remove_visible_column("Image", "compact", "Description")
        >>> ml.remove_visible_column("Image", "compact", 0)  # Remove first column
        >>> ml.apply_annotations()
    """
    table_obj = self.model.name_to_table(table)

    # Get visible_columns annotation
    visible_cols = table_obj.annotations.get(VISIBLE_COLUMNS_TAG, {})
    if not visible_cols:
        raise DerivaMLException(f"Table '{table_obj.name}' has no visible-columns annotation.")

    # Get the context list
    context_list = visible_cols.get(context)
    if context_list is None:
        raise DerivaMLException(f"Context '{context}' not found in visible-columns annotation.")
    if isinstance(context_list, str):
        raise DerivaMLException(
            f"Context '{context}' references another context '{context_list}'. "
            "Set it explicitly first with set_visible_columns()."
        )

    # Make a copy
    context_list = list(context_list)
    removed = None

    # Remove by index or by value
    if isinstance(column, int):
        if 0 <= column < len(context_list):
            removed = context_list.pop(column)
        else:
            raise DerivaMLException(
                f"Index {column} out of range (list has {len(context_list)} items)."
            )
    else:
        # Find and remove the column
        for i, item in enumerate(context_list):
            if item == column:
                removed = context_list.pop(i)
                break
            # Also check if it's a pseudo-column with matching source
            if isinstance(item, dict) and isinstance(column, str):
                if item.get("source") == column:
                    removed = context_list.pop(i)
                    break

        if removed is None:
            raise DerivaMLException(f"Column {column!r} not found in context '{context}'.")

    # Update the annotation
    visible_cols[context] = context_list
    table_obj.annotations[VISIBLE_COLUMNS_TAG] = visible_cols

    return context_list
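The three removal modes above (by index, by exact value, or by matching a pseudo-column dict's `"source"` key) can be exercised stand-alone. `remove_entry` is an illustrative reduction of the method's core loop, using `ValueError` in place of `DerivaMLException`:

```python
def remove_entry(context_list, column):
    items = list(context_list)  # copy; the annotation is replaced, not mutated
    if isinstance(column, int):
        # Mode 1: remove by 0-based index.
        if 0 <= column < len(items):
            items.pop(column)
            return items
        raise ValueError(f"Index {column} out of range")
    for i, item in enumerate(items):
        # Mode 2: exact match (plain name or [schema, constraint] list).
        # Mode 3: pseudo-column dict whose "source" matches the name.
        if item == column or (isinstance(item, dict) and item.get("source") == column):
            items.pop(i)
            return items
    raise ValueError(f"Column {column!r} not found")

cols = ["RID", "Description", {"source": "Thumbnail"}]
```

Working on a copy matters: the staged annotation is swapped in wholesale, so a half-applied mutation never reaches the catalog.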

remove_visible_foreign_key

remove_visible_foreign_key(
    table: str | Table,
    context: str,
    foreign_key: list[str] | int,
) -> list[Any]

Remove a foreign key from the visible-foreign-keys list for a specific context.

Convenience method for removing related tables without replacing the entire visible-foreign-keys annotation. Changes are staged until apply_annotations() is called.

Parameters:

Name Type Description Default
table str | Table

Table name or Table object.

required
context str

The context to modify (e.g., "detailed", "*").

required
foreign_key list[str] | int

Foreign key to remove. Can be:

  - List: foreign key reference [schema, constraint] to find and remove
  - Integer: index position to remove (0-indexed)

required

Returns:

Type Description
list[Any]

The updated foreign key list for the context.

Raises:

Type Description
DerivaMLException

If annotation or context doesn't exist, or foreign key not found.

Example

>>> ml.remove_visible_foreign_key("Subject", "detailed", ["domain", "Image_Subject_fkey"])
>>> ml.remove_visible_foreign_key("Subject", "detailed", 0)  # Remove first
>>> ml.apply_annotations()

Source code in src/deriva_ml/core/mixins/annotation.py
@validate_call(config=ConfigDict(arbitrary_types_allowed=True))
def remove_visible_foreign_key(
    self,
    table: str | Table,
    context: str,
    foreign_key: list[str] | int,
) -> list[Any]:
    """Remove a foreign key from the visible-foreign-keys list for a specific context.

    Convenience method for removing related tables without replacing the entire
    visible-foreign-keys annotation. Changes are staged until apply_annotations()
    is called.

    Args:
        table: Table name or Table object.
        context: The context to modify (e.g., "detailed", "*").
        foreign_key: Foreign key to remove. Can be:
            - List: foreign key reference [schema, constraint] to find and remove
            - Integer: index position to remove (0-indexed)

    Returns:
        The updated foreign key list for the context.

    Raises:
        DerivaMLException: If annotation or context doesn't exist, or foreign key not found.

    Example:
        >>> ml.remove_visible_foreign_key("Subject", "detailed", ["domain", "Image_Subject_fkey"])
        >>> ml.remove_visible_foreign_key("Subject", "detailed", 0)  # Remove first
        >>> ml.apply_annotations()
    """
    table_obj = self.model.name_to_table(table)

    # Get visible_foreign_keys annotation
    visible_fkeys = table_obj.annotations.get(VISIBLE_FOREIGN_KEYS_TAG, {})
    if not visible_fkeys:
        raise DerivaMLException(
            f"Table '{table_obj.name}' has no visible-foreign-keys annotation."
        )

    # Get the context list
    context_list = visible_fkeys.get(context)
    if context_list is None:
        raise DerivaMLException(
            f"Context '{context}' not found in visible-foreign-keys annotation."
        )
    if isinstance(context_list, str):
        raise DerivaMLException(
            f"Context '{context}' references another context '{context_list}'. "
            "Set it explicitly first with set_visible_foreign_keys()."
        )

    # Make a copy
    context_list = list(context_list)
    removed = None

    # Remove by index or by value
    if isinstance(foreign_key, int):
        if 0 <= foreign_key < len(context_list):
            removed = context_list.pop(foreign_key)
        else:
            raise DerivaMLException(
                f"Index {foreign_key} out of range (list has {len(context_list)} items)."
            )
    else:
        # Find and remove the foreign key
        for i, item in enumerate(context_list):
            if item == foreign_key:
                removed = context_list.pop(i)
                break

        if removed is None:
            raise DerivaMLException(
                f"Foreign key {foreign_key!r} not found in context '{context}'."
            )

    # Update the annotation
    visible_fkeys[context] = context_list
    table_obj.annotations[VISIBLE_FOREIGN_KEYS_TAG] = visible_fkeys

    return context_list

reorder_visible_columns

reorder_visible_columns(
    table: str | Table,
    context: str,
    new_order: list[int]
    | list[
        str | list[str] | dict[str, Any]
    ],
) -> list[Any]

Reorder columns in the visible-columns list for a specific context.

Convenience method for reordering columns without manually reconstructing the list. Changes are staged until apply_annotations() is called.

Parameters:

Name Type Description Default
table str | Table

Table name or Table object.

required
context str

The context to modify (e.g., "compact", "detailed").

required
new_order list[int] | list[str | list[str] | dict[str, Any]]

The new order specification. Can be:

- List of indices: [2, 0, 1, 3] reorders by current positions
- List of column specs: ["Name", "RID", ...] specifies exact order

required

Returns:

Type Description
list[Any]

The reordered column list.

Raises:

Type Description
DerivaMLException

If annotation or context doesn't exist, or invalid order.

Example

ml.reorder_visible_columns("Image", "compact", [2, 0, 1, 3, 4])
ml.reorder_visible_columns("Image", "compact", ["Filename", "Subject", "RID"])
ml.apply_annotations()
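The index form of new_order behaves like applying a permutation: every current position must appear exactly once. A minimal standalone sketch of the validation and reorder logic (independent of DerivaML; `reorder_by_indices` is an illustrative helper, not part of the API):

```python
def reorder_by_indices(items: list, order: list[int]) -> list:
    """Reorder items by a permutation of indices, validating completeness."""
    if len(order) != len(items):
        raise ValueError(f"Expected {len(items)} indices, got {len(order)}")
    if set(order) != set(range(len(items))):
        raise ValueError("Index list must contain each index exactly once")
    # Build the new list by pulling each item from its old position
    return [items[i] for i in order]
```

This is why passing a shorter or duplicated index list raises a DerivaMLException in the method above: the list must be a complete permutation of the current positions.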

Source code in src/deriva_ml/core/mixins/annotation.py
@validate_call(config=ConfigDict(arbitrary_types_allowed=True))
def reorder_visible_columns(
    self,
    table: str | Table,
    context: str,
    new_order: list[int] | list[str | list[str] | dict[str, Any]],
) -> list[Any]:
    """Reorder columns in the visible-columns list for a specific context.

    Convenience method for reordering columns without manually reconstructing
    the list. Changes are staged until apply_annotations() is called.

    Args:
        table: Table name or Table object.
        context: The context to modify (e.g., "compact", "detailed").
        new_order: The new order specification. Can be:
            - List of indices: [2, 0, 1, 3] reorders by current positions
            - List of column specs: ["Name", "RID", ...] specifies exact order

    Returns:
        The reordered column list.

    Raises:
        DerivaMLException: If annotation or context doesn't exist, or invalid order.

    Example:
        >>> ml.reorder_visible_columns("Image", "compact", [2, 0, 1, 3, 4])
        >>> ml.reorder_visible_columns("Image", "compact", ["Filename", "Subject", "RID"])
        >>> ml.apply_annotations()
    """
    table_obj = self.model.name_to_table(table)

    # Get visible_columns annotation
    visible_cols = table_obj.annotations.get(VISIBLE_COLUMNS_TAG, {})
    if not visible_cols:
        raise DerivaMLException(f"Table '{table_obj.name}' has no visible-columns annotation.")

    # Get the context list
    context_list = visible_cols.get(context)
    if context_list is None:
        raise DerivaMLException(f"Context '{context}' not found in visible-columns annotation.")
    if isinstance(context_list, str):
        raise DerivaMLException(
            f"Context '{context}' references another context '{context_list}'. "
            "Set it explicitly first with set_visible_columns()."
        )

    original_list = list(context_list)

    # Determine if new_order is indices or column specs
    if new_order and isinstance(new_order[0], int):
        # Reorder by indices
        if len(new_order) != len(original_list):
            raise DerivaMLException(
                f"Index list length ({len(new_order)}) must match "
                f"current list length ({len(original_list)})."
            )
        if set(new_order) != set(range(len(original_list))):
            raise DerivaMLException("Index list must contain each index exactly once.")
        new_list = [original_list[i] for i in new_order]
    else:
        # new_order is the exact new column list
        new_list = list(new_order)

    # Update the annotation
    visible_cols[context] = new_list
    table_obj.annotations[VISIBLE_COLUMNS_TAG] = visible_cols

    return new_list

reorder_visible_foreign_keys

reorder_visible_foreign_keys(
    table: str | Table,
    context: str,
    new_order: list[int]
    | list[list[str] | dict[str, Any]],
) -> list[Any]

Reorder foreign keys in the visible-foreign-keys list for a specific context.

Convenience method for reordering related tables without manually reconstructing the list. Changes are staged until apply_annotations() is called.

Parameters:

Name Type Description Default
table str | Table

Table name or Table object.

required
context str

The context to modify (e.g., "detailed", "*").

required
new_order list[int] | list[list[str] | dict[str, Any]]

The new order specification. Can be:

- List of indices: [2, 0, 1] reorders by current positions
- List of foreign key refs: [["schema", "fkey1"], ...] specifies exact order

required

Returns:

Type Description
list[Any]

The reordered foreign key list.

Raises:

Type Description
DerivaMLException

If annotation or context doesn't exist, or invalid order.

Example

ml.reorder_visible_foreign_keys("Subject", "detailed", [2, 0, 1])
ml.apply_annotations()

Source code in src/deriva_ml/core/mixins/annotation.py
@validate_call(config=ConfigDict(arbitrary_types_allowed=True))
def reorder_visible_foreign_keys(
    self,
    table: str | Table,
    context: str,
    new_order: list[int] | list[list[str] | dict[str, Any]],
) -> list[Any]:
    """Reorder foreign keys in the visible-foreign-keys list for a specific context.

    Convenience method for reordering related tables without manually
    reconstructing the list. Changes are staged until apply_annotations()
    is called.

    Args:
        table: Table name or Table object.
        context: The context to modify (e.g., "detailed", "*").
        new_order: The new order specification. Can be:
            - List of indices: [2, 0, 1] reorders by current positions
            - List of foreign key refs: [["schema", "fkey1"], ...] specifies exact order

    Returns:
        The reordered foreign key list.

    Raises:
        DerivaMLException: If annotation or context doesn't exist, or invalid order.

    Example:
        >>> ml.reorder_visible_foreign_keys("Subject", "detailed", [2, 0, 1])
        >>> ml.apply_annotations()
    """
    table_obj = self.model.name_to_table(table)

    # Get visible_foreign_keys annotation
    visible_fkeys = table_obj.annotations.get(VISIBLE_FOREIGN_KEYS_TAG, {})
    if not visible_fkeys:
        raise DerivaMLException(
            f"Table '{table_obj.name}' has no visible-foreign-keys annotation."
        )

    # Get the context list
    context_list = visible_fkeys.get(context)
    if context_list is None:
        raise DerivaMLException(
            f"Context '{context}' not found in visible-foreign-keys annotation."
        )
    if isinstance(context_list, str):
        raise DerivaMLException(
            f"Context '{context}' references another context '{context_list}'. "
            "Set it explicitly first with set_visible_foreign_keys()."
        )

    original_list = list(context_list)

    # Determine if new_order is indices or foreign key specs
    if new_order and isinstance(new_order[0], int):
        # Reorder by indices
        if len(new_order) != len(original_list):
            raise DerivaMLException(
                f"Index list length ({len(new_order)}) must match "
                f"current list length ({len(original_list)})."
            )
        if set(new_order) != set(range(len(original_list))):
            raise DerivaMLException("Index list must contain each index exactly once.")
        new_list = [original_list[i] for i in new_order]
    else:
        # new_order is the exact new foreign key list
        new_list = list(new_order)

    # Update the annotation
    visible_fkeys[context] = new_list
    table_obj.annotations[VISIBLE_FOREIGN_KEYS_TAG] = visible_fkeys

    return new_list

resolve_rid

resolve_rid(
    rid: RID,
) -> ResolveRidResult

Resolves RID to catalog location.

Looks up a RID and returns information about where it exists in the catalog, including schema, table, and column metadata.

Parameters:

Name Type Description Default
rid RID

Resource Identifier to resolve.

required

Returns:

Name Type Description
ResolveRidResult ResolveRidResult

Named tuple containing:

- schema: Schema name
- table: Table name
- columns: Column definitions
- datapath: Path builder for accessing the entity

Raises:

Type Description
DerivaMLException

If RID doesn't exist in catalog.

Examples:

>>> result = ml.resolve_rid("1-abc123")
>>> print(f"Found in {result.schema}.{result.table}")
>>> data = result.datapath.entities().fetch()
Source code in src/deriva_ml/core/mixins/rid_resolution.py
def resolve_rid(self, rid: RID) -> ResolveRidResult:
    """Resolves RID to catalog location.

    Looks up a RID and returns information about where it exists in the catalog, including schema,
    table, and column metadata.

    Args:
        rid: Resource Identifier to resolve.

    Returns:
        ResolveRidResult: Named tuple containing:
            - schema: Schema name
            - table: Table name
            - columns: Column definitions
            - datapath: Path builder for accessing the entity

    Raises:
        DerivaMLException: If RID doesn't exist in catalog.

    Examples:
        >>> result = ml.resolve_rid("1-abc123")
        >>> print(f"Found in {result.schema}.{result.table}")
        >>> data = result.datapath.entities().fetch()
    """
    try:
        # Attempt to resolve RID using catalog model
        return self.catalog.resolve_rid(rid, self.model.model)
    except KeyError as e:
        raise DerivaMLException(f"Invalid RID {rid}") from e

resolve_rids

resolve_rids(
    rids: set[RID] | list[RID],
    candidate_tables: list[Table]
    | None = None,
) -> dict[RID, BatchRidResult]

Batch resolve multiple RIDs efficiently.

Resolves multiple RIDs in batched queries, significantly faster than calling resolve_rid() for each RID individually. Instead of N network calls for N RIDs, this makes one query per candidate table.

Parameters:

Name Type Description Default
rids set[RID] | list[RID]

Set or list of RIDs to resolve.

required
candidate_tables list[Table] | None

Optional list of Table objects to search in. If not provided, searches all tables in domain and ML schemas.

None

Returns:

Type Description
dict[RID, BatchRidResult]

dict[RID, BatchRidResult]: Mapping from each resolved RID to its BatchRidResult containing table information.

Raises:

Type Description
DerivaMLException

If any RID cannot be resolved.

Example

results = ml.resolve_rids(["1-ABC", "2-DEF", "3-GHI"])
for rid, info in results.items():
    print(f"{rid} is in table {info.table_name}")
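The batching strategy can be sketched independently of ERMrest: keep a set of unresolved RIDs, issue one membership query per candidate table, and stop early once everything is found. Here `query_table` is a hypothetical stand-in for the per-table RID filter, not a real deriva call:

```python
def batch_resolve(rids, tables, query_table):
    """Resolve RIDs with one query per table instead of one per RID.

    query_table(table, rids) returns the subset of rids present in table.
    """
    remaining = set(rids)
    results = {}
    for table in tables:
        if not remaining:
            break  # early exit: all RIDs already located
        for rid in query_table(table, remaining):
            results[rid] = table
        remaining -= set(results)
    if remaining:
        raise KeyError(f"Invalid RIDs: {remaining}")
    return results
```

With N RIDs spread over T candidate tables this issues at most T queries, rather than N individual lookups, which is where the speedup over repeated resolve_rid() calls comes from.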

Source code in src/deriva_ml/core/mixins/rid_resolution.py
def resolve_rids(
    self,
    rids: set[RID] | list[RID],
    candidate_tables: list[Table] | None = None,
) -> dict[RID, BatchRidResult]:
    """Batch resolve multiple RIDs efficiently.

    Resolves multiple RIDs in batched queries, significantly faster than
    calling resolve_rid() for each RID individually. Instead of N network
    calls for N RIDs, this makes one query per candidate table.

    Args:
        rids: Set or list of RIDs to resolve.
        candidate_tables: Optional list of Table objects to search in.
            If not provided, searches all tables in domain and ML schemas.

    Returns:
        dict[RID, BatchRidResult]: Mapping from each resolved RID to its
            BatchRidResult containing table information.

    Raises:
        DerivaMLException: If any RID cannot be resolved.

    Example:
        >>> results = ml.resolve_rids(["1-ABC", "2-DEF", "3-GHI"])
        >>> for rid, info in results.items():
        ...     print(f"{rid} is in table {info.table_name}")
    """
    rids = set(rids)
    if not rids:
        return {}

    results: dict[RID, BatchRidResult] = {}
    remaining_rids = set(rids)

    # Determine which tables to search
    if candidate_tables is None:
        # Search all tables in domain and ML schemas
        candidate_tables = []
        for schema_name in [*self.model.domain_schemas, self.model.ml_schema]:
            schema = self.model.model.schemas.get(schema_name)
            if schema:
                candidate_tables.extend(schema.tables.values())

    pb = self.pathBuilder()

    # Query each candidate table for matching RIDs
    for table in candidate_tables:
        if not remaining_rids:
            break

        schema_name = table.schema.name
        table_name = table.name

        # Build a query with RID filter for all remaining RIDs
        table_path = pb.schemas[schema_name].tables[table_name]

        # Use ERMrest's Any quantifier for IN-style query
        # Query only for RID column to minimize data transfer
        try:
            # Filter: RID = any(rid1, rid2, ...) - ERMrest's way of doing IN clause
            found_entities = list(
                table_path.filter(table_path.RID == AnyQuantifier(*remaining_rids))
                .attributes(table_path.RID)
                .fetch()
            )
        except Exception as e:
            logger.debug(f"RID resolution query failed for {schema_name}.{table_name}: {e}")
            continue

        # Process found RIDs
        for entity in found_entities:
            rid = entity["RID"]
            if rid in remaining_rids:
                results[rid] = BatchRidResult(
                    rid=rid,
                    table=table,
                    table_name=table_name,
                    schema_name=schema_name,
                )
                remaining_rids.remove(rid)

    # Check if any RIDs were not found
    if remaining_rids:
        raise DerivaMLException(f"Invalid RIDs: {remaining_rids}")

    return results

restore_execution

restore_execution(
    execution_rid: RID | None = None,
) -> "Execution"

Restores a previous execution.

Given an execution RID, retrieves the execution configuration and restores the local compute environment. This routine has a number of side effects.

  1. The datasets specified in the configuration are downloaded and placed in the cache-dir. If a version is not specified in the configuration, then a new minor version number is created for the dataset and downloaded.

  2. If any execution assets are provided in the configuration, they are downloaded and placed in the working directory.

Parameters:

Name Type Description Default
execution_rid RID | None

Resource Identifier (RID) of the execution to restore.

None

Returns:

Name Type Description
Execution 'Execution'

An execution object representing the restored execution environment.

Raises:

Type Description
DerivaMLException

If execution_rid is not valid or execution cannot be restored.

Example

execution = ml.restore_execution("1-abc123")

Source code in src/deriva_ml/core/mixins/execution.py
def restore_execution(self, execution_rid: RID | None = None) -> "Execution":
    """Restores a previous execution.

    Given an execution RID, retrieves the execution configuration and restores the local compute environment.
    This routine has a number of side effects.

    1. The datasets specified in the configuration are downloaded and placed in the cache-dir. If a version is
    not specified in the configuration, then a new minor version number is created for the dataset and downloaded.

    2. If any execution assets are provided in the configuration, they are downloaded and placed
    in the working directory.

    Args:
        execution_rid: Resource Identifier (RID) of the execution to restore.

    Returns:
        Execution: An execution object representing the restored execution environment.

    Raises:
        DerivaMLException: If execution_rid is not valid or execution cannot be restored.

    Example:
        >>> execution = ml.restore_execution("1-abc123")
    """
    # Import here to avoid circular dependency
    from deriva_ml.execution.execution import Execution

    # If no RID provided, try to find single execution in working directory
    if not execution_rid:
        e_rids = execution_rids(self.working_dir)
        if len(e_rids) != 1:
            raise DerivaMLException(
                f"Expected exactly one execution RID in the working directory, found: {e_rids}."
            )
        execution_rid = e_rids[0]

    # Try to load configuration from a file
    cfile = asset_file_path(
        prefix=self.working_dir,
        exec_rid=execution_rid,
        file_name="configuration.json",
        asset_table=self.model.name_to_table("Execution_Metadata"),
        metadata={},
    )

    # Load configuration from a file or create from an execution record
    if cfile.exists():
        configuration = ExecutionConfiguration.load_configuration(cfile)
    else:
        execution = self.retrieve_rid(execution_rid)
        # Look up the workflow object from the RID
        workflow_rid = execution.get("Workflow")
        workflow = self.lookup_workflow(workflow_rid) if workflow_rid else None
        configuration = ExecutionConfiguration(
            workflow=workflow,
            description=execution["Description"],
        )

    # Create and return an execution instance
    return Execution(configuration, self, reload=execution_rid)  # type: ignore[arg-type]

retrieve_rid

retrieve_rid(
    rid: RID,
) -> dict[str, Any]

Retrieves complete record for RID.

Fetches all column values for the entity identified by the RID.

Parameters:

Name Type Description Default
rid RID

Resource Identifier of the record to retrieve.

required

Returns:

Type Description
dict[str, Any]

dict[str, Any]: Dictionary containing all column values for the entity.

Raises:

Type Description
DerivaMLException

If the RID doesn't exist in the catalog.

Example

record = ml.retrieve_rid("1-abc123")
print(f"Name: {record['name']}, Created: {record['creation_date']}")

Source code in src/deriva_ml/core/mixins/rid_resolution.py
def retrieve_rid(self, rid: RID) -> dict[str, Any]:
    """Retrieves complete record for RID.

    Fetches all column values for the entity identified by the RID.

    Args:
        rid: Resource Identifier of the record to retrieve.

    Returns:
        dict[str, Any]: Dictionary containing all column values for the entity.

    Raises:
        DerivaMLException: If the RID doesn't exist in the catalog.

    Example:
        >>> record = ml.retrieve_rid("1-abc123")
        >>> print(f"Name: {record['name']}, Created: {record['creation_date']}")
    """
    # Resolve RID and fetch the first (only) matching record
    return self.resolve_rid(rid).datapath.entities().fetch()[0]

select_by_workflow

select_by_workflow(
    records: list[FeatureRecord],
    workflow: str,
) -> FeatureRecord

Select the newest feature record created by a specific workflow.

Filters a list of FeatureRecord instances to only those whose Execution was created by a matching workflow, then returns the newest match by RCT. This is useful when multiple model runs or annotators have labeled the same data and you want to use values from a particular workflow.

Resolution chain:

The workflow argument is first tried as a Workflow RID. If no workflow is found with that RID, it is treated as a Workflow_Type name (e.g., "Training", "Feature_Creation"). The resolution chain is:

  1. workflow → Workflow.RID → all Executions for that workflow
  2. workflow → Workflow_Type.Name → all Workflows of that type → all Executions for those workflows

Matching records are then filtered by Execution and the newest (by RCT) is returned.

Note: Unlike FeatureRecord.select_newest, this method cannot be passed directly as a selector argument because it requires catalog access. Call it directly on a list of records instead.
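The final "newest by RCT" step reduces to a max over creation timestamps. A minimal sketch, assuming records expose an ISO-8601 RCT string (which sorts lexicographically); the dict record shape here is illustrative, not the actual FeatureRecord class:

```python
def select_newest(records: list[dict]) -> dict:
    """Pick the record with the latest RCT (record creation time).

    Assumes RCT values are ISO-8601 timestamps, so string order
    matches chronological order.
    """
    if not records:
        raise ValueError("No records to select from")
    return max(records, key=lambda r: r["RCT"])
```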

Parameters:

Name Type Description Default
records list[FeatureRecord]

List of FeatureRecord instances to select from. Typically all values for a single target object from one feature.

required
workflow str

Either a Workflow RID (e.g., "2-ABC1") or a Workflow_Type name (e.g., "Training"). Auto-detected: tries RID lookup first, falls back to type name.

required

Returns:

Type Description
FeatureRecord

The newest FeatureRecord whose execution matches the workflow.

Raises:

Type Description
DerivaMLException

If no workflows match the given identifier, no executions exist for the matched workflow(s), or no records in the input list were created by matching executions.

Examples:

Select the newest label from any Training workflow::

>>> all_values = ml.list_feature_values("Image", "Classification")
>>> from collections import defaultdict
>>> by_image = defaultdict(list)
>>> for v in all_values:
...     by_image[v.Image].append(v)
>>> selected = {
...     img: ml.select_by_workflow(recs, "Training")
...     for img, recs in by_image.items()
... }

Select by a specific workflow RID::

>>> record = ml.select_by_workflow(records, "2-ABC1")
Source code in src/deriva_ml/core/mixins/feature.py
def select_by_workflow(
    self,
    records: list[FeatureRecord],
    workflow: str,
) -> FeatureRecord:
    """Select the newest feature record created by a specific workflow.

    Filters a list of FeatureRecord instances to only those whose
    ``Execution`` was created by a matching workflow, then returns the
    newest match by RCT. This is useful when multiple model runs or
    annotators have labeled the same data and you want to use values
    from a particular workflow.

    **Resolution chain:**

    The ``workflow`` argument is first tried as a Workflow RID. If no
    workflow is found with that RID, it is treated as a Workflow_Type
    name (e.g., ``"Training"``, ``"Feature_Creation"``). The resolution
    chain is:

    1. ``workflow`` → ``Workflow.RID`` → all Executions for that workflow
    2. ``workflow`` → ``Workflow_Type.Name`` → all Workflows of that type
       → all Executions for those workflows

    Matching records are then filtered by ``Execution`` and the newest
    (by RCT) is returned.

    Note: Unlike ``FeatureRecord.select_newest``, this method cannot be
    passed directly as a ``selector`` argument because it requires catalog
    access. Call it directly on a list of records instead.

    Args:
        records: List of FeatureRecord instances to select from. Typically
            all values for a single target object from one feature.
        workflow: Either a Workflow RID (e.g., ``"2-ABC1"``) or a
            Workflow_Type name (e.g., ``"Training"``). Auto-detected:
            tries RID lookup first, falls back to type name.

    Returns:
        The newest FeatureRecord whose execution matches the workflow.

    Raises:
        DerivaMLException: If no workflows match the given identifier,
            no executions exist for the matched workflow(s), or no
            records in the input list were created by matching executions.

    Examples:
        Select the newest label from any Training workflow::

            >>> all_values = ml.list_feature_values("Image", "Classification")
            >>> from collections import defaultdict
            >>> by_image = defaultdict(list)
            >>> for v in all_values:
            ...     by_image[v.Image].append(v)
            >>> selected = {
            ...     img: ml.select_by_workflow(recs, "Training")
            ...     for img, recs in by_image.items()
            ... }

        Select by a specific workflow RID::

            >>> record = ml.select_by_workflow(records, "2-ABC1")
    """
    # Determine matching execution RIDs
    matching_execution_rids: set[str] = set()

    # Try as a Workflow RID first
    try:
        wf = self.lookup_workflow(workflow)
        # Found a workflow — get all executions for this workflow
        for exec_record in self.find_executions(workflow=wf):
            matching_execution_rids.add(exec_record.execution_rid)
    except DerivaMLException:
        # Not a valid workflow RID — treat as Workflow_Type name
        pb = self.pathBuilder()
        wt_assoc = pb.schemas[self.ml_schema].Workflow_Workflow_Type
        matching_workflows = {
            row["Workflow"]
            for row in wt_assoc.filter(
                wt_assoc.Workflow_Type == workflow
            ).entities().fetch()
        }
        if not matching_workflows:
            raise DerivaMLException(
                f"No workflows found for workflow type '{workflow}'."
            )
        for exec_record in self.find_executions():
            if exec_record.workflow_rid in matching_workflows:
                matching_execution_rids.add(exec_record.execution_rid)

    if not matching_execution_rids:
        raise DerivaMLException(
            f"No executions found for workflow '{workflow}'."
        )

    # Filter records to those matching the workflow's executions
    filtered = [r for r in records if r.Execution in matching_execution_rids]
    if not filtered:
        raise DerivaMLException(
            f"No feature records match workflow '{workflow}'."
        )

    return FeatureRecord.select_newest(filtered)

set_column_display

set_column_display(
    table: str | Table,
    column_name: str,
    annotation: dict[str, Any] | None,
) -> str

Set the column-display annotation on a column.

Controls how a column's values are rendered, including custom formatting and markdown patterns. Changes are staged locally until apply_annotations() is called.

Parameters:

Name Type Description Default
table str | Table

Table name or Table object containing the column.

required
column_name str

Name of the column.

required
annotation dict[str, Any] | None

The column-display annotation value. Set to None to remove.

required

Returns:

Type Description
str

Column identifier (table.column).

Example

ml.set_column_display("Measurement", "Value", {
    "*": {"pre_format": {"format": "%.2f"}}
})
ml.apply_annotations()

Source code in src/deriva_ml/core/mixins/annotation.py
@validate_call(config=ConfigDict(arbitrary_types_allowed=True))
def set_column_display(
    self,
    table: str | Table,
    column_name: str,
    annotation: dict[str, Any] | None,
) -> str:
    """Set the column-display annotation on a column.

    Controls how a column's values are rendered, including custom
    formatting and markdown patterns.
    Changes are staged locally until apply_annotations() is called.

    Args:
        table: Table name or Table object containing the column.
        column_name: Name of the column.
        annotation: The column-display annotation value. Set to None to remove.

    Returns:
        Column identifier (table.column).

    Example:
        >>> ml.set_column_display("Measurement", "Value", {
        ...     "*": {"pre_format": {"format": "%.2f"}}
        ... })
        >>> ml.apply_annotations()
    """
    table_obj = self.model.name_to_table(table)
    column = table_obj.columns[column_name]

    if annotation is None:
        column.annotations.pop(COLUMN_DISPLAY_TAG, None)
    else:
        column.annotations[COLUMN_DISPLAY_TAG] = annotation

    return f"{table_obj.name}.{column_name}"

set_display_annotation

set_display_annotation(
    table: str | Table,
    annotation: dict[str, Any] | None,
    column_name: str | None = None,
) -> str

Set the display annotation on a table or column.

The display annotation controls basic naming and display options. Changes are staged locally until apply_annotations() is called.

Parameters:

- table (str | Table): Table name or Table object. [required]
- annotation (dict[str, Any] | None): The display annotation value. Set to None to remove. [required]
- column_name (str | None): If provided, sets annotation on the column; otherwise on the table. [default: None]

Returns:

- str: Target identifier (table name or table.column).

Example

ml.set_display_annotation("Image", {"name": "Images"})
ml.set_display_annotation("Image", {"name": "File Name"}, column_name="Filename")
ml.apply_annotations()  # Commit changes

Source code in src/deriva_ml/core/mixins/annotation.py
@validate_call(config=ConfigDict(arbitrary_types_allowed=True))
def set_display_annotation(
    self,
    table: str | Table,
    annotation: dict[str, Any] | None,
    column_name: str | None = None,
) -> str:
    """Set the display annotation on a table or column.

    The display annotation controls basic naming and display options.
    Changes are staged locally until apply_annotations() is called.

    Args:
        table: Table name or Table object.
        annotation: The display annotation value. Set to None to remove.
        column_name: If provided, sets annotation on the column; otherwise on the table.

    Returns:
        Target identifier (table name or table.column).

    Example:
        >>> ml.set_display_annotation("Image", {"name": "Images"})
        >>> ml.set_display_annotation("Image", {"name": "File Name"}, column_name="Filename")
        >>> ml.apply_annotations()  # Commit changes
    """
    table_obj = self.model.name_to_table(table)

    if column_name:
        column = table_obj.columns[column_name]
        if annotation is None:
            column.annotations.pop(DISPLAY_TAG, None)
        else:
            column.annotations[DISPLAY_TAG] = annotation
        return f"{table_obj.name}.{column_name}"
    else:
        if annotation is None:
            table_obj.annotations.pop(DISPLAY_TAG, None)
        else:
            table_obj.annotations[DISPLAY_TAG] = annotation
        return table_obj.name

set_table_display

set_table_display(
    table: str | Table,
    annotation: dict[str, Any] | None,
) -> str

Set the table-display annotation on a table.

Controls table-level display options like row naming patterns, page size, and row ordering. Changes are staged locally until apply_annotations() is called.

Parameters:

- table (str | Table): Table name or Table object. [required]
- annotation (dict[str, Any] | None): The table-display annotation value. Set to None to remove. [required]

Returns:

- str: Table name.

Example

ml.set_table_display("Subject", {
    "row_name": {
        "row_markdown_pattern": "{{{Name}}} ({{{Species}}})"
    }
})
ml.apply_annotations()

Source code in src/deriva_ml/core/mixins/annotation.py
@validate_call(config=ConfigDict(arbitrary_types_allowed=True))
def set_table_display(
    self,
    table: str | Table,
    annotation: dict[str, Any] | None,
) -> str:
    """Set the table-display annotation on a table.

    Controls table-level display options like row naming patterns,
    page size, and row ordering.
    Changes are staged locally until apply_annotations() is called.

    Args:
        table: Table name or Table object.
        annotation: The table-display annotation value. Set to None to remove.

    Returns:
        Table name.

    Example:
        >>> ml.set_table_display("Subject", {
        ...     "row_name": {
        ...         "row_markdown_pattern": "{{{Name}}} ({{{Species}}})"
        ...     }
        ... })
        >>> ml.apply_annotations()
    """
    table_obj = self.model.name_to_table(table)

    if annotation is None:
        table_obj.annotations.pop(TABLE_DISPLAY_TAG, None)
    else:
        table_obj.annotations[TABLE_DISPLAY_TAG] = annotation

    return table_obj.name

set_visible_columns

set_visible_columns(
    table: str | Table,
    annotation: dict[str, Any] | None,
) -> str

Set the visible-columns annotation on a table.

Controls which columns appear in different UI contexts and their order. Changes are staged locally until apply_annotations() is called.

Parameters:

- table (str | Table): Table name or Table object. [required]
- annotation (dict[str, Any] | None): The visible-columns annotation value. Set to None to remove. [required]

Returns:

- str: Table name.

Example

ml.set_visible_columns("Image", {
    "compact": ["RID", "Filename", "Subject"],
    "detailed": ["RID", "Filename", "Subject", "Description"]
})
ml.apply_annotations()

Source code in src/deriva_ml/core/mixins/annotation.py
@validate_call(config=ConfigDict(arbitrary_types_allowed=True))
def set_visible_columns(
    self,
    table: str | Table,
    annotation: dict[str, Any] | None,
) -> str:
    """Set the visible-columns annotation on a table.

    Controls which columns appear in different UI contexts and their order.
    Changes are staged locally until apply_annotations() is called.

    Args:
        table: Table name or Table object.
        annotation: The visible-columns annotation value. Set to None to remove.

    Returns:
        Table name.

    Example:
        >>> ml.set_visible_columns("Image", {
        ...     "compact": ["RID", "Filename", "Subject"],
        ...     "detailed": ["RID", "Filename", "Subject", "Description"]
        ... })
        >>> ml.apply_annotations()
    """
    table_obj = self.model.name_to_table(table)

    if annotation is None:
        table_obj.annotations.pop(VISIBLE_COLUMNS_TAG, None)
    else:
        table_obj.annotations[VISIBLE_COLUMNS_TAG] = annotation

    return table_obj.name

set_visible_foreign_keys

set_visible_foreign_keys(
    table: str | Table,
    annotation: dict[str, Any] | None,
) -> str

Set the visible-foreign-keys annotation on a table.

Controls which related tables (via inbound foreign keys) appear in different UI contexts and their order. Changes are staged locally until apply_annotations() is called.

Parameters:

- table (str | Table): Table name or Table object. [required]
- annotation (dict[str, Any] | None): The visible-foreign-keys annotation value. Set to None to remove. [required]

Returns:

- str: Table name.

Example

ml.set_visible_foreign_keys("Subject", {
    "detailed": [
        ["domain", "Image_Subject_fkey"],
        ["domain", "Diagnosis_Subject_fkey"]
    ]
})
ml.apply_annotations()

Source code in src/deriva_ml/core/mixins/annotation.py
@validate_call(config=ConfigDict(arbitrary_types_allowed=True))
def set_visible_foreign_keys(
    self,
    table: str | Table,
    annotation: dict[str, Any] | None,
) -> str:
    """Set the visible-foreign-keys annotation on a table.

    Controls which related tables (via inbound foreign keys) appear in
    different UI contexts and their order.
    Changes are staged locally until apply_annotations() is called.

    Args:
        table: Table name or Table object.
        annotation: The visible-foreign-keys annotation value. Set to None to remove.

    Returns:
        Table name.

    Example:
        >>> ml.set_visible_foreign_keys("Subject", {
        ...     "detailed": [
        ...         ["domain", "Image_Subject_fkey"],
        ...         ["domain", "Diagnosis_Subject_fkey"]
        ...     ]
        ... })
        >>> ml.apply_annotations()
    """
    table_obj = self.model.name_to_table(table)

    if annotation is None:
        table_obj.annotations.pop(VISIBLE_FOREIGN_KEYS_TAG, None)
    else:
        table_obj.annotations[VISIBLE_FOREIGN_KEYS_TAG] = annotation

    return table_obj.name

table_path

table_path(
    table: str | Table,
    schema: str | None = None,
) -> Path

Returns a local filesystem path for table CSV files.

Generates a standardized path where CSV files should be placed when preparing to upload data to a table. The path follows the project's directory structure conventions.

Parameters:

- table (str | Table): Name of the table or Table object to get the path for. [required]
- schema (str | None): Schema name for the path. If None, uses the table's schema or default_schema. [default: None]

Returns:

- Path: Filesystem path where the CSV file should be placed.

Example

path = ml.table_path("experiment_results")
df.to_csv(path)  # Save data for upload

Source code in src/deriva_ml/core/mixins/path_builder.py
def table_path(self, table: str | Table, schema: str | None = None) -> Path:
    """Returns a local filesystem path for table CSV files.

    Generates a standardized path where CSV files should be placed when preparing to upload data to a table.
    The path follows the project's directory structure conventions.

    Args:
        table: Name of the table or Table object to get the path for.
        schema: Schema name for the path. If None, uses the table's schema or default_schema.

    Returns:
        Path: Filesystem path where the CSV file should be placed.

    Example:
        >>> path = ml.table_path("experiment_results")
        >>> df.to_csv(path) # Save data for upload
    """
    table_obj = self.model.name_to_table(table)
    # Use table's schema if available, otherwise use provided schema or default
    schema = schema or table_obj.schema.name
    return _table_path(
        self.working_dir,
        schema=schema,
        table=table_obj.name,
    )
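The exact directory layout is delegated to the internal _table_path helper, which is not shown here. As a rough illustration only, assuming a hypothetical `<working_dir>/<schema>/<table>.csv` convention:

```python
from pathlib import Path

def table_csv_path(working_dir: Path, schema: str, table: str) -> Path:
    """Hypothetical layout: <working_dir>/<schema>/<table>.csv."""
    return working_dir / schema / f"{table}.csv"

# Compose a CSV location for a table in the domain schema.
path = table_csv_path(Path("/tmp/deriva-ml"), "my_domain", "experiment_results")
```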

user_list

user_list() -> List[Dict[str, str]]

Returns catalog user list.

Retrieves basic information about all users who have access to the catalog, including their identifiers and full names.

Returns:

- List[Dict[str, str]]: List of user information dictionaries, each containing:
    - 'ID': User identifier
    - 'Full_Name': User's full name

Examples:

>>> users = ml.user_list()
>>> for user in users:
...     print(f"{user['Full_Name']} ({user['ID']})")
Source code in src/deriva_ml/core/base.py
def user_list(self) -> List[Dict[str, str]]:
    """Returns catalog user list.

    Retrieves basic information about all users who have access to the catalog, including their
    identifiers and full names.

    Returns:
        List[Dict[str, str]]: List of user information dictionaries, each containing:
            - 'ID': User identifier
            - 'Full_Name': User's full name

    Examples:

        >>> users = ml.user_list()
        >>> for user in users:
        ...     print(f"{user['Full_Name']} ({user['ID']})")
    """
    # Get the user table path and fetch basic user info
    user_path = self.pathBuilder().public.ERMrest_Client.path
    return [{"ID": u["ID"], "Full_Name": u["Full_Name"]} for u in user_path.entities().fetch()]
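The returned list is plain dictionaries, so it composes directly with standard Python idioms. For example, building an ID-to-name lookup (the user records below are made-up sample data):

```python
# Sample records in the shape returned by user_list().
users = [
    {"ID": "https://auth.example.org/abc123", "Full_Name": "Ada Lovelace"},
    {"ID": "https://auth.example.org/def456", "Full_Name": "Alan Turing"},
]

# Map each user ID to the user's full name.
name_by_id = {u["ID"]: u["Full_Name"] for u in users}
```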

validate_schema

validate_schema(
    strict: bool = False,
) -> "SchemaValidationReport"

Validate that the catalog's ML schema matches the expected structure.

This method inspects the catalog schema and verifies that it contains all the required tables, columns, vocabulary terms, and relationships that are created by the ML schema initialization routines in create_schema.py.

The validation checks:
- All required ML tables exist (Dataset, Execution, Workflow, etc.)
- All required columns exist with correct types
- All required vocabulary tables exist (Asset_Type, Dataset_Type, etc.)
- All required vocabulary terms are initialized
- All association tables exist for relationships

In strict mode, the validator also reports errors for:
- Extra tables not in the expected schema
- Extra columns not in the expected table definitions

Parameters:

- strict (bool): If True, extra tables and columns are reported as errors. If False (default), they are reported as informational items. Use strict=True to verify a clean ML catalog matches exactly. Use strict=False to validate a catalog that may have domain extensions. [default: False]

Returns:

- SchemaValidationReport: Validation results. Key attributes:
    - is_valid: True if no errors were found
    - errors: List of error-level issues
    - warnings: List of warning-level issues
    - info: List of informational items
    - to_text(): Human-readable report
    - to_dict(): JSON-serializable dictionary

Example

ml = DerivaML('localhost', 'my_catalog')
report = ml.validate_schema(strict=False)
if report.is_valid:
    print("Schema is valid!")
else:
    print(report.to_text())

Strict validation for a fresh ML catalog

report = ml.validate_schema(strict=True)
print(f"Found {len(report.errors)} errors, {len(report.warnings)} warnings")

Get report as dictionary for JSON/logging

import json
print(json.dumps(report.to_dict(), indent=2))

Note

This method validates the ML schema (typically 'deriva-ml'), not the domain schema. Domain-specific tables and columns are not checked unless they are part of the ML schema itself.

See Also
  • deriva_ml.schema.validation.SchemaValidationReport
  • deriva_ml.schema.validation.validate_ml_schema
Source code in src/deriva_ml/core/base.py
def validate_schema(self, strict: bool = False) -> "SchemaValidationReport":
    """Validate that the catalog's ML schema matches the expected structure.

    This method inspects the catalog schema and verifies that it contains all
    the required tables, columns, vocabulary terms, and relationships that are
    created by the ML schema initialization routines in create_schema.py.

    The validation checks:
    - All required ML tables exist (Dataset, Execution, Workflow, etc.)
    - All required columns exist with correct types
    - All required vocabulary tables exist (Asset_Type, Dataset_Type, etc.)
    - All required vocabulary terms are initialized
    - All association tables exist for relationships

    In strict mode, the validator also reports errors for:
    - Extra tables not in the expected schema
    - Extra columns not in the expected table definitions

    Args:
        strict: If True, extra tables and columns are reported as errors.
               If False (default), they are reported as informational items.
               Use strict=True to verify a clean ML catalog matches exactly.
               Use strict=False to validate a catalog that may have domain extensions.

    Returns:
        SchemaValidationReport with validation results. Key attributes:
            - is_valid: True if no errors were found
            - errors: List of error-level issues
            - warnings: List of warning-level issues
            - info: List of informational items
            - to_text(): Human-readable report
            - to_dict(): JSON-serializable dictionary

    Example:
        >>> ml = DerivaML('localhost', 'my_catalog')
        >>> report = ml.validate_schema(strict=False)
        >>> if report.is_valid:
        ...     print("Schema is valid!")
        ... else:
        ...     print(report.to_text())

        >>> # Strict validation for a fresh ML catalog
        >>> report = ml.validate_schema(strict=True)
        >>> print(f"Found {len(report.errors)} errors, {len(report.warnings)} warnings")

        >>> # Get report as dictionary for JSON/logging
        >>> import json
        >>> print(json.dumps(report.to_dict(), indent=2))

    Note:
        This method validates the ML schema (typically 'deriva-ml'), not the
        domain schema. Domain-specific tables and columns are not checked
        unless they are part of the ML schema itself.

    See Also:
        - deriva_ml.schema.validation.SchemaValidationReport
        - deriva_ml.schema.validation.validate_ml_schema
    """
    from deriva_ml.schema.validation import validate_ml_schema
    return validate_ml_schema(self, strict=strict)
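A typical pattern is to treat errors as fatal and surface warnings separately. This can be sketched with a stand-in dataclass that mirrors the report attributes listed above (the real SchemaValidationReport lives in deriva_ml.schema.validation):

```python
from dataclasses import dataclass, field

@dataclass
class Report:
    """Stand-in mirroring SchemaValidationReport's key attributes."""
    errors: list[str] = field(default_factory=list)
    warnings: list[str] = field(default_factory=list)
    info: list[str] = field(default_factory=list)

    @property
    def is_valid(self) -> bool:
        # Valid means no error-level issues; warnings and info do not count.
        return not self.errors

report = Report(warnings=["Extra table: Scan"], info=["Extra column: Image.Notes"])
summary = "valid" if report.is_valid else f"{len(report.errors)} error(s)"
```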

DerivaMLConfig

Bases: BaseModel

Configuration model for DerivaML instances.

This Pydantic model defines all configurable parameters for a DerivaML instance. It can be used directly or via Hydra configuration files.

Attributes:

- hostname (str): Hostname of the Deriva server (e.g., 'deriva.example.org').
- catalog_id (str | int): Catalog identifier, either numeric ID or catalog name.
- domain_schemas (str | set[str] | None): Optional set of domain schema names. If None, auto-detects all non-system schemas. Use this when working with catalogs that have multiple user-defined schemas.
- default_schema (str | None): The default schema for table creation operations. If None and there is exactly one domain schema, that schema is used. If there are multiple domain schemas, this must be specified for table creation to work without explicit schema parameters.
- project_name (str | None): Project name for organizing outputs. Defaults to default_schema.
- cache_dir (str | Path | None): Directory for caching downloaded datasets. Defaults to working_dir/cache.
- working_dir (str | Path | None): Base directory for computation data. Defaults to ~/.deriva-ml.
- hydra_runtime_output_dir (str | Path | None): Hydra's runtime output directory (set automatically).
- ml_schema (str): Schema name for ML tables. Defaults to 'deriva-ml'.
- logging_level (Any): Logging level for DerivaML. Defaults to WARNING.
- deriva_logging_level (Any): Logging level for Deriva libraries. Defaults to WARNING.
- credential (Any): Authentication credentials. If None, retrieved automatically.
- s3_bucket (str | None): S3 bucket URL for dataset bag storage (e.g., 's3://my-bucket'). If provided, enables MINID creation and S3 upload for dataset exports. If None, MINID functionality is disabled regardless of the use_minid setting.
- use_minid (bool | None): Whether to use the MINID service for dataset bags. Only effective when s3_bucket is configured. Defaults to True when s3_bucket is set, False otherwise.
- check_auth (bool): Whether to verify authentication on connection. Defaults to True.
- clean_execution_dir (bool): Whether to automatically clean execution working directories after successful upload. Defaults to True. Set to False to retain local copies of execution outputs for debugging or manual inspection.

Example

config = DerivaMLConfig(
    hostname='deriva.example.org',
    catalog_id=1,
    default_schema='my_domain',
    logging_level=logging.INFO
)

Source code in src/deriva_ml/core/config.py
class DerivaMLConfig(BaseModel):
    """Configuration model for DerivaML instances.

    This Pydantic model defines all configurable parameters for a DerivaML instance.
    It can be used directly or via Hydra configuration files.

    Attributes:
        hostname: Hostname of the Deriva server (e.g., 'deriva.example.org').
        catalog_id: Catalog identifier, either numeric ID or catalog name.
        domain_schemas: Optional set of domain schema names. If None, auto-detects all
            non-system schemas. Use this when working with catalogs that have multiple
            user-defined schemas.
        default_schema: The default schema for table creation operations. If None and
            there is exactly one domain schema, that schema is used. If there are multiple
            domain schemas, this must be specified for table creation to work without
            explicit schema parameters.
        project_name: Project name for organizing outputs. Defaults to default_schema.
        cache_dir: Directory for caching downloaded datasets. Defaults to working_dir/cache.
        working_dir: Base directory for computation data. Defaults to ~/deriva-ml.
        hydra_runtime_output_dir: Hydra's runtime output directory (set automatically).
        ml_schema: Schema name for ML tables. Defaults to 'deriva-ml'.
        logging_level: Logging level for DerivaML. Defaults to WARNING.
        deriva_logging_level: Logging level for Deriva libraries. Defaults to WARNING.
        credential: Authentication credentials. If None, retrieved automatically.
        s3_bucket: S3 bucket URL for dataset bag storage (e.g., 's3://my-bucket').
            If provided, enables MINID creation and S3 upload for dataset exports.
            If None, MINID functionality is disabled regardless of use_minid setting.
        use_minid: Whether to use MINID service for dataset bags. Only effective when
            s3_bucket is configured. Defaults to True when s3_bucket is set, False otherwise.
        check_auth: Whether to verify authentication on connection. Defaults to True.
        clean_execution_dir: Whether to automatically clean execution working directories
            after successful upload. Defaults to True. Set to False to retain local copies
            of execution outputs for debugging or manual inspection.

    Example:
        >>> config = DerivaMLConfig(
        ...     hostname='deriva.example.org',
        ...     catalog_id=1,
        ...     default_schema='my_domain',
        ...     logging_level=logging.INFO
        ... )
    """

    hostname: str
    catalog_id: str | int = 1
    domain_schemas: str | set[str] | None = None
    default_schema: str | None = None
    project_name: str | None = None
    cache_dir: str | Path | None = None
    working_dir: str | Path | None = None
    hydra_runtime_output_dir: str | Path | None = None
    ml_schema: str = ML_SCHEMA
    logging_level: Any = logging.WARNING
    deriva_logging_level: Any = logging.WARNING
    credential: Any = None
    s3_bucket: str | None = None
    use_minid: bool | None = None  # None means "auto" - True if s3_bucket is set
    check_auth: bool = True
    clean_execution_dir: bool = True

    @model_validator(mode="after")
    def init_working_dir(self) -> "DerivaMLConfig":
        """Initialize working directory and resolve use_minid after model validation.

        Sets up the working directory path, computing a default if not specified.
        Also captures Hydra's runtime output directory for logging and outputs.

        Resolves the use_minid flag based on s3_bucket configuration:
        - If use_minid is explicitly set, use that value (but it only takes effect if s3_bucket is set)
        - If use_minid is None (auto), set it to True if s3_bucket is configured, False otherwise

        This validator runs after all field validation and ensures the working
        directory is available for Hydra configuration resolution.

        Returns:
            Self: The configuration instance with initialized paths.
        """
        self.working_dir = DerivaMLConfig.compute_workdir(self.working_dir, self.catalog_id, self.hostname)
        self.hydra_runtime_output_dir = Path(HydraConfig.get().runtime.output_dir)

        # Resolve use_minid based on s3_bucket configuration
        if self.use_minid is None:
            # Auto mode: enable MINID if s3_bucket is configured
            self.use_minid = self.s3_bucket is not None
        elif self.use_minid and self.s3_bucket is None:
            # User requested MINID but no S3 bucket configured - disable MINID
            self.use_minid = False

        return self

    @staticmethod
    def compute_workdir(
        working_dir: str | Path | None,
        catalog_id: str | int | None = None,
        hostname: str | None = None,
    ) -> Path:
        """Compute the effective working directory path.

        Creates a standardized working directory path. If a base directory is provided,
        appends the current username to prevent conflicts between users. If no directory
        is provided, uses ~/.deriva-ml. The hostname and catalog_id are appended to
        separate data from different servers and catalogs.

        Args:
            working_dir: Base working directory path, or None for default.
            catalog_id: Catalog identifier to include in the path. If None, no
                       catalog subdirectory is created.
            hostname: Server hostname to include in the path. If None, no
                     hostname subdirectory is created.

        Returns:
            Path: Absolute path to the working directory.

        Example:
            >>> DerivaMLConfig.compute_workdir('/shared/data', '52', 'ml.example.org')
            PosixPath('/shared/data/username/deriva-ml/ml.example.org/52')
            >>> DerivaMLConfig.compute_workdir(None, 1, 'localhost')
            PosixPath('/home/username/.deriva-ml/localhost/1')
        """
        # Append username and deriva-ml to provided path, or use ~/.deriva-ml as base
        if working_dir:
            base_dir = Path(working_dir) / getpass.getuser() / "deriva-ml"
        else:
            base_dir = Path.home() / ".deriva-ml"
        # Append hostname if provided to separate data from different servers
        if hostname is not None:
            base_dir = base_dir / hostname
        # Append catalog_id if provided
        if catalog_id is not None:
            base_dir = base_dir / str(catalog_id)
        return base_dir.absolute()

compute_workdir staticmethod

compute_workdir(
    working_dir: str | Path | None,
    catalog_id: str | int | None = None,
    hostname: str | None = None,
) -> Path

Compute the effective working directory path.

Creates a standardized working directory path. If a base directory is provided, appends the current username to prevent conflicts between users. If no directory is provided, uses ~/.deriva-ml. The hostname and catalog_id are appended to separate data from different servers and catalogs.

Parameters:

- working_dir (str | Path | None): Base working directory path, or None for default. [required]
- catalog_id (str | int | None): Catalog identifier to include in the path. If None, no catalog subdirectory is created. [default: None]
- hostname (str | None): Server hostname to include in the path. If None, no hostname subdirectory is created. [default: None]

Returns:

- Path: Absolute path to the working directory.

Example

DerivaMLConfig.compute_workdir('/shared/data', '52', 'ml.example.org')
PosixPath('/shared/data/username/deriva-ml/ml.example.org/52')
DerivaMLConfig.compute_workdir(None, 1, 'localhost')
PosixPath('/home/username/.deriva-ml/localhost/1')

Source code in src/deriva_ml/core/config.py
@staticmethod
def compute_workdir(
    working_dir: str | Path | None,
    catalog_id: str | int | None = None,
    hostname: str | None = None,
) -> Path:
    """Compute the effective working directory path.

    Creates a standardized working directory path. If a base directory is provided,
    appends the current username to prevent conflicts between users. If no directory
    is provided, uses ~/.deriva-ml. The hostname and catalog_id are appended to
    separate data from different servers and catalogs.

    Args:
        working_dir: Base working directory path, or None for default.
        catalog_id: Catalog identifier to include in the path. If None, no
                   catalog subdirectory is created.
        hostname: Server hostname to include in the path. If None, no
                 hostname subdirectory is created.

    Returns:
        Path: Absolute path to the working directory.

    Example:
        >>> DerivaMLConfig.compute_workdir('/shared/data', '52', 'ml.example.org')
        PosixPath('/shared/data/username/deriva-ml/ml.example.org/52')
        >>> DerivaMLConfig.compute_workdir(None, 1, 'localhost')
        PosixPath('/home/username/.deriva-ml/localhost/1')
    """
    # Append username and deriva-ml to provided path, or use ~/.deriva-ml as base
    if working_dir:
        base_dir = Path(working_dir) / getpass.getuser() / "deriva-ml"
    else:
        base_dir = Path.home() / ".deriva-ml"
    # Append hostname if provided to separate data from different servers
    if hostname is not None:
        base_dir = base_dir / hostname
    # Append catalog_id if provided
    if catalog_id is not None:
        base_dir = base_dir / str(catalog_id)
    return base_dir.absolute()

init_working_dir

init_working_dir() -> DerivaMLConfig

Initialize working directory and resolve use_minid after model validation.

Sets up the working directory path, computing a default if not specified. Also captures Hydra's runtime output directory for logging and outputs.

Resolves the use_minid flag based on s3_bucket configuration:
- If use_minid is explicitly set, use that value (but it only takes effect if s3_bucket is set)
- If use_minid is None (auto), set it to True if s3_bucket is configured, False otherwise

This validator runs after all field validation and ensures the working directory is available for Hydra configuration resolution.

Returns:

- DerivaMLConfig: The configuration instance with initialized paths.

Source code in src/deriva_ml/core/config.py
@model_validator(mode="after")
def init_working_dir(self) -> "DerivaMLConfig":
    """Initialize working directory and resolve use_minid after model validation.

    Sets up the working directory path, computing a default if not specified.
    Also captures Hydra's runtime output directory for logging and outputs.

    Resolves the use_minid flag based on s3_bucket configuration:
    - If use_minid is explicitly set, use that value (but it only takes effect if s3_bucket is set)
    - If use_minid is None (auto), set it to True if s3_bucket is configured, False otherwise

    This validator runs after all field validation and ensures the working
    directory is available for Hydra configuration resolution.

    Returns:
        Self: The configuration instance with initialized paths.
    """
    self.working_dir = DerivaMLConfig.compute_workdir(self.working_dir, self.catalog_id, self.hostname)
    self.hydra_runtime_output_dir = Path(HydraConfig.get().runtime.output_dir)

    # Resolve use_minid based on s3_bucket configuration
    if self.use_minid is None:
        # Auto mode: enable MINID if s3_bucket is configured
        self.use_minid = self.s3_bucket is not None
    elif self.use_minid and self.s3_bucket is None:
        # User requested MINID but no S3 bucket configured - disable MINID
        self.use_minid = False

    return self

DerivaMLException

Bases: Exception

Base exception class for all DerivaML errors.

This is the root exception for all DerivaML-specific errors. Catching this exception will catch any error raised by the DerivaML library.

Attributes:

- _msg: The error message stored for later access.

Parameters:

- msg (str): Descriptive error message. Defaults to empty string. [default: '']

Example

raise DerivaMLException("Failed to connect to catalog")
DerivaMLException: Failed to connect to catalog

Source code in src/deriva_ml/core/exceptions.py
class DerivaMLException(Exception):
    """Base exception class for all DerivaML errors.

    This is the root exception for all DerivaML-specific errors. Catching this
    exception will catch any error raised by the DerivaML library.

    Attributes:
        _msg: The error message stored for later access.

    Args:
        msg: Descriptive error message. Defaults to empty string.

    Example:
        >>> raise DerivaMLException("Failed to connect to catalog")
        DerivaMLException: Failed to connect to catalog
    """

    def __init__(self, msg: str = "") -> None:
        super().__init__(msg)
        self._msg = msg
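Because every DerivaML error derives from DerivaMLException, a single except clause can handle them all. A small sketch with stand-in definitions (DerivaMLNotFoundError mirrors one of the specialized subclasses listed in the exports; the `lookup` helper is hypothetical):

```python
# Stand-in definitions mirroring the hierarchy above.
class DerivaMLException(Exception):
    def __init__(self, msg: str = "") -> None:
        super().__init__(msg)
        self._msg = msg

class DerivaMLNotFoundError(DerivaMLException):
    pass

def lookup(rid: str) -> None:
    # Hypothetical helper that fails for any RID.
    raise DerivaMLNotFoundError(f"RID {rid} not found")

try:
    lookup("1-ABC")
except DerivaMLException as e:  # catches any DerivaML error
    print(f"caught: {e}")  # → caught: RID 1-ABC not found
```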

DerivaMLInvalidTerm

Bases: DerivaMLNotFoundError

Exception raised when a vocabulary term is not found or invalid.

Raised when attempting to look up or use a term that doesn't exist in a controlled vocabulary table, or when a term name/synonym cannot be resolved.

Parameters:

Name Type Description Default
vocabulary str

Name of the vocabulary table being searched.

required
term str

The term name that was not found.

required
msg str

Additional context about the error. Defaults to "Term doesn't exist".

"Term doesn't exist"
Example

>>> raise DerivaMLInvalidTerm("Diagnosis", "unknown_condition")
DerivaMLInvalidTerm: Invalid term unknown_condition in vocabulary Diagnosis: Term doesn't exist.

Source code in src/deriva_ml/core/exceptions.py
class DerivaMLInvalidTerm(DerivaMLNotFoundError):
    """Exception raised when a vocabulary term is not found or invalid.

    Raised when attempting to look up or use a term that doesn't exist in
    a controlled vocabulary table, or when a term name/synonym cannot be resolved.

    Args:
        vocabulary: Name of the vocabulary table being searched.
        term: The term name that was not found.
        msg: Additional context about the error. Defaults to "Term doesn't exist".

    Example:
        >>> raise DerivaMLInvalidTerm("Diagnosis", "unknown_condition")
        DerivaMLInvalidTerm: Invalid term unknown_condition in vocabulary Diagnosis: Term doesn't exist.
    """

    def __init__(self, vocabulary: str, term: str, msg: str = "Term doesn't exist") -> None:
        super().__init__(f"Invalid term {term} in vocabulary {vocabulary}: {msg}.")
        self.vocabulary = vocabulary
        self.term = term

DerivaMLTableTypeError

Bases: DerivaMLDataError

Exception raised when a RID or table is not of the expected type.

Raised when an operation requires a specific table type (e.g., Dataset, Execution) but receives a RID or table reference of a different type.

Parameters:

Name Type Description Default
table_type str

The expected table type (e.g., "Dataset", "Execution").

required
table str

The actual table name or RID that was provided.

required
Example

>>> raise DerivaMLTableTypeError("Dataset", "1-ABC123")
DerivaMLTableTypeError: Table 1-ABC123 is not of type Dataset.

Source code in src/deriva_ml/core/exceptions.py
class DerivaMLTableTypeError(DerivaMLDataError):
    """Exception raised when a RID or table is not of the expected type.

    Raised when an operation requires a specific table type (e.g., Dataset,
    Execution) but receives a RID or table reference of a different type.

    Args:
        table_type: The expected table type (e.g., "Dataset", "Execution").
        table: The actual table name or RID that was provided.

    Example:
        >>> raise DerivaMLTableTypeError("Dataset", "1-ABC123")
        DerivaMLTableTypeError: Table 1-ABC123 is not of type Dataset.
    """

    def __init__(self, table_type: str, table: str) -> None:
        super().__init__(f"Table {table} is not of type {table_type}.")
        self.table_type = table_type
        self.table = table

ExecAssetType

Bases: BaseStrEnum

Execution asset type identifiers.

Defines the types of assets that can be produced or consumed during an execution. These types are used to categorize files associated with workflow runs.

Attributes:

Name Type Description
input_file str

Input file consumed by the execution.

output_file str

Output file produced by the execution.

notebook_output str

Jupyter notebook output from the execution.

model_file str

Machine learning model file (e.g., .pkl, .h5, .pt).

Source code in src/deriva_ml/core/enums.py
class ExecAssetType(BaseStrEnum):
    """Execution asset type identifiers.

    Defines the types of assets that can be produced or consumed during an execution.
    These types are used to categorize files associated with workflow runs.

    Attributes:
        input_file (str): Input file consumed by the execution.
        output_file (str): Output file produced by the execution.
        notebook_output (str): Jupyter notebook output from the execution.
        model_file (str): Machine learning model file (e.g., .pkl, .h5, .pt).
    """

    input_file = "Input_File"
    output_file = "Output_File"
    notebook_output = "Notebook_Output"
    model_file = "Model_File"

ExecMetadataType

Bases: BaseStrEnum

Execution metadata type identifiers.

Defines the types of metadata that can be associated with an execution.

Attributes:

Name Type Description
execution_config str

General execution configuration data.

runtime_env str

Runtime environment information.

hydra_config str

Hydra YAML configuration files (config.yaml, overrides.yaml).

deriva_config str

DerivaML execution configuration (configuration.json).

Source code in src/deriva_ml/core/enums.py
class ExecMetadataType(BaseStrEnum):
    """Execution metadata type identifiers.

    Defines the types of metadata that can be associated with an execution.

    Attributes:
        execution_config (str): General execution configuration data.
        runtime_env (str): Runtime environment information.
        hydra_config (str): Hydra YAML configuration files (config.yaml, overrides.yaml).
        deriva_config (str): DerivaML execution configuration (configuration.json).
    """

    execution_config = "Execution_Config"
    runtime_env = "Runtime_Env"
    hydra_config = "Hydra_Config"
    deriva_config = "Deriva_Config"

FileSpec

Bases: BaseModel

Specification for a file to be added to the Deriva catalog.

Represents file metadata required for creating entries in the File table. Handles URL normalization, ensuring local file paths are converted to tag URIs that uniquely identify the file's origin.

Attributes:

Name Type Description
url str

File location as URL or local path. Local paths are converted to tag URIs.

md5 str

MD5 checksum for integrity verification.

length int

File size in bytes.

description str | None

Optional description of the file's contents or purpose.

file_types list[str] | None

List of file type classifications from the Asset_Type vocabulary.

Note

The 'File' type is automatically added to file_types if not present when using create_filespecs().

Example

>>> spec = FileSpec(
...     url="/data/results.csv",
...     md5="d41d8cd98f00b204e9800998ecf8427e",
...     length=1024,
...     description="Analysis results",
...     file_types=["CSV", "Data"]
... )

Source code in src/deriva_ml/core/filespec.py
class FileSpec(BaseModel):
    """Specification for a file to be added to the Deriva catalog.

    Represents file metadata required for creating entries in the File table.
    Handles URL normalization, ensuring local file paths are converted to
    tag URIs that uniquely identify the file's origin.

    Attributes:
        url: File location as URL or local path. Local paths are converted to tag URIs.
        md5: MD5 checksum for integrity verification.
        length: File size in bytes.
        description: Optional description of the file's contents or purpose.
        file_types: List of file type classifications from the Asset_Type vocabulary.

    Note:
        The 'File' type is automatically added to file_types if not present when
        using create_filespecs().

    Example:
        >>> spec = FileSpec(
        ...     url="/data/results.csv",
        ...     md5="d41d8cd98f00b204e9800998ecf8427e",
        ...     length=1024,
        ...     description="Analysis results",
        ...     file_types=["CSV", "Data"]
        ... )
    """

    model_config = {"populate_by_name": True}

    url: str = Field(alias="URL")
    md5: str = Field(alias="MD5")
    length: int = Field(alias="Length")
    description: str | None = Field(default="", alias="Description")
    file_types: list[str] | None = Field(default_factory=list)

    @field_validator("url")
    @classmethod
    def validate_file_url(cls, url: str) -> str:
        """Examine the provided URL. If it's a local path, convert it into a tag URL.

        Args:
            url: The URL to validate and potentially convert

        Returns:
            The validated/converted URL

        Raises:
            ValidationError: If the URL is not a file URL
        """
        url_parts = urlparse(url)
        if url_parts.scheme == "tag":
            # Already a tag URL, so just return it.
            return url
        elif (not url_parts.scheme) or url_parts.scheme == "file":
            # There is no scheme part of the URL, or it is a file URL, so it is a local file path.
            # Convert to a tag URL.
            return f"tag://{gethostname()},{date.today()}:file://{url_parts.path}"
        else:
            raise ValueError("url is not a file URL")

    @classmethod
    def create_filespecs(
        cls, path: Path | str, description: str, file_types: list[str] | Callable[[Path], list[str]] | None = None
    ) -> Generator[FileSpec, None, None]:
        """Generate FileSpec objects for a file or directory.

        Creates FileSpec objects with computed MD5 checksums for each file found.
        For directories, recursively processes all files. The 'File' type is
        automatically prepended to file_types if not already present.

        Args:
            path: Path to a file or directory. If directory, all files are processed recursively.
            description: Description to apply to all generated FileSpecs.
            file_types: Either a static list of file types, or a callable that takes a Path
                and returns a list of types for that specific file. Allows dynamic type
                assignment based on file extension, content, etc.

        Yields:
            FileSpec: A specification for each file with computed checksums and metadata.

        Example:
            Static file types:
                >>> specs = FileSpec.create_filespecs("/data/images", "Images", ["Image"])

            Dynamic file types based on extension:
                >>> def get_types(path):
                ...     ext = path.suffix.lower()
                ...     return {"png": ["PNG", "Image"], ".jpg": ["JPEG", "Image"]}.get(ext, [])
                >>> specs = FileSpec.create_filespecs("/data", "Mixed files", get_types)
        """
        path = Path(path)
        file_types = file_types or []
        # Convert static list to callable for uniform handling
        file_types_fn = file_types if callable(file_types) else lambda _x: file_types

        def create_spec(file_path: Path) -> FileSpec:
            """Create a FileSpec for a single file with computed hashes."""
            hashes = hash_utils.compute_file_hashes(file_path, hashes=frozenset(["md5", "sha256"]))
            md5 = hashes["md5"][0]
            type_list = file_types_fn(file_path)
            return FileSpec(
                length=file_path.stat().st_size,
                md5=md5,
                description=description,
                url=file_path.as_posix(),
                # Ensure 'File' type is always included
                file_types=type_list if "File" in type_list else ["File"] + type_list,
            )

        # Handle both single files and directories (recursive)
        files = [path] if path.is_file() else [f for f in Path(path).rglob("*") if f.is_file()]
        return (create_spec(file) for file in files)

    @staticmethod
    def read_filespec(path: Path | str) -> Generator[FileSpec, None, None]:
        """Read FileSpec objects from a JSON Lines file.

        Parses a JSONL file where each line is a JSON object representing a FileSpec.
        Empty lines are skipped. This is useful for batch processing pre-computed
        file specifications.

        Args:
            path: Path to the .jsonl file containing FileSpec data.

        Yields:
            FileSpec: Parsed FileSpec object for each valid line.

        Example:
            >>> for spec in FileSpec.read_filespec("files.jsonl"):
            ...     print(f"{spec.url}: {spec.md5}")
        """
        path = Path(path)
        with path.open("r", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                yield FileSpec(**json.loads(line))

create_filespecs classmethod

create_filespecs(
    path: Path | str,
    description: str,
    file_types: list[str]
    | Callable[[Path], list[str]]
    | None = None,
) -> Generator[FileSpec, None, None]

Generate FileSpec objects for a file or directory.

Creates FileSpec objects with computed MD5 checksums for each file found. For directories, recursively processes all files. The 'File' type is automatically prepended to file_types if not already present.

Parameters:

Name Type Description Default
path Path | str

Path to a file or directory. If directory, all files are processed recursively.

required
description str

Description to apply to all generated FileSpecs.

required
file_types list[str] | Callable[[Path], list[str]] | None

Either a static list of file types, or a callable that takes a Path and returns a list of types for that specific file. Allows dynamic type assignment based on file extension, content, etc.

None

Yields:

Name Type Description
FileSpec FileSpec

A specification for each file with computed checksums and metadata.

Example

Static file types:

>>> specs = FileSpec.create_filespecs("/data/images", "Images", ["Image"])

Dynamic file types based on extension:

>>> def get_types(path):
...     ext = path.suffix.lower()
...     return {".png": ["PNG", "Image"], ".jpg": ["JPEG", "Image"]}.get(ext, [])
>>> specs = FileSpec.create_filespecs("/data", "Mixed files", get_types)

Source code in src/deriva_ml/core/filespec.py
@classmethod
def create_filespecs(
    cls, path: Path | str, description: str, file_types: list[str] | Callable[[Path], list[str]] | None = None
) -> Generator[FileSpec, None, None]:
    """Generate FileSpec objects for a file or directory.

    Creates FileSpec objects with computed MD5 checksums for each file found.
    For directories, recursively processes all files. The 'File' type is
    automatically prepended to file_types if not already present.

    Args:
        path: Path to a file or directory. If directory, all files are processed recursively.
        description: Description to apply to all generated FileSpecs.
        file_types: Either a static list of file types, or a callable that takes a Path
            and returns a list of types for that specific file. Allows dynamic type
            assignment based on file extension, content, etc.

    Yields:
        FileSpec: A specification for each file with computed checksums and metadata.

    Example:
        Static file types:
            >>> specs = FileSpec.create_filespecs("/data/images", "Images", ["Image"])

        Dynamic file types based on extension:
            >>> def get_types(path):
            ...     ext = path.suffix.lower()
            ...     return {"png": ["PNG", "Image"], ".jpg": ["JPEG", "Image"]}.get(ext, [])
            >>> specs = FileSpec.create_filespecs("/data", "Mixed files", get_types)
    """
    path = Path(path)
    file_types = file_types or []
    # Convert static list to callable for uniform handling
    file_types_fn = file_types if callable(file_types) else lambda _x: file_types

    def create_spec(file_path: Path) -> FileSpec:
        """Create a FileSpec for a single file with computed hashes."""
        hashes = hash_utils.compute_file_hashes(file_path, hashes=frozenset(["md5", "sha256"]))
        md5 = hashes["md5"][0]
        type_list = file_types_fn(file_path)
        return FileSpec(
            length=file_path.stat().st_size,
            md5=md5,
            description=description,
            url=file_path.as_posix(),
            # Ensure 'File' type is always included
            file_types=type_list if "File" in type_list else ["File"] + type_list,
        )

    # Handle both single files and directories (recursive)
    files = [path] if path.is_file() else [f for f in Path(path).rglob("*") if f.is_file()]
    return (create_spec(file) for file in files)
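The per-file checksum step can be illustrated with plain hashlib (the library itself delegates to deriva's `hash_utils.compute_file_hashes`; `md5_of` is a stand-in for illustration only):

```python
import hashlib
import os
import tempfile

# Sketch of the per-file MD5 computation create_filespecs performs,
# reading in chunks so large files are not loaded into memory at once.
def md5_of(path: str) -> str:
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

with tempfile.NamedTemporaryFile(delete=False) as f:
    name = f.name  # empty file
print(md5_of(name))  # → d41d8cd98f00b204e9800998ecf8427e
os.unlink(name)
```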

read_filespec staticmethod

read_filespec(
    path: Path | str,
) -> Generator[FileSpec, None, None]

Read FileSpec objects from a JSON Lines file.

Parses a JSONL file where each line is a JSON object representing a FileSpec. Empty lines are skipped. This is useful for batch processing pre-computed file specifications.

Parameters:

Name Type Description Default
path Path | str

Path to the .jsonl file containing FileSpec data.

required

Yields:

Name Type Description
FileSpec FileSpec

Parsed FileSpec object for each valid line.

Example

>>> for spec in FileSpec.read_filespec("files.jsonl"):
...     print(f"{spec.url}: {spec.md5}")

Source code in src/deriva_ml/core/filespec.py
@staticmethod
def read_filespec(path: Path | str) -> Generator[FileSpec, None, None]:
    """Read FileSpec objects from a JSON Lines file.

    Parses a JSONL file where each line is a JSON object representing a FileSpec.
    Empty lines are skipped. This is useful for batch processing pre-computed
    file specifications.

    Args:
        path: Path to the .jsonl file containing FileSpec data.

    Yields:
        FileSpec: Parsed FileSpec object for each valid line.

    Example:
        >>> for spec in FileSpec.read_filespec("files.jsonl"):
        ...     print(f"{spec.url}: {spec.md5}")
    """
    path = Path(path)
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            yield FileSpec(**json.loads(line))
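The JSONL convention read_filespec expects, one JSON object per line with blank lines skipped, can be sketched without pydantic (field names use the aliased catalog form; the values are made-up placeholders):

```python
import io
import json

# Two records separated by a blank line, mirroring a .jsonl input file.
jsonl = (
    '{"URL": "tag://h,2024-01-01:file:///a.csv", "MD5": "abc", "Length": 10}\n'
    "\n"
    '{"URL": "tag://h,2024-01-01:file:///b.csv", "MD5": "def", "Length": 20}\n'
)

def read_specs(lines):
    # Skip blank lines; parse each remaining line as one JSON object,
    # as read_filespec does before constructing FileSpec instances.
    for line in lines:
        line = line.strip()
        if not line:
            continue
        yield json.loads(line)

specs = list(read_specs(io.StringIO(jsonl)))
print(len(specs))          # → 2
print(specs[1]["Length"])  # → 20
```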

validate_file_url classmethod

validate_file_url(url: str) -> str

Examine the provided URL. If it's a local path, convert it into a tag URL.

Parameters:

Name Type Description Default
url str

The URL to validate and potentially convert

required

Returns:

Type Description
str

The validated/converted URL

Raises:

Type Description
ValidationError

If the URL is not a file URL

Source code in src/deriva_ml/core/filespec.py
@field_validator("url")
@classmethod
def validate_file_url(cls, url: str) -> str:
    """Examine the provided URL. If it's a local path, convert it into a tag URL.

    Args:
        url: The URL to validate and potentially convert

    Returns:
        The validated/converted URL

    Raises:
        ValidationError: If the URL is not a file URL
    """
    url_parts = urlparse(url)
    if url_parts.scheme == "tag":
        # Already a tag URL, so just return it.
        return url
    elif (not url_parts.scheme) or url_parts.scheme == "file":
        # There is no scheme part of the URL, or it is a file URL, so it is a local file path.
        # Convert to a tag URL.
        return f"tag://{gethostname()},{date.today()}:file://{url_parts.path}"
    else:
        raise ValueError("url is not a file URL")
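The normalization rule above can be reproduced as a standalone sketch (`normalize_file_url` is a hypothetical mirror of the validator, not the library function; the resulting tag URI embeds the local hostname and today's date, so its exact value varies):

```python
from datetime import date
from socket import gethostname
from urllib.parse import urlparse

def normalize_file_url(url: str) -> str:
    # Pass tag URIs through unchanged; convert bare paths and file://
    # URLs to host- and date-stamped tag URIs; reject everything else.
    parts = urlparse(url)
    if parts.scheme == "tag":
        return url
    if not parts.scheme or parts.scheme == "file":
        return f"tag://{gethostname()},{date.today()}:file://{parts.path}"
    raise ValueError("url is not a file URL")

print(normalize_file_url("/data/results.csv").startswith("tag://"))  # → True
```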

FileUploadState

Bases: BaseModel

Tracks the state and result of a file upload operation.

Attributes:

Name Type Description
state UploadState

Current state of the upload (success, failed, etc.).

status str

Detailed status message.

result Any

Upload result data, if any.

Source code in src/deriva_ml/core/ermrest.py
class FileUploadState(BaseModel):
    """Tracks the state and result of a file upload operation.

    Attributes:
        state (UploadState): Current state of the upload (success, failed, etc.).
        status (str): Detailed status message.
        result (Any): Upload result data, if any.
    """
    state: UploadState
    status: str
    result: Any

    @computed_field
    @property
    def rid(self) -> RID | None:
        return self.result and self.result["RID"]

LoggerMixin

Mixin class that provides a _logger attribute.

Classes that inherit from this mixin get a _logger property that returns a child logger under the deriva_ml namespace, named after the class.

Example

>>> class MyProcessor(LoggerMixin):
...     def process(self):
...         self._logger.info("Processing started")
...
>>> # Logs to 'deriva_ml.MyProcessor'

Source code in src/deriva_ml/core/logging_config.py
class LoggerMixin:
    """Mixin class that provides a _logger attribute.

    Classes that inherit from this mixin get a _logger property that
    returns a child logger under the deriva_ml namespace, named after
    the class.

    Example:
        >>> class MyProcessor(LoggerMixin):
        ...     def process(self):
        ...         self._logger.info("Processing started")
        ...
        >>> # Logs to 'deriva_ml.MyProcessor'
    """

    @property
    def _logger(self) -> logging.Logger:
        """Get the logger for this class."""
        return get_logger(self.__class__.__name__)

MLAsset

Bases: BaseStrEnum

Asset type identifiers.

Defines the types of assets that can be associated with executions.

Attributes:

Name Type Description
execution_metadata str

Metadata about an execution.

execution_asset str

Asset produced by an execution.

Source code in src/deriva_ml/core/enums.py
class MLAsset(BaseStrEnum):
    """Asset type identifiers.

    Defines the types of assets that can be associated with executions.

    Attributes:
        execution_metadata (str): Metadata about an execution.
        execution_asset (str): Asset produced by an execution.
    """

    execution_metadata = "Execution_Metadata"
    execution_asset = "Execution_Asset"

MLVocab

Bases: BaseStrEnum

Controlled vocabulary table identifiers.

Defines the names of controlled vocabulary tables used in DerivaML. These tables store standardized terms with descriptions and synonyms for consistent data classification across the catalog.

Attributes:

Name Type Description
dataset_type str

Dataset classification vocabulary (e.g., "Training", "Test").

workflow_type str

Workflow classification vocabulary (e.g., "Python", "Notebook").

asset_type str

Asset/file type classification vocabulary (e.g., "Image", "CSV").

asset_role str

Asset role vocabulary for execution relationships (e.g., "Input", "Output").

feature_name str

Feature name vocabulary for ML feature definitions.

Source code in src/deriva_ml/core/enums.py
class MLVocab(BaseStrEnum):
    """Controlled vocabulary table identifiers.

    Defines the names of controlled vocabulary tables used in DerivaML. These tables
    store standardized terms with descriptions and synonyms for consistent data
    classification across the catalog.

    Attributes:
        dataset_type (str): Dataset classification vocabulary (e.g., "Training", "Test").
        workflow_type (str): Workflow classification vocabulary (e.g., "Python", "Notebook").
        asset_type (str): Asset/file type classification vocabulary (e.g., "Image", "CSV").
        asset_role (str): Asset role vocabulary for execution relationships (e.g., "Input", "Output").
        feature_name (str): Feature name vocabulary for ML feature definitions.
    """

    dataset_type = "Dataset_Type"
    workflow_type = "Workflow_Type"
    asset_type = "Asset_Type"
    asset_role = "Asset_Role"
    feature_name = "Feature_Name"

UploadState

Bases: Enum

File upload operation states.

Represents the various states a file upload operation can be in, from initiation to completion.

Attributes:

Name Type Description
success int

Upload completed successfully.

failed int

Upload failed.

pending int

Upload is queued.

running int

Upload is in progress.

paused int

Upload is temporarily paused.

aborted int

Upload was aborted.

cancelled int

Upload was cancelled.

timeout int

Upload timed out.

Source code in src/deriva_ml/core/enums.py
class UploadState(Enum):
    """File upload operation states.

    Represents the various states a file upload operation can be in, from initiation to completion.

    Attributes:
        success (int): Upload completed successfully.
        failed (int): Upload failed.
        pending (int): Upload is queued.
        running (int): Upload is in progress.
        paused (int): Upload is temporarily paused.
        aborted (int): Upload was aborted.
        cancelled (int): Upload was cancelled.
        timeout (int): Upload timed out.
    """

    success = 0
    failed = 1
    pending = 2
    running = 3
    paused = 4
    aborted = 5
    cancelled = 6
    timeout = 7

configure_logging

configure_logging(
    level: int = logging.WARNING,
    deriva_level: int | None = None,
    format_string: str = DEFAULT_FORMAT,
    handler: Handler | None = None,
) -> logging.Logger

Configure logging for DerivaML and related libraries.

This function sets up logging levels for DerivaML, related libraries (deriva-py, bdbag, bagit), and Hydra loggers. It is designed to:

  1. Configure only specific logger namespaces, not the root logger
  2. Respect Hydra's logging configuration when running under Hydra
  3. Allow deriva-py libraries to have a separate logging level
The logging level hierarchy
  • deriva_ml logger: uses level
  • Hydra loggers: follow level (deriva_ml level)
  • Deriva/bdbag/bagit loggers: use deriva_level (defaults to level)
When running under Hydra
  • Only sets log levels on specific loggers
  • Does NOT add handlers (Hydra has already configured them)
  • Does NOT call basicConfig()

When running standalone (no Hydra)
  • Sets log levels on specific loggers
  • Adds a StreamHandler to deriva_ml logger if none exists
  • Still does NOT touch the root logger or call basicConfig()

Parameters:

Name Type Description Default
level int

Log level for deriva_ml and Hydra loggers. Defaults to WARNING.

WARNING
deriva_level int | None

Log level for deriva-py libraries (deriva, bagit, bdbag). If None, uses the same level as level.

None
format_string str

Format string for log messages (used only when adding handlers outside Hydra context).

DEFAULT_FORMAT
handler Handler | None

Optional handler to add to the deriva_ml logger. If None and not running under Hydra, uses StreamHandler with format_string.

None

Returns:

Type Description
Logger

The configured deriva_ml logger.

Example

import logging

Same level for everything

configure_logging(level=logging.DEBUG)

Verbose DerivaML, quieter deriva-py libraries

configure_logging(
    level=logging.INFO,
    deriva_level=logging.WARNING,
)

Source code in src/deriva_ml/core/logging_config.py
def configure_logging(
    level: int = logging.WARNING,
    deriva_level: int | None = None,
    format_string: str = DEFAULT_FORMAT,
    handler: logging.Handler | None = None,
) -> logging.Logger:
    """Configure logging for DerivaML and related libraries.

    This function sets up logging levels for DerivaML, related libraries
    (deriva-py, bdbag, bagit), and Hydra loggers. It is designed to:

    1. Configure only specific logger namespaces, not the root logger
    2. Respect Hydra's logging configuration when running under Hydra
    3. Allow deriva-py libraries to have a separate logging level

    The logging level hierarchy:
        - deriva_ml logger: uses `level`
        - Hydra loggers: follow `level` (deriva_ml level)
        - Deriva/bdbag/bagit loggers: use `deriva_level` (defaults to `level`)

    When running under Hydra:
        - Only sets log levels on specific loggers
        - Does NOT add handlers (Hydra has already configured them)
        - Does NOT call basicConfig()

    When running standalone (no Hydra):
        - Sets log levels on specific loggers
        - Adds a StreamHandler to deriva_ml logger if none exists
        - Still does NOT touch the root logger or call basicConfig()

    Args:
        level: Log level for deriva_ml and Hydra loggers. Defaults to WARNING.
        deriva_level: Log level for deriva-py libraries (deriva, bagit, bdbag).
                     If None, uses the same level as `level`.
        format_string: Format string for log messages (used only when adding
                      handlers outside Hydra context).
        handler: Optional handler to add to the deriva_ml logger. If None and
                not running under Hydra, uses StreamHandler with format_string.

    Returns:
        The configured deriva_ml logger.

    Example:
        >>> import logging
        >>> # Same level for everything
        >>> configure_logging(level=logging.DEBUG)
        >>>
        >>> # Verbose DerivaML, quieter deriva-py libraries
        >>> configure_logging(
        ...     level=logging.INFO,
        ...     deriva_level=logging.WARNING,
        ... )
    """
    if deriva_level is None:
        deriva_level = level

    # Configure main DerivaML logger
    logger = get_logger()
    logger.setLevel(level)

    # Configure Hydra loggers to follow deriva_ml level
    for logger_name in HYDRA_LOGGERS:
        logging.getLogger(logger_name).setLevel(level)

    # Configure deriva-py and related library loggers
    for logger_name in DERIVA_LOGGERS:
        logging.getLogger(logger_name).setLevel(deriva_level)

    # Only add handlers if not running under Hydra
    # Hydra configures handlers via dictConfig, we don't want to duplicate
    if not is_hydra_initialized():
        if not logger.handlers:
            if handler is None:
                handler = logging.StreamHandler()
                handler.setFormatter(logging.Formatter(format_string))
            logger.addHandler(handler)

    return logger

get_logger

get_logger(
    name: str | None = None,
) -> logging.Logger

Get a DerivaML logger.

Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| `name` | `str \| None` | Optional sub-logger name. If provided, returns a child logger under the deriva_ml namespace (e.g., 'deriva_ml.dataset'). If None, returns the main deriva_ml logger. | `None` |

Returns:

| Type | Description |
| ---- | ----------- |
| `Logger` | The configured logger instance. |

Example

    logger = get_logger()                   # Main deriva_ml logger
    dataset_logger = get_logger("dataset")  # deriva_ml.dataset

Source code in src/deriva_ml/core/logging_config.py
def get_logger(name: str | None = None) -> logging.Logger:
    """Get a DerivaML logger.

    Args:
        name: Optional sub-logger name. If provided, returns a child logger
              under the deriva_ml namespace (e.g., 'deriva_ml.dataset').
              If None, returns the main deriva_ml logger.

    Returns:
        The configured logger instance.

    Example:
        >>> logger = get_logger()  # Main deriva_ml logger
        >>> dataset_logger = get_logger("dataset")  # deriva_ml.dataset
    """
    if name is None:
        return logging.getLogger(LOGGER_NAME)
    return logging.getLogger(f"{LOGGER_NAME}.{name}")
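The child-logger behavior relies on Python's standard dotted-name hierarchy: a logger named `deriva_ml.dataset` is automatically a child of `deriva_ml`, and inherits its effective level. A quick stdlib illustration (the names mirror those used by `get_logger`, but any dotted names behave the same):

```python
import logging

parent = logging.getLogger("deriva_ml")
child = logging.getLogger("deriva_ml.dataset")

# The dotted name makes 'deriva_ml.dataset' a child of 'deriva_ml'.
# A level set on the parent becomes the child's effective level,
# since the child's own level defaults to NOTSET.
parent.setLevel(logging.WARNING)
print(child.parent is parent)       # → True
print(child.getEffectiveLevel())    # → 30 (WARNING)
```

This is why configuring the main `deriva_ml` logger (as `configure_logging` does) is enough to control every sub-logger returned by `get_logger("...")`.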

is_hydra_initialized

is_hydra_initialized() -> bool

Check if running within an initialized Hydra context.

This is used to determine whether Hydra is managing logging configuration. When Hydra is initialized, we avoid adding handlers or calling basicConfig since Hydra has already configured logging via dictConfig.

Returns:

| Type | Description |
| ---- | ----------- |
| `bool` | True if Hydra's GlobalHydra is initialized, False otherwise. |

Example

    if is_hydra_initialized():
        # Hydra is managing logging
        pass

Source code in src/deriva_ml/core/logging_config.py
def is_hydra_initialized() -> bool:
    """Check if running within an initialized Hydra context.

    This is used to determine whether Hydra is managing logging configuration.
    When Hydra is initialized, we avoid adding handlers or calling basicConfig
    since Hydra has already configured logging via dictConfig.

    Returns:
        True if Hydra's GlobalHydra is initialized, False otherwise.

    Example:
        >>> if is_hydra_initialized():
        ...     # Hydra is managing logging
        ...     pass
    """
    try:
        from hydra.core.global_hydra import GlobalHydra

        return GlobalHydra.instance().is_initialized()
    except Exception:  # ImportError when hydra is absent, or any hydra API error
        return False
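The same optional-dependency check can be written generically: attempt the import and report `False` when the package is missing. A hedged sketch using the stdlib (the helper name and probed modules are illustrative):

```python
import importlib

def has_package(name: str) -> bool:
    """Return True if the named package can be imported, False otherwise."""
    try:
        importlib.import_module(name)
        return True
    except ImportError:  # covers ModuleNotFoundError as well
        return False

print(has_package("logging"))              # stdlib module, always importable
print(has_package("no_such_package_xyz"))  # missing package
```

`is_hydra_initialized` goes one step further: after a successful import it also asks `GlobalHydra` whether a Hydra context is actually active, since merely having hydra installed does not mean it is managing logging.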