DerivaML Class
The DerivaML class provides a range of methods to interact with a Deriva catalog.
These methods assume that the catalog contains both a deriva-ml schema and a domain schema.
Data Catalog: The catalog must include both the domain schema and a standard ML schema for effective data management.

- Domain schema: The domain schema includes the data collected or generated by domain-specific experiments or systems.
- ML schema: Each entity in the ML schema captures details of the ML development process. It includes the following tables:
  - A Dataset represents a data collection, such as an aggregation identified for training, validation, or testing purposes.
  - A Workflow represents a specific sequence of computational steps or human interactions.
  - An Execution is an instance of a workflow that a user instantiates at a specific time.
  - An Execution Asset is an output file that results from the execution of a workflow.
  - An Execution Metadata is an asset entity for saving metadata files that reference a given execution.
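The relationships among these ML schema entities can be sketched as plain data structures. This is a toy illustration of how the tables relate, not the actual catalog model (all class and field names here are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class Workflow:
    """A specific sequence of computational steps or human interactions."""
    name: str

@dataclass
class Execution:
    """An instance of a workflow, instantiated at a specific time."""
    workflow: Workflow
    started_at: str
    asset_files: list[str] = field(default_factory=list)     # Execution Asset outputs
    metadata_files: list[str] = field(default_factory=list)  # Execution Metadata files

@dataclass
class Dataset:
    """A data collection, e.g. an aggregation used for training or testing."""
    members: list[str]
```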
Core module for DerivaML.
This module provides the primary public interface to DerivaML functionality. It exports the main DerivaML class along with configuration, definitions, and exceptions needed for interacting with Deriva-based ML catalogs.
Key exports
- DerivaML: Main class for catalog operations and ML workflow management.
- DerivaMLConfig: Configuration class for DerivaML instances.
- Exceptions: DerivaMLException and specialized exception types.
- Definitions: Type definitions, enums, and constants used throughout the package.
Example
```python
from deriva_ml.core import DerivaML, DerivaMLConfig

ml = DerivaML('deriva.example.org', 'my_catalog')
datasets = ml.find_datasets()
```
BuiltinTypes
module-attribute
BuiltinTypes = BuiltinType
Alias for BuiltinType from deriva.core.typed.
This maintains backwards compatibility with existing DerivaML code that uses the plural form 'BuiltinTypes'. New code should use BuiltinType directly.
ColumnDefinition
module-attribute
ColumnDefinition = ColumnDef
Alias for ColumnDef from deriva.core.typed.
This maintains backwards compatibility with existing DerivaML code. New code should use ColumnDef directly.
TableDefinition
module-attribute
TableDefinition = TableDef
Alias for TableDef from deriva.core.typed.
This maintains backwards compatibility with existing DerivaML code. New code should use TableDef directly.
DerivaML
Bases: PathBuilderMixin, RidResolutionMixin, VocabularyMixin, WorkflowMixin, FeatureMixin, DatasetMixin, AssetMixin, ExecutionMixin, FileMixin, AnnotationMixin, DerivaMLCatalog
Core class for machine learning operations on a Deriva catalog.
This class provides core functionality for managing ML workflows, features, and datasets in a Deriva catalog. It handles data versioning, feature management, vocabulary control, and execution tracking.
Attributes:
| Name | Type | Description |
|---|---|---|
| host_name | str | Hostname of the Deriva server (e.g., 'deriva.example.org'). |
| catalog_id | Union[str, int] | Catalog identifier or name. |
| domain_schema | str | Schema name for domain-specific tables and relationships. |
| model | DerivaModel | ERMRest model for the catalog. |
| working_dir | Path | Directory for storing computation data and results. |
| cache_dir | Path | Directory for caching downloaded datasets. |
| ml_schema | str | Schema name for ML-specific tables (default: 'deriva_ml'). |
| configuration | ExecutionConfiguration | Current execution configuration. |
| project_name | str | Name of the current project. |
| start_time | datetime | Timestamp when this instance was created. |
| status | str | Current status of operations. |
Example
```python
ml = DerivaML('deriva.example.org', 'my_catalog')
ml.create_feature('my_table', 'new_feature')
ml.add_term('vocabulary_table', 'new_term', description='Description of term')
```
Source code in src/deriva_ml/core/base.py
catalog_provenance
property
catalog_provenance: CatalogProvenance | None
Get the provenance information for this catalog.
Returns provenance information if the catalog has it set. This includes information about how the catalog was created (clone, create, schema), who created it, when, and any workflow information.
For cloned catalogs, additional details about the clone operation are
available in the clone_details attribute.
Returns:
| Type | Description |
|---|---|
| CatalogProvenance \| None | CatalogProvenance if available, None otherwise. |
Example
```python
ml = DerivaML('localhost', '45')
prov = ml.catalog_provenance
if prov:
    print(f"Created: {prov.created_at} by {prov.created_by}")
    print(f"Method: {prov.creation_method.value}")
    if prov.is_clone:
        print(f"Cloned from: {prov.clone_details.source_hostname}")
```
working_data
property
working_data
Access the working data cache for this catalog.
Returns a WorkingDataCache backed by a SQLite database in the working directory. Use this to cache catalog query results (tables, denormalized views, feature values) for reuse across scripts.
Example:

```python
# Cache a full table
df = ml.cache_table("Subject")

# Check what's cached
ml.working_data.list_tables()

# Clear the cache
ml.working_data.clear()
```
__del__
__del__() -> None
Cleanup method to handle incomplete executions.
Source code in src/deriva_ml/core/base.py
__init__
```python
__init__(
    hostname: str,
    catalog_id: str | int,
    domain_schemas: str | set[str] | None = None,
    default_schema: str | None = None,
    project_name: str | None = None,
    cache_dir: str | Path | None = None,
    working_dir: str | Path | None = None,
    hydra_runtime_output_dir: str | Path | None = None,
    ml_schema: str = ML_SCHEMA,
    logging_level: int = logging.WARNING,
    deriva_logging_level: int = logging.WARNING,
    credential: dict | None = None,
    s3_bucket: str | None = None,
    use_minid: bool | None = None,
    check_auth: bool = True,
    clean_execution_dir: bool = True,
) -> None
```
Initializes a DerivaML instance.
This method will connect to a catalog and initialize local configuration for the ML execution. This class is intended to be used as a base class on which domain-specific interfaces are built.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| hostname | str | Hostname of the Deriva server. | required |
| catalog_id | str \| int | Catalog ID. Either an identifier or a catalog name. | required |
| domain_schemas | str \| set[str] \| None | Optional set of domain schema names. If None, auto-detects all non-system schemas. Use this when working with catalogs that have multiple user-defined schemas. | None |
| default_schema | str \| None | The default schema for table creation operations. If None and there is exactly one domain schema, that schema is used. If there are multiple domain schemas, this must be specified for table creation to work without explicit schema parameters. | None |
| ml_schema | str | Schema name for the ML schema. Used if you have a non-standard configuration of deriva-ml. | ML_SCHEMA |
| project_name | str \| None | Project name. Defaults to the name of default_schema. | None |
| cache_dir | str \| Path \| None | Directory path for caching data downloaded from the Deriva server as bdbags. If not provided, defaults to working_dir. | None |
| working_dir | str \| Path \| None | Directory path for storing data used by or generated by any computations. If no value is provided, defaults to ${HOME}/deriva_ml. | None |
| s3_bucket | str \| None | S3 bucket URL for dataset bag storage (e.g., 's3://my-bucket'). If provided, enables MINID creation and S3 upload for dataset exports. If None, MINID functionality is disabled regardless of the use_minid setting. | None |
| use_minid | bool \| None | Use the MINID service when downloading dataset bags. Only effective when s3_bucket is configured. If None (default), automatically set to True when s3_bucket is provided, False otherwise. | None |
| check_auth | bool | Check whether the user has access to the catalog. | True |
| clean_execution_dir | bool | Whether to automatically clean up execution working directories after a successful upload. Set to False to retain local copies. | True |
Source code in src/deriva_ml/core/base.py
add_dataset_element_type
add_dataset_element_type(
element: str | Table,
) -> Table
Makes it possible to add objects from the specified table to a dataset.
A dataset is a heterogeneous collection of objects, each of which comes from a different table. This routine adds the specified table as a valid element type for datasets.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| element | str \| Table | Name of the table or table object that is to be added to the dataset. | required |
Returns:
| Type | Description |
|---|---|
| Table | The table object that was added to the dataset. |
Source code in src/deriva_ml/core/mixins/dataset.py
add_features
add_features(
features: list[FeatureRecord],
) -> int
Add feature values to the catalog in batch.
Inserts a list of FeatureRecord instances into the appropriate feature table.
All records must be from the same feature (i.e., created by the same
feature_record_class()). Records are batch-inserted for efficiency.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| features | list[FeatureRecord] | List of FeatureRecord instances to insert. All must share the same feature definition. | required |
Returns:
| Type | Description |
|---|---|
| int | Number of feature records inserted. |
Raises:
| Type | Description |
|---|---|
| ValueError | If the features list is empty. |
Example
```python
feature = ml.lookup_feature("Image", "Classification")
RecordClass = feature.feature_record_class()
records = [
    RecordClass(Image="1-ABC", Image_Class="Normal", Execution=exe_rid),
    RecordClass(Image="1-DEF", Image_Class="Abnormal", Execution=exe_rid),
]
count = ml.add_features(records)
print(f"Inserted {count} feature values")
```
Source code in src/deriva_ml/core/mixins/feature.py
add_files
```python
add_files(
    files: Iterable[FileSpec],
    execution_rid: RID,
    dataset_types: str | list[str] | None = None,
    description: str = "",
) -> "Dataset"
```
Adds files to the catalog with their metadata.
Registers files in the catalog along with their metadata (MD5, length, URL) and associates them with specified file types. Links files to the specified execution record for provenance tracking.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| files | Iterable[FileSpec] | File specifications containing MD5 checksum, length, and URL. | required |
| execution_rid | RID | Execution RID to associate files with (required for provenance). | required |
| dataset_types | str \| list[str] \| None | One or more dataset type terms from the File_Type vocabulary. | None |
| description | str | Description of the files. | '' |
Returns:
| Name | Type | Description |
|---|---|---|
| Dataset | Dataset | Dataset that represents the newly added files. |
Raises:
| Type | Description |
|---|---|
| DerivaMLException | If file_types are invalid or execution_rid is not an execution record. |
Examples:
Add files via an execution:

```python
with ml.create_execution(config) as exe:
    files = [FileSpec(url="path/to/file.txt", md5="abc123", length=1000)]
    dataset = exe.add_files(files, dataset_types="text")
```
Source code in src/deriva_ml/core/mixins/file.py
add_page
add_page(title: str, content: str) -> None
Adds page to web interface.
Creates a new page in the catalog's web interface with the specified title and content. The page will be accessible through the catalog's navigation system.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| title | str | The title of the page to be displayed in navigation and headers. | required |
| content | str | The main content of the page; can include HTML markup. | required |
Raises:
| Type | Description |
|---|---|
| DerivaMLException | If the page creation fails or the user lacks necessary permissions. |
Example
```python
ml.add_page(
    title="Analysis Results",
    content="<h1>Results</h1><p>Analysis completed successfully...</p>",
)
```
Source code in src/deriva_ml/core/base.py
add_term
add_term(
table: str | Table,
term_name: str,
description: str,
synonyms: list[str] | None = None,
exists_ok: bool = True,
) -> VocabularyTermHandle
Adds a term to a vocabulary table.
Creates a new standardized term with description and optional synonyms in a vocabulary table. Can either create a new term or return an existing one if it already exists.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| table | str \| Table | Vocabulary table to add the term to (name or Table object). | required |
| term_name | str | Primary name of the term (must be unique within the vocabulary). | required |
| description | str | Explanation of the term's meaning and usage. | required |
| synonyms | list[str] \| None | Alternative names for the term. | None |
| exists_ok | bool | If True, return the existing term if found. If False, raise an error. | True |
Returns:
| Name | Type | Description |
|---|---|---|
| VocabularyTermHandle | VocabularyTermHandle | Object representing the created or existing term, with methods to modify it in the catalog. |
Raises:
| Type | Description |
|---|---|
| DerivaMLException | If the term exists and exists_ok=False, or if the table is not a vocabulary table. |
Examples:
Add a new tissue type:

```python
term = ml.add_term(
    table="tissue_types",
    term_name="epithelial",
    description="Epithelial tissue type",
    synonyms=["epithelium"],
)
# Modify the term
term.description = "Updated description"
term.synonyms = ("epithelium", "epithelial_tissue")
```

Attempt to add an existing term:

```python
term = ml.add_term("tissue_types", "epithelial", "...", exists_ok=True)
```
Source code in src/deriva_ml/core/mixins/vocabulary.py
add_visible_column
```python
add_visible_column(
    table: str | Table,
    context: str,
    column: str | list[str] | dict[str, Any],
    position: int | None = None,
) -> list[Any]
```
Add a column to the visible-columns list for a specific context.
Convenience method for adding columns without replacing the entire visible-columns annotation. Changes are staged until apply_annotations() is called.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| table | str \| Table | Table name or Table object. | required |
| context | str | The context to modify (e.g., "compact", "detailed", "entry"). | required |
| column | str \| list[str] \| dict[str, Any] | Column to add. Can be a string column name (e.g., "Filename"), a list giving a foreign key reference (e.g., ["schema", "fkey_name"]), or a dict pseudo-column definition. | required |
| position | int \| None | Position to insert at (0-indexed). If None, appends to the end. | None |
Returns:
| Type | Description |
|---|---|
| list[Any] | The updated column list for the context. |
Raises:
| Type | Description |
|---|---|
| DerivaMLException | If the context references another context. |
Example
```python
ml.add_visible_column("Image", "compact", "Description")
ml.add_visible_column("Image", "detailed", ["domain", "Image_Subject_fkey"], 1)
ml.apply_annotations()
```
Source code in src/deriva_ml/core/mixins/annotation.py
add_visible_foreign_key
```python
add_visible_foreign_key(
    table: str | Table,
    context: str,
    foreign_key: list[str] | dict[str, Any],
    position: int | None = None,
) -> list[Any]
```
Add a foreign key to the visible-foreign-keys list for a specific context.
Convenience method for adding related tables without replacing the entire visible-foreign-keys annotation. Changes are staged until apply_annotations() is called.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| table | str \| Table | Table name or Table object. | required |
| context | str | The context to modify (typically "detailed" or "*"). | required |
| foreign_key | list[str] \| dict[str, Any] | Foreign key to add. Can be a list giving an inbound foreign key reference (e.g., ["schema", "Other_Table_fkey"]) or a dict pseudo-column definition for complex relationships. | required |
| position | int \| None | Position to insert at (0-indexed). If None, appends to the end. | None |
Returns:
| Type | Description |
|---|---|
| list[Any] | The updated foreign key list for the context. |
Raises:
| Type | Description |
|---|---|
| DerivaMLException | If the context references another context. |
Example
```python
ml.add_visible_foreign_key("Subject", "detailed", ["domain", "Image_Subject_fkey"])
ml.apply_annotations()
```
Source code in src/deriva_ml/core/mixins/annotation.py
add_workflow
add_workflow(workflow: Workflow) -> RID
Adds a workflow to the catalog.
Registers a new workflow in the catalog or returns the RID of an existing workflow with the same URL or checksum.
Each workflow represents a specific computational process or analysis pipeline.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| workflow | Workflow | Workflow object containing name, URL, type, version, and description. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| RID | RID | Resource Identifier of the added or existing workflow. |
Raises:
| Type | Description |
|---|---|
| DerivaMLException | If workflow insertion fails or required fields are missing. |
Examples:
```python
workflow = Workflow(
    name="Gene Analysis",
    url="https://github.com/org/repo/workflows/gene_analysis.py",
    workflow_type="python_script",
    version="1.0.0",
    description="Analyzes gene expression patterns",
)
workflow_rid = ml.add_workflow(workflow)
```
Source code in src/deriva_ml/core/mixins/workflow.py
apply_annotations
apply_annotations() -> None
Apply all staged annotation changes to the catalog.
Commits any annotation changes made via set_display_annotation, set_visible_columns, set_visible_foreign_keys, set_table_display, or set_column_display to the remote catalog.
Example
```python
ml.set_display_annotation("Image", {"name": "Images"})
ml.set_visible_columns("Image", {"compact": ["RID", "Filename"]})
ml.apply_annotations()  # Commit all changes
```
Source code in src/deriva_ml/core/mixins/annotation.py
apply_catalog_annotations
apply_catalog_annotations(
navbar_brand_text: str = "ML Data Browser",
head_title: str = "Catalog ML",
) -> None
Apply catalog-level annotations including the navigation bar and display settings.
This method configures the Chaise web interface for the catalog. Chaise is Deriva's web-based data browser that provides a user-friendly interface for exploring and managing catalog data. This method sets up annotations that control how Chaise displays and organizes the catalog.
Navigation Bar Structure: The method creates a navigation bar with the following menus: - User Info: Links to Users, Groups, and RID Lease tables - Deriva-ML: Core ML tables (Workflow, Execution, Dataset, Dataset_Version, etc.) - WWW: Web content tables (Page, File) - {Domain Schema}: All domain-specific tables (excludes vocabularies and associations) - Vocabulary: All controlled vocabulary tables from both ML and domain schemas - Assets: All asset tables from both ML and domain schemas - Features: All feature tables with entries named "TableName:FeatureName" - Catalog Registry: Link to the ermrest registry - Documentation: Links to ML notebook instructions and Deriva-ML docs
Display Settings: - Underscores in table/column names displayed as spaces - System columns (RID) shown in compact and entry views - Default table set to Dataset - Faceting and record deletion enabled - Export configurations available to all users
Bulk Upload Configuration: Configures upload patterns for asset tables, enabling drag-and-drop file uploads through the Chaise interface.
Call this after creating the domain schema and all tables to initialize the catalog's web interface. The navigation menus are dynamically built based on the current schema structure, automatically organizing tables into appropriate categories.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| navbar_brand_text | str | Text displayed in the navigation bar brand area. | 'ML Data Browser' |
| head_title | str | Title displayed in the browser tab. | 'Catalog ML' |
Example
```python
ml = DerivaML('deriva.example.org', 'my_catalog')
# After creating the domain schema and tables...
ml.apply_catalog_annotations()

# Or with custom branding:
ml.apply_catalog_annotations("My Project Browser", "My ML Project")
```
Source code in src/deriva_ml/core/base.py
asset_record_class
asset_record_class(
asset_table_name: str,
) -> type
Create a dynamically generated Pydantic model for an asset table's metadata.
The returned class is a subclass of AssetRecord with fields derived from the asset table's metadata columns (non-system, non-standard-asset columns). Fields are typed according to their database column type, and nullable columns are Optional.
Follows the same pattern as Feature.feature_record_class().
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| asset_table_name | str | Name of the asset table (e.g., "Image", "Model"). | required |
Returns:
| Type | Description |
|---|---|
| type | An AssetRecord subclass with validated fields matching the table's metadata. |
Example
```python
ImageAsset = ml.asset_record_class("Image")
record = ImageAsset(Subject="2-DEF", Acquisition_Date="2026-01-15")
path = exe.asset_file_path("Image", "scan.jpg", metadata=record)
```
Source code in src/deriva_ml/core/mixins/asset.py
bag_info
bag_info(
dataset: "DatasetSpec",
) -> dict[str, Any]
Get comprehensive info about a dataset bag: size, contents, and cache status.
Combines the size estimate with local cache status. Use this to decide whether to prefetch a bag before running an experiment.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dataset | DatasetSpec | Specification of the dataset, including version and optional exclude_tables. | required |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | dict with keys: tables (dict mapping table name to {row_count, is_asset, asset_bytes}); total_rows, total_asset_bytes, total_asset_size; cache_status (one of "not_cached", "cached_metadata_only", "cached_materialized", "cached_incomplete"); cache_path (local path to the cached bag if cached, else None). |
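Assuming the dict layout documented above, a minimal sketch of using bag_info to decide whether to warm the cache before an experiment (the should_prefetch helper is illustrative, not part of the DerivaML API):

```python
def should_prefetch(info: dict) -> bool:
    """Return True when the bag's assets are not fully cached locally."""
    # "cached_materialized" is the only status where all assets are on disk.
    return info["cache_status"] != "cached_materialized"

# Typical use before an experiment (ml and spec come from your session):
# info = ml.bag_info(spec)
# if should_prefetch(info):
#     ml.cache_dataset(spec, materialize=True)
```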
Source code in src/deriva_ml/core/mixins/dataset.py
cache_dataset
cache_dataset(
dataset: "DatasetSpec",
materialize: bool = True,
) -> dict[str, Any]
Download a dataset bag into the local cache without creating an execution.
Use this to warm the cache before running experiments. No execution or provenance records are created.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| dataset | 'DatasetSpec' | Specification of the dataset, including version and optional exclude_tables. | required |
| materialize | bool | If True (default), download all asset files. If False, download only table metadata. | True |

Returns:

| Type | Description |
|---|---|
| dict[str, Any] | dict with bag_info results after caching. |
Source code in src/deriva_ml/core/mixins/dataset.py
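For reporting how much a warmed cache holds, a small byte formatter can render the `total_asset_bytes` key from the returned bag_info results. The `fmt_bytes` helper is hypothetical; only the dict keys come from the docs above.

```python
def fmt_bytes(n: float) -> str:
    """Render a byte count as a human-readable string."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n < 1024 or unit == "TB":
            return f"{n:.1f} {unit}"
        n /= 1024

# Usage (assumed names):
# info = ml.cache_dataset(DatasetSpec(rid="1-ABC"), materialize=True)
# print(f"Cached {fmt_bytes(info['total_asset_bytes'])} of assets")
print(fmt_bytes(2048))         # 2.0 KB
print(fmt_bytes(3 * 1024**3))  # 3.0 GB
```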
cache_features
cache_features(
table_name: str,
feature_name: str,
force: bool = False,
**kwargs,
) -> "pd.DataFrame"
Fetch feature values from the catalog and cache locally.
On first call, fetches all feature values and stores in the working data cache. Subsequent calls return cached data.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| table_name | str | Table the feature is attached to (e.g., "Image"). | required |
| feature_name | str | Name of the feature (e.g., "Classification"). | required |
| force | bool | If True, re-fetch even if already cached. | False |
| **kwargs | | Additional arguments passed to | {} |

Returns:

| Type | Description |
|---|---|
| 'pd.DataFrame' | DataFrame with feature value records. |
Example::
labels = ml.cache_features("Image", "Classification")
print(labels["Diagnosis_Type"].value_counts())
Source code in src/deriva_ml/core/base.py
cache_table
cache_table(
table_name: str, force: bool = False
) -> "pd.DataFrame"
Fetch a table from the catalog and cache locally as SQLite.
On first call, fetches all rows from the catalog and stores in the
working data cache. Subsequent calls return the cached data without
contacting the catalog. Use force=True to re-fetch.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| table_name | str | Name of the table to fetch (e.g., "Subject", "Image"). | required |
| force | bool | If True, re-fetch even if already cached. | False |

Returns:

| Type | Description |
|---|---|
| 'pd.DataFrame' | DataFrame with the table contents. |
Example::
subjects = ml.cache_table("Subject")
print(f"{len(subjects)} subjects")
# Second call returns cached data instantly
subjects = ml.cache_table("Subject")
Source code in src/deriva_ml/core/base.py
catalog_snapshot
catalog_snapshot(
version_snapshot: str,
) -> Self
Return a new DerivaML instance connected to a specific catalog snapshot.
Catalog snapshots provide a read-only, point-in-time view of the catalog. The snapshot identifier is typically obtained from a dataset version record.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| version_snapshot | str | Snapshot identifier string (e.g., …). | required |

Returns:

| Type | Description |
|---|---|
| Self | A new DerivaML instance connected to the specified catalog snapshot. |
Source code in src/deriva_ml/core/base.py
chaise_url
chaise_url(
table: RID | Table | str,
) -> str
Generates Chaise web interface URL.
Chaise is Deriva's web interface for data exploration. This method creates a URL that directly links to the specified table or record.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| table | RID \| Table \| str | Table to generate URL for (name, Table object, or RID). | required |

Returns:

| Name | Type | Description |
|---|---|---|
| str | str | URL in format: https://{host}/chaise/recordset/#{catalog}/{schema}:{table} |

Raises:

| Type | Description |
|---|---|
| DerivaMLException | If the table or RID cannot be found. |

Examples:

Using a table name:
>>> ml.chaise_url("experiment_table")
'https://deriva.org/chaise/recordset/#1/schema:experiment_table'

Using a RID:
>>> ml.chaise_url("1-abc123")
Source code in src/deriva_ml/core/base.py
cite
cite(
entity: Dict[str, Any] | str,
current: bool = False,
) -> str
Generates citation URL for an entity.
Creates a URL that can be used to reference a specific entity in the catalog. By default, includes the catalog snapshot time to ensure version stability (permanent citation). With current=True, returns a URL to the current state.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| entity | Dict[str, Any] \| str | Either a RID string or a dictionary containing entity data with a 'RID' key. | required |
| current | bool | If True, return URL to current catalog state (no snapshot). If False (default), return permanent citation URL with snapshot time. | False |

Returns:

| Name | Type | Description |
|---|---|---|
| str | str | Citation URL. Format depends on the current flag (see examples). |

Raises:

| Type | Description |
|---|---|
| DerivaMLException | If the entity doesn't exist or lacks a RID. |

Examples:

Permanent citation (default):
>>> url = ml.cite("1-abc123")
>>> print(url)
'https://deriva.org/id/1/1-abc123@2024-01-01T12:00:00'

Current catalog URL:
>>> url = ml.cite("1-abc123", current=True)
>>> print(url)
'https://deriva.org/id/1/1-abc123'

Using a dictionary:
>>> url = ml.cite({"RID": "1-abc123"})
Source code in src/deriva_ml/core/base.py
clean_execution_dirs
clean_execution_dirs(
older_than_days: int | None = None,
exclude_rids: list[str]
| None = None,
) -> dict[str, int]
Clean up execution working directories.
Removes execution output directories from the local working directory. Use this to free up disk space from completed or orphaned executions.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| older_than_days | int \| None | If provided, only remove directories older than this many days. If None, removes all execution directories (except excluded). | None |
| exclude_rids | list[str] \| None | List of execution RIDs to preserve (never remove). | None |

Returns:

| Type | Description |
|---|---|
| dict[str, int] | dict with keys: 'dirs_removed' (number of directories removed), 'bytes_freed' (total bytes freed), and 'errors' (number of removal errors). |

Example:

>>> ml = DerivaML('deriva.example.org', 'my_catalog')
>>> # Clean all execution dirs older than 30 days
>>> result = ml.clean_execution_dirs(older_than_days=30)
>>> print(f"Freed {result['bytes_freed'] / 1e9:.2f} GB")
>>> # Clean all except specific executions
>>> result = ml.clean_execution_dirs(exclude_rids=['1-ABC', '1-DEF'])
Source code in src/deriva_ml/core/base.py
clear_cache
clear_cache(
older_than_days: int | None = None,
) -> dict[str, int]
Clear the dataset cache directory.
Removes cached dataset bags from the cache directory. Can optionally filter by age to only remove old cache entries.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| older_than_days | int \| None | If provided, only remove cache entries older than this many days. If None, removes all cache entries. | None |

Returns:

| Type | Description |
|---|---|
| dict[str, int] | dict with keys: 'files_removed' (number of files removed), 'dirs_removed' (number of directories removed), 'bytes_freed' (total bytes freed), and 'errors' (number of removal errors). |

Example:

>>> ml = DerivaML('deriva.example.org', 'my_catalog')
>>> # Clear all cache
>>> result = ml.clear_cache()
>>> print(f"Freed {result['bytes_freed'] / 1e6:.1f} MB")
>>> # Clear cache older than 7 days
>>> result = ml.clear_cache(older_than_days=7)
Source code in src/deriva_ml/core/base.py
clear_vocabulary_cache
clear_vocabulary_cache(
table: str | Table | None = None,
) -> None
Clear the vocabulary term cache.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| table | str \| Table \| None | If provided, only clear the cache for this specific vocabulary table. If None, clear the entire cache. | None |
Source code in src/deriva_ml/core/mixins/vocabulary.py
create_asset
create_asset(
asset_name: str,
column_defs: Iterable[
ColumnDefinition
]
| None = None,
fkey_defs: Iterable[
ColumnDefinition
]
| None = None,
referenced_tables: Iterable[Table]
| None = None,
comment: str = "",
schema: str | None = None,
update_navbar: bool = True,
) -> Table
Creates an asset table.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| asset_name | str | Name of the asset table. | required |
| column_defs | Iterable[ColumnDefinition] \| None | Iterable of ColumnDefinition objects to provide additional metadata for the asset. | None |
| fkey_defs | Iterable[ColumnDefinition] \| None | Iterable of ForeignKeyDefinition objects to provide additional metadata for the asset. | None |
| referenced_tables | Iterable[Table] \| None | Iterable of Table objects to which the asset should provide foreign-key references. | None |
| comment | str | Description of the asset table. | '' |
| schema | str \| None | Schema in which to create the asset table. Defaults to domain_schema. | None |
| update_navbar | bool | If True (default), automatically updates the navigation bar to include the new asset table. Set to False during batch asset creation to avoid redundant updates, then call apply_catalog_annotations() once at the end. | True |

Returns:

| Type | Description |
|---|---|
| Table | Table object for the asset table. |
Source code in src/deriva_ml/core/mixins/asset.py
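The batch-creation flow enabled by update_navbar can be sketched with a stand-in object, so the call pattern is visible without a live catalog. FakeML is hypothetical; only the create_asset/apply_catalog_annotations pattern comes from the parameter description above.

```python
class FakeML:
    """Stand-in that records calls, mimicking the documented call pattern."""
    def __init__(self):
        self.calls = []
    def create_asset(self, asset_name, update_navbar=True, **kwargs):
        self.calls.append(("create_asset", asset_name, update_navbar))
    def apply_catalog_annotations(self):
        self.calls.append(("apply_catalog_annotations",))

ml = FakeML()
for name in ("Image", "Model", "Report"):
    ml.create_asset(name, update_navbar=False)  # defer navbar updates
ml.apply_catalog_annotations()                  # refresh the navbar once
print(len(ml.calls))  # 4
```

Deferring the navbar update avoids one annotation round-trip per table during bulk schema setup.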
create_execution
create_execution(
configuration: ExecutionConfiguration,
workflow: "Workflow | RID | None" = None,
dry_run: bool = False,
) -> "Execution"
Create an execution environment.
Initializes a local compute environment for executing an ML or analytic routine. This has several side effects:
- Downloads datasets specified in the configuration to the cache directory. If no version is specified, creates a new minor version for the dataset.
- Downloads any execution assets to the working directory.
- Creates an execution record in the catalog (unless dry_run=True).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| configuration | ExecutionConfiguration | ExecutionConfiguration specifying execution parameters. | required |
| workflow | 'Workflow \| RID \| None' | Optional Workflow object or RID if not present in the configuration. | None |
| dry_run | bool | If True, skip creating catalog records and uploading results. | False |

Returns:

| Name | Type | Description |
|---|---|---|
| Execution | 'Execution' | An execution object for managing the execution lifecycle. |

Example:

>>> config = ExecutionConfiguration(
...     workflow=workflow,
...     description="Process samples",
...     datasets=[DatasetSpec(rid="4HM")],
... )
>>> with ml.create_execution(config) as execution:
...     # Run analysis
...     pass
>>> execution.upload_execution_outputs()
Source code in src/deriva_ml/core/mixins/execution.py
create_feature
create_feature(
target_table: Table | str,
feature_name: str,
terms: list[Table | str]
| None = None,
assets: list[Table | str]
| None = None,
metadata: list[
ColumnDefinition
| Table
| Key
| str
]
| None = None,
optional: list[str] | None = None,
comment: str = "",
update_navbar: bool = True,
) -> type[FeatureRecord]
Creates a new feature definition.
A feature represents a measurable property or characteristic that can be associated with records in the target table. Features can include vocabulary terms, asset references, and additional metadata.
Side Effects: This method dynamically creates:

1. A new association table in the domain schema to store feature values.
2. A Pydantic model class (subclass of FeatureRecord) for creating validated feature instances.
The returned Pydantic model class provides type-safe construction of feature records with automatic validation of values against the feature's definition (vocabulary terms, asset references, etc.). Use this class to create feature instances that can be inserted into the catalog.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| target_table | Table \| str | Table to associate the feature with (name or Table object). | required |
| feature_name | str | Unique name for the feature within the target table. | required |
| terms | list[Table \| str] \| None | Optional vocabulary tables/names whose terms can be used as feature values. | None |
| assets | list[Table \| str] \| None | Optional asset tables/names that can be referenced by this feature. | None |
| metadata | list[ColumnDefinition \| Table \| Key \| str] \| None | Optional columns, tables, or keys to include in the feature definition. | None |
| optional | list[str] \| None | Column names that are not required when creating feature instances. | None |
| comment | str | Description of the feature's purpose and usage. | '' |
| update_navbar | bool | If True (default), automatically updates the navigation bar to include the new feature table. Set to False during batch feature creation to avoid redundant updates, then call apply_catalog_annotations() once at the end. | True |

Returns:

| Type | Description |
|---|---|
| type[FeatureRecord] | A dynamically generated Pydantic model class for creating validated feature instances. The class has fields corresponding to the feature's terms, assets, and metadata columns. |

Raises:

| Type | Description |
|---|---|
| DerivaMLException | If the feature definition is invalid or conflicts with existing features. |

Examples:

Create a feature with a confidence score:
>>> DiagnosisFeature = ml.create_feature(
...     target_table="Image",
...     feature_name="Diagnosis",
...     terms=["Diagnosis_Type"],
...     metadata=[ColumnDefinition(name="confidence", type=BuiltinTypes.float4)],
...     comment="Clinical diagnosis label"
... )
>>> # Use the returned class to create validated feature instances
>>> record = DiagnosisFeature(
...     Image="1-ABC",            # Target record RID
...     Diagnosis_Type="Normal",  # Vocabulary term
...     confidence=0.95,
...     Execution="2-XYZ"         # Execution that produced this value
... )
Source code in src/deriva_ml/core/mixins/feature.py
create_table
create_table(
table: TableDefinition,
schema: str | None = None,
update_navbar: bool = True,
) -> Table
Creates a new table in the domain schema.
Creates a table using the provided TableDefinition object, which specifies the table structure including columns, keys, and foreign key relationships. The table is created in the domain schema associated with this DerivaML instance.
Required Classes: Import the following classes from deriva_ml to define tables:
- TableDefinition: Defines the complete table structure
- ColumnDefinition: Defines individual columns with types and constraints
- KeyDefinition: Defines unique key constraints (optional)
- ForeignKeyDefinition: Defines foreign key relationships to other tables (optional)
- BuiltinTypes: Enum of available column data types
Available Column Types (BuiltinTypes enum):
text, int2, int4, int8, float4, float8, boolean,
date, timestamp, timestamptz, json, jsonb, markdown,
ermrest_uri, ermrest_rid, ermrest_rcb, ermrest_rmb,
ermrest_rct, ermrest_rmt
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| table | TableDefinition | A TableDefinition object containing the complete specification of the table to create. | required |
| update_navbar | bool | If True (default), automatically updates the navigation bar to include the new table. Set to False during batch table creation to avoid redundant updates, then call apply_catalog_annotations() once at the end. | True |

Returns:

| Name | Type | Description |
|---|---|---|
| Table | Table | The newly created ERMRest table object. |

Raises:

| Type | Description |
|---|---|
| DerivaMLException | If table creation fails or the definition is invalid. |
Examples:
Simple table with basic columns:
>>> from deriva_ml import TableDefinition, ColumnDefinition, BuiltinTypes
>>>
>>> table_def = TableDefinition(
... name="Experiment",
... column_defs=[
... ColumnDefinition(name="Name", type=BuiltinTypes.text, nullok=False),
... ColumnDefinition(name="Date", type=BuiltinTypes.date),
... ColumnDefinition(name="Description", type=BuiltinTypes.markdown),
... ColumnDefinition(name="Score", type=BuiltinTypes.float4),
... ],
... comment="Records of experimental runs"
... )
>>> experiment_table = ml.create_table(table_def)
Table with foreign key to another table:
>>> from deriva_ml import (
... TableDefinition, ColumnDefinition, ForeignKeyDefinition, BuiltinTypes
... )
>>>
>>> # Create a Sample table that references Subject
>>> sample_def = TableDefinition(
... name="Sample",
... column_defs=[
... ColumnDefinition(name="Name", type=BuiltinTypes.text, nullok=False),
... ColumnDefinition(name="Subject", type=BuiltinTypes.text, nullok=False),
... ColumnDefinition(name="Collection_Date", type=BuiltinTypes.date),
... ],
... fkey_defs=[
... ForeignKeyDefinition(
... colnames=["Subject"],
... pk_sname=ml.default_schema, # Schema of referenced table
... pk_tname="Subject", # Name of referenced table
... pk_colnames=["RID"], # Column(s) in referenced table
... on_delete="CASCADE", # Delete samples when subject deleted
... )
... ],
... comment="Biological samples collected from subjects"
... )
>>> sample_table = ml.create_table(sample_def)
Table with unique key constraint:
>>> from deriva_ml import (
... TableDefinition, ColumnDefinition, KeyDefinition, BuiltinTypes
... )
>>>
>>> protocol_def = TableDefinition(
... name="Protocol",
... column_defs=[
... ColumnDefinition(name="Name", type=BuiltinTypes.text, nullok=False),
... ColumnDefinition(name="Version", type=BuiltinTypes.text, nullok=False),
... ColumnDefinition(name="Description", type=BuiltinTypes.markdown),
... ],
... key_defs=[
... KeyDefinition(
... colnames=["Name", "Version"],
... constraint_names=[["myschema", "Protocol_Name_Version_key"]],
... comment="Each protocol name+version must be unique"
... )
... ],
... comment="Experimental protocols with versioning"
... )
>>> protocol_table = ml.create_table(protocol_def)
Batch creation without navbar updates:
>>> ml.create_table(table1_def, update_navbar=False)
>>> ml.create_table(table2_def, update_navbar=False)
>>> ml.create_table(table3_def, update_navbar=False)
>>> ml.apply_catalog_annotations() # Update navbar once at the end
Source code in src/deriva_ml/core/base.py
create_vocabulary
create_vocabulary(
vocab_name: str,
comment: str = "",
schema: str | None = None,
update_navbar: bool = True,
) -> Table
Creates a controlled vocabulary table.
A controlled vocabulary table maintains a list of standardized terms and their definitions. Each term can have synonyms and descriptions to ensure consistent terminology usage across the dataset.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| vocab_name | str | Name for the new vocabulary table. Must be a valid SQL identifier. | required |
| comment | str | Description of the vocabulary's purpose and usage. Defaults to empty string. | '' |
| schema | str \| None | Schema name to create the table in. If None, uses domain_schema. | None |
| update_navbar | bool | If True (default), automatically updates the navigation bar to include the new vocabulary table. Set to False during batch table creation to avoid redundant updates, then call apply_catalog_annotations() once at the end. | True |

Returns:

| Name | Type | Description |
|---|---|---|
| Table | Table | ERMRest table object representing the newly created vocabulary table. |

Raises:

| Type | Description |
|---|---|
| DerivaMLException | If vocab_name is invalid or already exists. |
Examples:
Create a vocabulary for tissue types:
>>> table = ml.create_vocabulary(
... vocab_name="tissue_types",
... comment="Standard tissue classifications",
... schema="bio_schema"
... )
Create multiple vocabularies without updating navbar until the end:
>>> ml.create_vocabulary("Species", update_navbar=False)
>>> ml.create_vocabulary("Tissue_Type", update_navbar=False)
>>> ml.apply_catalog_annotations() # Update navbar once
Source code in src/deriva_ml/core/base.py
create_workflow
create_workflow(
name: str,
workflow_type: str | list[str],
description: str = "",
) -> Workflow
Creates a new workflow definition.
Creates a Workflow object that represents a computational process or analysis pipeline. The workflow type(s) must be terms from the controlled vocabulary. This method is typically used to define new analysis workflows before execution.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Name of the workflow. | required |
| workflow_type | str \| list[str] | Type(s) of workflow (must exist in the workflow_type vocabulary). Can be a single string or a list of strings. | required |
| description | str | Description of what the workflow does. | '' |

Returns:

| Name | Type | Description |
|---|---|---|
| Workflow | Workflow | New workflow object ready for registration. |

Raises:

| Type | Description |
|---|---|
| DerivaMLException | If any workflow_type is not in the vocabulary. |
Examples:
>>> workflow = ml.create_workflow(
... name="RNA Analysis",
... workflow_type="python_notebook",
... description="RNA sequence analysis pipeline"
... )
>>> rid = ml.add_workflow(workflow)
Multiple types::
>>> workflow = ml.create_workflow(
... name="Training Pipeline",
... workflow_type=["Training", "Embedding"],
... description="Combined training and embedding pipeline"
... )
Source code in src/deriva_ml/core/mixins/workflow.py
define_association
define_association(
associates: list,
metadata: list | None = None,
table_name: str | None = None,
comment: str | None = None,
**kwargs,
) -> dict
Build an association table definition with vocab-aware key selection.
Creates a table definition that links two or more tables via an association (many-to-many) table. Non-vocabulary tables automatically use RID as the foreign key target, while vocabulary tables use their Name key.
Use with create_table() to create the association table in the catalog.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| associates | list | Tables to associate. Each item can be: a Table object; a (name, Table) tuple to customize the column name; a (name, nullok, Table) tuple for nullable references; or a Key object for explicit key selection. | required |
| metadata | list \| None | Additional metadata columns or reference targets. | None |
| table_name | str \| None | Name for the association table. Auto-generated if omitted. | None |
| comment | str \| None | Comment for the association table. | None |
| **kwargs | | Additional arguments passed to Table.define_association. | {} |

Returns:

| Type | Description |
|---|---|
| dict | Table definition dict suitable for create_table(). |
Example::
# Associate Image with Subject (many-to-many)
image_table = ml.model.name_to_table("Image")
subject_table = ml.model.name_to_table("Subject")
assoc_def = ml.define_association(
associates=[image_table, subject_table],
comment="Links images to subjects",
)
ml.create_table(assoc_def)
Source code in src/deriva_ml/core/base.py
delete_dataset
delete_dataset(
dataset: "Dataset",
recurse: bool = False,
) -> None
Soft-delete a dataset by marking it as deleted in the catalog.
Sets the Deleted flag on the dataset record. The dataset's data is
preserved but it will no longer appear in normal queries (e.g.,
find_datasets()). The dataset cannot be deleted if it is currently
nested inside a parent dataset.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| dataset | Dataset | The dataset to delete. | required |
| recurse | bool | If True, also soft-delete all nested child datasets. If False (default), only this dataset is marked as deleted. | False |

Raises:

| Type | Description |
|---|---|
| DerivaMLException | If the dataset RID is not a valid dataset, or if the dataset is nested inside a parent dataset. |
Source code in src/deriva_ml/core/mixins/dataset.py
delete_feature
delete_feature(
table: Table | str,
feature_name: str,
) -> bool
Removes a feature definition and its data.
Deletes the feature and its implementation table from the catalog. This operation cannot be undone and will remove all feature values associated with this feature.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| table | Table \| str | The table containing the feature, either as a name or Table object. | required |
| feature_name | str | Name of the feature to delete. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| bool | bool | True if the feature was successfully deleted, False if it didn't exist. |

Raises:

| Type | Description |
|---|---|
| DerivaMLException | If deletion fails due to constraints or permissions. |

Example:

>>> success = ml.delete_feature("samples", "obsolete_feature")
>>> print("Deleted" if success else "Not found")
Source code in src/deriva_ml/core/mixins/feature.py
delete_term
delete_term(
table: str | Table, term_name: str
) -> None
Delete a term from a vocabulary table.
Removes a term from the vocabulary. The term must not be in use by any records in the catalog (e.g., no datasets using this dataset type, no assets using this asset type).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| table | str \| Table | Vocabulary table containing the term (name or Table object). | required |
| term_name | str | Primary name of the term to delete. | required |

Raises:

| Type | Description |
|---|---|
| DerivaMLInvalidTerm | If the term doesn't exist in the vocabulary. |
| DerivaMLException | If the term is currently in use by other records. |

Example:

>>> ml.delete_term("Dataset_Type", "Obsolete_Type")
Source code in src/deriva_ml/core/mixins/vocabulary.py
domain_path
domain_path(
schema: str | None = None,
) -> datapath.DataPath
Returns path builder for a domain schema.
Provides a convenient way to access tables and construct queries within a domain-specific schema.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| schema | str \| None | Schema name to get a path builder for. If None, uses default_schema. | None |

Returns:

| Type | Description |
|---|---|
| DataPath | datapath._CatalogWrapper: Path builder object scoped to the specified domain schema. |

Raises:

| Type | Description |
|---|---|
| DerivaMLException | If no schema is specified and default_schema is not set. |

Example:

>>> domain = ml.domain_path()  # Uses the default schema
>>> results = domain.my_table.entities().fetch()

Or with an explicit schema:

>>> domain = ml.domain_path("my_schema")
Source code in src/deriva_ml/core/mixins/path_builder.py
download_dataset_bag
download_dataset_bag(
dataset: DatasetSpec,
) -> "DatasetBag"
Downloads a dataset to the local filesystem.
Downloads a dataset specified by DatasetSpec to the local filesystem. If the catalog has s3_bucket configured and use_minid is enabled, the bag will be uploaded to S3 and registered with the MINID service.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| dataset | DatasetSpec | Specification of the dataset to download, including version and materialization options. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| DatasetBag | 'DatasetBag' | Object containing: path (local filesystem path to the downloaded dataset), rid (the dataset's Resource Identifier), and minid (the dataset's Minimal Viable Identifier, if MINID is enabled). |

Note:
MINID support requires s3_bucket to be configured when creating the DerivaML instance. The catalog's use_minid setting controls whether MINIDs are created.

Examples:

Download with default options:
>>> spec = DatasetSpec(rid="1-abc123")
>>> bag = ml.download_dataset_bag(dataset=spec)
>>> print(f"Downloaded to {bag.path}")
Source code in src/deriva_ml/core/mixins/dataset.py
download_dir
download_dir(
cached: bool = False,
) -> Path
Returns the appropriate download directory.
Provides the appropriate directory path for storing downloaded files, either in the cache or working directory.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| cached | bool | If True, returns the cache directory path. If False, returns the working directory path. | False |

Returns:

| Name | Type | Description |
|---|---|---|
| Path | Path | Directory path where downloaded files should be stored. |

Example:

>>> cache_dir = ml.download_dir(cached=True)
>>> work_dir = ml.download_dir(cached=False)
Source code in src/deriva_ml/core/base.py
estimate_bag_size
estimate_bag_size(
dataset: "DatasetSpec",
) -> dict[str, Any]
Estimate the size of a dataset bag before downloading.
Generates the same download specification used by download_dataset_bag, then runs COUNT and SUM(Length) queries against the snapshot catalog to preview what a download will contain and how large it will be.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `dataset` | `'DatasetSpec'` | Specification of the dataset to estimate, including version and optional exclude_tables. | *required* |
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | dict with keys: `tables` (dict mapping table name to {row_count, is_asset, asset_bytes}), `total_rows` (total row count across all tables), `total_asset_bytes` (total size of asset files in bytes), `total_asset_size` (human-readable size string, e.g., "1.2 GB"). |
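The returned dict can be turned into a quick pre-download report. A minimal sketch, assuming only the documented key layout; the `summarize_estimate` helper and the sample estimate below are illustrative, not part of the DerivaML API:

```python
def summarize_estimate(estimate: dict) -> str:
    """Render an estimate_bag_size() result as a one-line-per-table report."""
    lines = []
    for name, info in estimate["tables"].items():
        kind = "asset" if info["is_asset"] else "table"
        lines.append(f"{name} ({kind}): {info['row_count']} rows")
    lines.append(f"TOTAL: {estimate['total_rows']} rows, {estimate['total_asset_size']}")
    return "\n".join(lines)

# Illustrative estimate matching the documented shape (not real catalog output).
sample = {
    "tables": {
        "Image": {"row_count": 1200, "is_asset": True, "asset_bytes": 5_000_000},
        "Subject": {"row_count": 300, "is_asset": False, "asset_bytes": 0},
    },
    "total_rows": 1500,
    "total_asset_bytes": 5_000_000,
    "total_asset_size": "5.0 MB",
}

print(summarize_estimate(sample))
```

With a live catalog, the `sample` dict would instead come from `ml.estimate_bag_size(dataset=spec)`.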
Source code in src/deriva_ml/core/mixins/dataset.py
feature_record_class
feature_record_class(
table: str | Table,
feature_name: str,
) -> type[FeatureRecord]
Returns a dynamically generated Pydantic model class for creating feature records.
Each feature has a unique set of columns based on its definition (terms, assets, metadata). This method returns a Pydantic class with fields corresponding to those columns, providing:
- Type validation: Values are validated against expected types (str, int, float, Path)
- Required field checking: Non-nullable columns must be provided
- Default values: Feature_Name is pre-filled with the feature's name
Field types in the generated class:
- {TargetTable} (str): Required. RID of the target record (e.g., Image RID)
- Execution (str, optional): RID of the execution for provenance tracking
- Feature_Name (str): Pre-filled with the feature name
- Term columns (str): Accept vocabulary term names
- Asset columns (str | Path): Accept asset RIDs or file paths
- Value columns: Accept values matching the column type (int, float, str)
Use lookup_feature() to inspect the feature's structure and see what columns
are available.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `str \| Table` | The table containing the feature, either as name or Table object. | *required* |
| `feature_name` | `str` | Name of the feature to create a record class for. | *required* |
Returns:
| Type | Description |
|---|---|
| `type[FeatureRecord]` | A Pydantic model class for creating validated feature records, named after the feature. |
Raises:
| Type | Description |
|---|---|
| `DerivaMLException` | If the feature doesn't exist or the table is invalid. |
Example
    # Get the dynamically generated class
    DiagnosisFeature = ml.feature_record_class("Image", "Diagnosis")

    # Create a validated feature record
    record = DiagnosisFeature(
        Image="1-ABC",            # Target record RID
        Diagnosis_Type="Normal",  # Vocabulary term
        confidence=0.95,          # Metadata column
        Execution="2-XYZ",        # Provenance
    )

    # Convert to dict for insertion
    record.model_dump()
    # {'Image': '1-ABC', 'Diagnosis_Type': 'Normal', 'confidence': 0.95, ...}
Source code in src/deriva_ml/core/mixins/feature.py
fetch_table_features
fetch_table_features(
table: Table | str,
feature_name: str | None = None,
selector: Callable[
[list[FeatureRecord]],
FeatureRecord,
]
| None = None,
) -> dict[str, list[FeatureRecord]]
Fetch all feature values for a table, grouped by feature name.
Returns a dictionary mapping feature names to lists of FeatureRecord instances. This is useful for retrieving all annotations on a table in a single call — for example, getting all classification labels, quality scores, and bounding boxes for a set of images at once.
Selector for resolving multiple values:
An asset may have multiple values for the same feature — for example,
labels from different annotators, or predictions from successive model
runs. When a selector is provided, records are grouped by target
RID and the selector is called once per group to pick a single value.
Groups with only one record are passed through unchanged.
A selector is any callable with signature
(list[FeatureRecord]) -> FeatureRecord. Built-in selectors:
- `FeatureRecord.select_newest`: picks the record with the most recent `RCT` (Row Creation Time).
Custom selector example::
def select_highest_confidence(records):
return max(records, key=lambda r: getattr(r, "Confidence", 0))
For workflow-aware selection, see select_by_workflow().
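The custom selector above can be tried without a catalog connection. A minimal sketch using `SimpleNamespace` stand-ins for `FeatureRecord` instances; the `Confidence` field and the sample values are illustrative, not part of the DerivaML API:

```python
from types import SimpleNamespace

def select_highest_confidence(records):
    """Pick the record with the largest Confidence value (0 if absent)."""
    return max(records, key=lambda r: getattr(r, "Confidence", 0))

# Stand-ins for FeatureRecord instances that share the same target RID.
records = [
    SimpleNamespace(Image="1-ABC", Confidence=0.72),
    SimpleNamespace(Image="1-ABC", Confidence=0.95),
    SimpleNamespace(Image="1-ABC"),  # no Confidence attribute -> treated as 0
]

best = select_highest_confidence(records)
print(best.Confidence)  # 0.95
```

Passed as `selector=select_highest_confidence`, this would resolve each group of records for a target RID down to the highest-confidence value.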
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `Table \| str` | The table to fetch features for (name or Table object). | *required* |
| `feature_name` | `str \| None` | If provided, only fetch values for this specific feature. If None, fetch values for all features. | `None` |
| `selector` | `Callable[[list[FeatureRecord]], FeatureRecord] \| None` | Optional function to select among multiple feature values for the same target object. Receives a list of FeatureRecord instances (all for the same target RID) and returns the selected one. | `None` |
Returns:
| Type | Description |
|---|---|
| `dict[str, list[FeatureRecord]]` | Keys are feature names, values are lists of FeatureRecord instances. When a selector is provided, each target object appears at most once per feature. |
Raises:
| Type | Description |
|---|---|
| `DerivaMLException` | If a specified `feature_name` does not exist on the table. |
Examples:
Fetch all features for a table::
>>> features = ml.fetch_table_features("Image")
>>> for name, records in features.items():
... print(f"{name}: {len(records)} values")
Fetch a single feature with newest-value selection::
>>> features = ml.fetch_table_features(
... "Image",
... feature_name="Classification",
... selector=FeatureRecord.select_newest,
... )
Convert results to a DataFrame::
>>> features = ml.fetch_table_features("Image", feature_name="Quality")
>>> import pandas as pd
>>> df = pd.DataFrame([r.model_dump() for r in features["Quality"]])
Source code in src/deriva_ml/core/mixins/feature.py
find_assets
find_assets(
asset_table: Table
| str
| None = None,
asset_type: str | None = None,
) -> Iterable["Asset"]
Find assets in the catalog.
Returns an iterable of Asset objects matching the specified criteria. If no criteria are specified, returns all assets from all asset tables.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `asset_table` | `Table \| str \| None` | Optional table or table name to search. If None, searches all asset tables. | `None` |
| `asset_type` | `str \| None` | Optional asset type to filter by. Only returns assets with this type. | `None` |
Returns:
| Type | Description |
|---|---|
| `Iterable['Asset']` | Iterable of Asset objects matching the criteria. |
Example
    # Find all assets in the Model table
    models = list(ml.find_assets(asset_table="Model"))

    # Find all assets with type "Training_Data"
    training = list(ml.find_assets(asset_type="Training_Data"))

    # Find all assets across all tables
    all_assets = list(ml.find_assets())
Source code in src/deriva_ml/core/mixins/asset.py
find_datasets
find_datasets(
deleted: bool = False,
) -> Iterable["Dataset"]
List all datasets in the catalog.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `deleted` | `bool` | If True, include datasets that have been marked as deleted. | `False` |
Returns:
| Type | Description |
|---|---|
| `Iterable['Dataset']` | Iterable of Dataset objects. |
Example
    datasets = list(ml.find_datasets())
    for ds in datasets:
        print(f"{ds.dataset_rid}: {ds.description}")
Source code in src/deriva_ml/core/mixins/dataset.py
find_experiments
find_experiments(
workflow_rid: RID | None = None,
status: Status | None = None,
) -> Iterable["Experiment"]
List all experiments (executions with Hydra configuration) in the catalog.
Creates Experiment objects for analyzing completed ML model runs. Only returns executions that have Hydra configuration metadata (i.e., a config.yaml file in Execution_Metadata assets).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `workflow_rid` | `RID \| None` | Optional workflow RID to filter by. | `None` |
| `status` | `Status \| None` | Optional status to filter by (e.g., Status.Completed). | `None` |
Returns:
| Type | Description |
|---|---|
| `Iterable['Experiment']` | Iterable of Experiment objects for executions with Hydra config. |
Example
    experiments = list(ml.find_experiments(status=Status.Completed))
    for exp in experiments:
        print(f"{exp.name}: {exp.config_choices}")
Source code in src/deriva_ml/core/mixins/execution.py
find_features
find_features(
table: str | Table | None = None,
) -> list[Feature]
Find feature definitions in the schema.
Discovers features by inspecting the catalog schema for association tables
that have Feature_Name and Execution columns. Returns Feature objects
describing each feature's structure (target table, term/asset/value columns),
not the feature values themselves.
Use fetch_table_features or list_feature_values to retrieve actual
feature values.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `str \| Table \| None` | Optional table to find features for. If None, returns all feature definitions across all tables. | `None` |
Returns:
| Type | Description |
|---|---|
| `list[Feature]` | A list of Feature instances describing the feature definitions. |
Examples:
Find all feature definitions:

    >>> all_features = ml.find_features()
    >>> for f in all_features:
    ...     print(f"{f.target_table.name}.{f.feature_name}")

Find features defined on a specific table:

    >>> image_features = ml.find_features("Image")
    >>> print([f.feature_name for f in image_features])
Source code in src/deriva_ml/core/mixins/feature.py
find_workflows
find_workflows() -> list[Workflow]
Find all workflows in the catalog.
Catalog-level operation to find all workflow definitions, including their names, URLs, types, versions, and descriptions. Each returned Workflow is bound to the catalog, allowing its description to be updated.
Returns:
| Type | Description |
|---|---|
| `list[Workflow]` | List of workflow objects, each containing: `name` (workflow name), `url` (source code URL), `workflow_type` (type(s) of workflow), `version` (version identifier), `description` (workflow description), `rid` (resource identifier), `checksum` (source code checksum). |
Examples:
List all workflows and their descriptions::
>>> workflows = ml.find_workflows()
>>> for w in workflows:
... print(f"{w.name} (v{w.version}): {w.description}")
... print(f" Source: {w.url}")
Update a workflow's description (workflows are catalog-bound)::
>>> workflows = ml.find_workflows()
>>> workflows[0].description = "Updated description"
Source code in src/deriva_ml/core/mixins/workflow.py
from_context
classmethod
from_context(
path: Path | str | None = None,
) -> Self
Create a DerivaML instance from a .deriva-context.json file.
Searches for .deriva-context.json starting from path (default: cwd),
walking up parent directories. This enables scripts generated by Claude
to connect to the same catalog without hardcoding connection details.
The context file is written by the MCP server's connect_catalog tool
and contains hostname, catalog_id, and default_schema.
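The upward search can be sketched in a few lines. A minimal illustration of the documented lookup behavior, exercised against a temporary directory tree; the helper name `find_context_file` is hypothetical, not part of the DerivaML API:

```python
import json
import tempfile
from pathlib import Path

def find_context_file(start=None) -> Path:
    """Walk up from `start` (default: cwd) looking for .deriva-context.json."""
    directory = Path(start) if start else Path.cwd()
    for candidate_dir in [directory, *directory.parents]:
        candidate = candidate_dir / ".deriva-context.json"
        if candidate.is_file():
            return candidate
    raise FileNotFoundError("No .deriva-context.json found")

# Demonstrate: a context file at the root is found from a nested directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / ".deriva-context.json").write_text(
        json.dumps({"hostname": "deriva.example.org", "catalog_id": "1",
                    "default_schema": "my_domain"})
    )
    nested = root / "a" / "b"
    nested.mkdir(parents=True)
    print(find_context_file(nested).name)  # .deriva-context.json
```

`from_context()` performs this search internally and then connects using the hostname, catalog_id, and default_schema read from the file.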
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path \| str \| None` | Starting directory to search for the context file. Defaults to the current working directory. | `None` |
Returns:
| Type | Description |
|---|---|
| `Self` | A new DerivaML instance configured from the context file. |
Raises:
| Type | Description |
|---|---|
| `FileNotFoundError` | If no .deriva-context.json is found. |
Example::
# In a script generated by Claude:
from deriva_ml import DerivaML
ml = DerivaML.from_context()
subjects = ml.cache_table("Subject")
Source code in src/deriva_ml/core/base.py
get_cache_size
get_cache_size() -> dict[
str, int | float
]
Get the current size of the cache directory.
Returns:
| Type | Description |
|---|---|
| `dict[str, int \| float]` | dict with keys: `total_bytes` (total size in bytes), `total_mb` (total size in megabytes), `file_count` (number of files), `dir_count` (number of directories). |
Example
    ml = DerivaML('deriva.example.org', 'my_catalog')
    size = ml.get_cache_size()
    print(f"Cache size: {size['total_mb']:.1f} MB ({size['file_count']} files)")
Source code in src/deriva_ml/core/base.py
get_column_annotations
get_column_annotations(
table: str | Table, column_name: str
) -> dict[str, Any]
Get all display-related annotations for a column.
Returns the current values of display and column-display annotations for the specified column.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `str \| Table` | Table name or Table object containing the column. | *required* |
| `column_name` | `str` | Name of the column. | *required* |
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | Dictionary with keys: table, column, display, column_display. Missing annotations are None. |
Example
    annotations = ml.get_column_annotations("Image", "Filename")
    print(annotations["display"])
Source code in src/deriva_ml/core/mixins/annotation.py
get_handlebars_template_variables
get_handlebars_template_variables(
table: str | Table,
) -> dict[str, Any]
Get all available template variables for a table.
Returns the columns, foreign keys, and special variables that can be used in Handlebars templates (row_markdown_pattern, markdown_pattern, etc.) for the specified table.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `str \| Table` | Table name or Table object. | *required* |
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | Dictionary with columns, foreign_keys, special_variables, and helper_examples. |
Example
    vars = ml.get_handlebars_template_variables("Image")
    for col in vars["columns"]:
        print(f"{col['name']}: {col['template']}")
Source code in src/deriva_ml/core/mixins/annotation.py
get_storage_summary
get_storage_summary() -> dict[str, any]
Get a summary of local storage usage.
Returns:
| Type | Description |
|---|---|
| `dict[str, any]` | dict with keys: `working_dir` (path to working directory), `cache_dir` (path to cache directory), `cache_size_mb` (cache size in MB), `cache_file_count` (number of files in cache), `execution_dir_count` (number of execution directories), `execution_size_mb` (total size of execution directories in MB), `total_size_mb` (combined size in MB). |
Example
    ml = DerivaML('deriva.example.org', 'my_catalog')
    summary = ml.get_storage_summary()
    print(f"Total storage: {summary['total_size_mb']:.1f} MB")
    print(f"  Cache: {summary['cache_size_mb']:.1f} MB")
    print(f"  Executions: {summary['execution_size_mb']:.1f} MB")
Source code in src/deriva_ml/core/base.py
get_table_annotations
get_table_annotations(
table: str | Table,
) -> dict[str, Any]
Get all display-related annotations for a table.
Returns the current values of display, visible-columns, visible-foreign-keys, and table-display annotations for the specified table.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `str \| Table` | Table name or Table object. | *required* |
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | Dictionary with keys: table, schema, display, visible_columns, visible_foreign_keys, table_display. Missing annotations are None. |
Example
    annotations = ml.get_table_annotations("Image")
    print(annotations["visible_columns"])
Source code in src/deriva_ml/core/mixins/annotation.py
get_table_as_dataframe
get_table_as_dataframe(
table: str,
) -> pd.DataFrame
Get table contents as a pandas DataFrame.
Retrieves all contents of a table from the catalog.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `str` | Name of the table to retrieve. | *required* |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | DataFrame containing all table contents. |
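Conceptually, a DataFrame is a column-oriented view of the row dictionaries that `get_table_as_dict` yields. A minimal pure-Python sketch of that pivot, using illustrative rows in place of live catalog data; `rows_to_columns` is a hypothetical helper, not part of the DerivaML API:

```python
def rows_to_columns(rows: list[dict]) -> dict[str, list]:
    """Pivot row-oriented dicts into a column-oriented dict (DataFrame-like)."""
    columns: dict[str, list] = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns

# Illustrative rows standing in for catalog table contents.
rows = [
    {"RID": "1-A", "Filename": "img1.png"},
    {"RID": "1-B", "Filename": "img2.png"},
]
print(rows_to_columns(rows))
# {'RID': ['1-A', '1-B'], 'Filename': ['img1.png', 'img2.png']}
```

`get_table_as_dataframe` returns the equivalent structure as a pandas DataFrame, ready for filtering, grouping, and joining.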
Source code in src/deriva_ml/core/mixins/path_builder.py
get_table_as_dict
get_table_as_dict(
table: str,
) -> Iterable[dict[str, Any]]
Get table contents as dictionaries.
Retrieves all contents of a table from the catalog.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `str` | Name of the table to retrieve. | *required* |
Returns:
| Type | Description |
|---|---|
| `Iterable[dict[str, Any]]` | Iterable yielding dictionaries for each row. |
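Because rows arrive as plain dictionaries, they compose directly with standard library tools. A minimal sketch tallying one column lazily, with illustrative rows standing in for a live catalog; the table and column names are hypothetical:

```python
from collections import Counter

# Stand-in for ml.get_table_as_dict("Image"); real rows come from the catalog.
rows = iter([
    {"RID": "1-A", "Quality": "Good"},
    {"RID": "1-B", "Quality": "Poor"},
    {"RID": "1-C", "Quality": "Good"},
])

# Tally a column lazily, without materializing the whole table.
counts = Counter(row["Quality"] for row in rows)
print(counts)  # Counter({'Good': 2, 'Poor': 1})
```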
Source code in src/deriva_ml/core/mixins/path_builder.py
globus_login
staticmethod
globus_login(host: str) -> None
Authenticate with Globus to obtain credentials for a Deriva server.
Initiates a Globus Native Login flow to obtain OAuth2 tokens required by the Deriva server. The flow uses a device-code grant (no browser or local server), and stores refresh tokens so that subsequent calls can re-authenticate silently. The BDBag keychain is also updated so that bag downloads can use the same credentials.
If the user is already logged in for the given host, a message is printed and no further action is taken.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `host` | `str` | Hostname of the Deriva server to authenticate with (e.g., `www.eye-ai.org`). | *required* |
Example
    DerivaML.globus_login('www.eye-ai.org')
    'Login Successful'
Source code in src/deriva_ml/core/base.py
instantiate
classmethod
instantiate(
config: DerivaMLConfig,
) -> Self
Create a DerivaML instance from a configuration object.
This method is the preferred way to instantiate DerivaML when using hydra-zen for configuration management. It accepts a DerivaMLConfig (Pydantic model) and unpacks it to create the instance.
This pattern allows hydra-zen's instantiate() to work with DerivaML:
Example with hydra-zen:

    from hydra_zen import builds, instantiate
    from deriva_ml import DerivaML
    from deriva_ml.core.config import DerivaMLConfig

    # Create a structured config using hydra-zen
    DerivaMLConf = builds(DerivaMLConfig, populate_full_signature=True)

    # Configure for your environment
    conf = DerivaMLConf(
        hostname='deriva.example.org',
        catalog_id='42',
        domain_schema='my_domain',
    )

    # Instantiate the config to get a DerivaMLConfig object
    config = instantiate(conf)

    # Create the DerivaML instance
    ml = DerivaML.instantiate(config)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `DerivaMLConfig` | A DerivaMLConfig object containing all configuration parameters. | *required* |
Returns:
| Type | Description |
|---|---|
| `Self` | A new DerivaML instance configured according to the config object. |
Note
The DerivaMLConfig class integrates with Hydra's configuration system
and registers custom resolvers for computing working directories.
See deriva_ml.core.config for details on configuration options.
Source code in src/deriva_ml/core/base.py
is_snapshot
is_snapshot() -> bool
Check whether this DerivaML instance is connected to a catalog snapshot.
Returns:
| Type | Description |
|---|---|
| `bool` | True if the underlying catalog has a snapshot timestamp, False otherwise. |
Source code in src/deriva_ml/core/base.py
list_asset_executions
list_asset_executions(
asset_rid: str,
asset_role: str | None = None,
) -> list["ExecutionRecord"]
List all executions associated with an asset.
Given an asset RID, returns a list of executions that created or used the asset, along with the role (Input/Output) in each execution.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `asset_rid` | `str` | The RID of the asset to look up. | *required* |
| `asset_role` | `str \| None` | Optional filter for asset role ('Input' or 'Output'). If None, returns all associations. | `None` |
Returns:
| Type | Description |
|---|---|
| `list['ExecutionRecord']` | List of ExecutionRecord objects for the executions associated with this asset. |
Raises:
| Type | Description |
|---|---|
| `DerivaMLException` | If the asset RID is not found or not an asset. |
Example
    # Find all executions that created this asset
    executions = ml.list_asset_executions("1-abc123", asset_role="Output")
    for exe in executions:
        print(f"Created by execution {exe.execution_rid}")

    # Find all executions that used this asset as input
    executions = ml.list_asset_executions("1-abc123", asset_role="Input")
Source code in src/deriva_ml/core/mixins/asset.py
list_asset_tables
list_asset_tables() -> list[Table]
List all asset tables in the catalog.
Returns:
| Type | Description |
|---|---|
| `list[Table]` | List of Table objects that are asset tables. |
Example
    for table in ml.list_asset_tables():
        print(f"Asset table: {table.name}")
Source code in src/deriva_ml/core/mixins/asset.py
list_assets
list_assets(
asset_table: Table | str,
) -> list["Asset"]
Lists contents of an asset table.
Returns a list of Asset objects for the specified asset table.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `asset_table` | `Table \| str` | Table or name of the asset table to list assets for. | *required* |
Returns:
| Type | Description |
|---|---|
| `list['Asset']` | List of Asset objects for the assets in the table. |
Raises:
| Type | Description |
|---|---|
| `DerivaMLException` | If the table is not an asset table or doesn't exist. |
Example
    assets = ml.list_assets("Image")
    for asset in assets:
        print(f"{asset.asset_rid}: {asset.filename}")
Source code in src/deriva_ml/core/mixins/asset.py
list_dataset_element_types
list_dataset_element_types() -> (
Iterable[Table]
)
List the types of entities that can be added to a dataset.
Returns:
| Type | Description |
|---|---|
| `Iterable[Table]` | An iterable of Table objects that can be included as an element of a dataset. |
Source code in src/deriva_ml/core/mixins/dataset.py
list_execution_dirs
list_execution_dirs() -> list[
dict[str, any]
]
List execution working directories.
Returns information about each execution directory in the working directory, useful for identifying orphaned or incomplete execution outputs.
Returns:
| Type | Description |
|---|---|
| `list[dict[str, any]]` | List of dicts, each containing: `execution_rid` (the execution RID, i.e. the directory name), `path` (full path to the directory), `size_bytes` (total size in bytes), `size_mb` (total size in megabytes), `modified` (last modification time, a datetime), `file_count` (number of files). |
Example
    ml = DerivaML('deriva.example.org', 'my_catalog')
    dirs = ml.list_execution_dirs()
    for d in dirs:
        print(f"{d['execution_rid']}: {d['size_mb']:.1f} MB")
Source code in src/deriva_ml/core/base.py
list_feature_values
list_feature_values(
table: Table | str,
feature_name: str,
selector: Callable[
[list[FeatureRecord]],
FeatureRecord,
]
| None = None,
) -> Iterable[FeatureRecord]
Retrieve all values for a single feature as typed FeatureRecord instances.
Convenience wrapper around fetch_table_features() for the common
case of querying a single feature by name. Returns a flat list of
FeatureRecord objects — one per feature value (or one per target object
when a selector is provided).
Each returned record is a dynamically-generated Pydantic model with
typed fields matching the feature's definition. For example, an
Image_Classification feature might produce records with fields
Image (str), Image_Class (str), Execution (str),
RCT (str), and Feature_Name (str).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `Table \| str` | The table the feature is defined on (name or Table object). | *required* |
| `feature_name` | `str` | Name of the feature to retrieve values for. | *required* |
| `selector` | `Callable[[list[FeatureRecord]], FeatureRecord] \| None` | Optional function to resolve multiple values per target. See fetch_table_features for details. | `None` |
Returns:
| Type | Description |
|---|---|
| `Iterable[FeatureRecord]` | FeatureRecord instances with typed fields matching the feature's definition; one per feature value, or one per target object when a selector is provided. |
Raises:
| Type | Description |
|---|---|
| `DerivaMLException` | If the feature doesn't exist on the table. |
Examples:
Get typed feature records::
>>> for record in ml.list_feature_values("Image", "Quality"):
... print(f"Image {record.Image}: {record.ImageQuality}")
... print(f"Created by execution: {record.Execution}")
Select newest when multiple values exist::
>>> records = list(ml.list_feature_values(
... "Image", "Quality",
... selector=FeatureRecord.select_newest,
... ))
Convert to a list of dicts::
>>> dicts = [r.model_dump() for r in
... ml.list_feature_values("Image", "Classification")]
Source code in src/deriva_ml/core/mixins/feature.py
list_files
list_files(
file_types: list[str] | None = None,
) -> list[dict[str, Any]]
Lists files in the catalog with their metadata.
Returns a list of files with their metadata including URL, MD5 hash, length, description, and associated file types. Files can be optionally filtered by type.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `file_types` | `list[str] \| None` | Filter results to only include these file types. | `None` |
Returns:
| Type | Description |
|---|---|
| `list[dict[str, Any]]` | List of file records, each containing: RID (resource identifier), URL (file location), MD5 (file hash), Length (file size), Description (file description), File_Types (list of associated file types). |
Examples:
List all files:

    >>> files = ml.list_files()
    >>> for f in files:
    ...     print(f"{f['RID']}: {f['URL']}")

Filter by file type:

    >>> image_files = ml.list_files(["image", "png"])
Source code in src/deriva_ml/core/mixins/file.py
list_foreign_keys
list_foreign_keys(
table: str | Table,
) -> dict[str, Any]
List all foreign keys related to a table.
Returns both outbound foreign keys (from this table to others) and inbound foreign keys (from other tables to this one). Useful for determining valid constraint names for visible-columns and visible-foreign-keys annotations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `str \| Table` | Table name or Table object. | *required* |
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | Dictionary with `outbound` and `inbound` lists of foreign keys. Each foreign key contains constraint_name, from_table, from_columns, to_table, to_columns. |
Example
    fkeys = ml.list_foreign_keys("Image")
    for fk in fkeys["outbound"]:
        print(f"{fk['constraint_name']} -> {fk['to_table']}")
Source code in src/deriva_ml/core/mixins/annotation.py
list_vocabulary_terms
list_vocabulary_terms(
table: str | Table,
) -> list[VocabularyTerm]
Lists all terms in a vocabulary table.
Retrieves all terms, their descriptions, and synonyms from a controlled vocabulary table.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `str \| Table` | Vocabulary table to list terms from (name or Table object). | *required* |
Returns:
| Type | Description |
|---|---|
| `list[VocabularyTerm]` | List of vocabulary terms with their metadata. |
Raises:
| Type | Description |
|---|---|
| `DerivaMLException` | If table doesn't exist or is not a vocabulary table. |
Examples:
>>> terms = ml.list_vocabulary_terms("tissue_types")
>>> for term in terms:
... print(f"{term.name}: {term.description}")
... if term.synonyms:
... print(f" Synonyms: {', '.join(term.synonyms)}")
Source code in src/deriva_ml/core/mixins/vocabulary.py
lookup_asset
lookup_asset(asset_rid: RID) -> 'Asset'
Look up an asset by its RID.
Returns an Asset object for the specified RID. The asset can be from any asset table in the catalog.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `asset_rid` | `RID` | The RID of the asset to look up. | *required* |

Returns:

| Type | Description |
|---|---|
| `'Asset'` | Asset object for the specified RID. |

Raises:

| Type | Description |
|---|---|
| `DerivaMLException` | If the RID is not found or is not an asset. |
Example

>>> asset = ml.lookup_asset("3JSE")
>>> print(f"File: {asset.filename}, Table: {asset.asset_table}")
Source code in src/deriva_ml/core/mixins/asset.py
lookup_dataset
lookup_dataset(
dataset: RID | DatasetSpec,
deleted: bool = False,
) -> "Dataset"
Look up a dataset by RID or DatasetSpec.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dataset` | `RID \| DatasetSpec` | Dataset RID or DatasetSpec to look up. | *required* |
| `deleted` | `bool` | If True, include datasets that have been marked as deleted. | `False` |

Returns:

| Name | Type | Description |
|---|---|---|
| Dataset | `'Dataset'` | The dataset object for the specified RID. |

Raises:

| Type | Description |
|---|---|
| `DerivaMLException` | If the dataset is not found. |
Example

>>> dataset = ml.lookup_dataset("4HM")
>>> print(f"Version: {dataset.current_version}")
Source code in src/deriva_ml/core/mixins/dataset.py
lookup_execution
lookup_execution(
execution_rid: RID,
) -> "ExecutionRecord"
Look up an execution by RID and return an ExecutionRecord.
Creates an ExecutionRecord object for querying and modifying execution metadata. The ExecutionRecord provides access to the catalog record state and allows updating mutable properties like status and description.
For running computations with datasets and assets, use restore_execution()
or create_execution() which return full Execution objects.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `execution_rid` | `RID` | Resource Identifier (RID) of the execution. | *required* |

Returns:

| Name | Type | Description |
|---|---|---|
| ExecutionRecord | `'ExecutionRecord'` | An execution record object bound to the catalog. |

Raises:

| Type | Description |
|---|---|
| `DerivaMLException` | If execution_rid is not valid or doesn't refer to an Execution record. |
Example
Look up an execution and query its state::
>>> record = ml.lookup_execution("1-abc123")
>>> print(f"Status: {record.status}")
>>> print(f"Description: {record.description}")
Update mutable properties::
>>> record.status = Status.completed
>>> record.description = "Analysis finished"
Query relationships::
>>> children = list(record.list_nested_executions())
>>> parents = list(record.list_parent_executions())
Source code in src/deriva_ml/core/mixins/execution.py
lookup_experiment
lookup_experiment(
execution_rid: RID,
) -> "Experiment"
Look up an experiment by execution RID.
Creates an Experiment object for analyzing completed executions. Provides convenient access to execution metadata, configuration choices, model parameters, inputs, and outputs.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `execution_rid` | `RID` | Resource Identifier (RID) of the execution. | *required* |

Returns:

| Name | Type | Description |
|---|---|---|
| Experiment | `'Experiment'` | An experiment object for the given execution RID. |
Example

>>> exp = ml.lookup_experiment("47BE")
>>> print(exp.name)            # e.g., "cifar10_quick"
>>> print(exp.config_choices)  # Hydra config names used
>>> print(exp.model_config)    # Model hyperparameters
Source code in src/deriva_ml/core/mixins/execution.py
lookup_feature
lookup_feature(
table: str | Table,
feature_name: str,
) -> Feature
Look up a feature definition by table and name.
Returns a Feature object that describes the schema structure
of a feature — not the feature values themselves. A Feature is a
schema-level descriptor derived by inspecting the catalog's
association tables. It tells you:
- What table the feature annotates (`target_table`), e.g., Image
- Where values are stored (`feature_table`), the association table linking targets to values and executions
- What kind of values it holds, classified by column role:
    - `term_columns`: columns referencing controlled vocabulary tables (e.g., a `Diagnosis_Type` column pointing to a vocabulary of diagnosis terms)
    - `asset_columns`: columns referencing asset tables (e.g., a `Segmentation_Mask` column)
    - `value_columns`: columns holding direct values like floats, ints, or text (e.g., a `confidence` score)
The Feature object also provides feature_record_class(), which
returns a dynamically generated Pydantic model for constructing
validated feature records to insert into the catalog.
To retrieve actual feature values, use fetch_table_features
or list_feature_values instead.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `str \| Table` | The table the feature is defined on (name or Table object). | *required* |
| `feature_name` | `str` | Name of the feature to look up. | *required* |

Returns:

| Type | Description |
|---|---|
| `Feature` | A Feature schema descriptor. |

Raises:

| Type | Description |
|---|---|
| `DerivaMLException` | If the feature doesn't exist on the specified table. |
Example

>>> feature = ml.lookup_feature("Image", "Classification")
>>> print(f"Feature: {feature.feature_name}")
>>> print(f"Stored in: {feature.feature_table.name}")
>>> print(f"Term columns: {[c.name for c in feature.term_columns]}")
>>> print(f"Value columns: {[c.name for c in feature.value_columns]}")
Source code in src/deriva_ml/core/mixins/feature.py
lookup_term
lookup_term(
table: str | Table, term_name: str
) -> VocabularyTermHandle
Finds a term in a vocabulary table.
Searches for a term in the specified vocabulary table, matching either the primary name or any of its synonyms. Results are cached for performance; subsequent lookups in the same vocabulary table are served from the cache.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `str \| Table` | Vocabulary table to search in (name or Table object). | *required* |
| `term_name` | `str` | Name or synonym of the term to find. | *required* |

Returns:

| Name | Type | Description |
|---|---|---|
| VocabularyTermHandle | `VocabularyTermHandle` | The matching vocabulary term, with methods to modify it. |

Raises:

| Type | Description |
|---|---|
| `DerivaMLVocabularyException` | If the table is not a vocabulary table, or the term is not found. |
Examples:
Look up by primary name:

>>> term = ml.lookup_term("tissue_types", "epithelial")
>>> print(term.description)

Look up by synonym:

>>> term = ml.lookup_term("tissue_types", "epithelium")

Modify the term:

>>> term = ml.lookup_term("tissue_types", "epithelial")
>>> term.description = "Updated description"
>>> term.synonyms = ("epithelium", "epithelial_tissue")
Source code in src/deriva_ml/core/mixins/vocabulary.py
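The cached name-or-synonym lookup described above can be sketched with an in-memory dict standing in for the vocabulary table. This is an illustration of the caching behavior, not the actual implementation; the `fetch` callable and row shape are assumptions:

```python
class VocabCache:
    """Per-table term cache: fetch each vocabulary table at most once."""

    def __init__(self, fetch):
        self._fetch = fetch      # callable: table name -> list of term dicts
        self._cache: dict[str, list[dict]] = {}
        self.fetches = 0         # count of simulated round trips

    def lookup(self, table: str, term_name: str) -> dict:
        if table not in self._cache:
            self.fetches += 1
            self._cache[table] = self._fetch(table)
        for term in self._cache[table]:
            # Match the primary name or any synonym.
            if term_name == term["Name"] or term_name in term.get("Synonyms", []):
                return term
        raise KeyError(f"{term_name} not found in {table}")

rows = [{"Name": "epithelial", "Synonyms": ["epithelium"]}]
cache = VocabCache(lambda table: list(rows))
assert cache.lookup("tissue_types", "epithelium")["Name"] == "epithelial"
cache.lookup("tissue_types", "epithelial")
assert cache.fetches == 1    # second lookup served from cache
```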
lookup_workflow
lookup_workflow(rid: RID) -> Workflow
Look up a workflow by its Resource Identifier (RID).
Retrieves a workflow from the catalog by its RID and returns a Workflow object bound to the catalog. The returned Workflow can be modified (e.g., updating its description) and changes will be reflected in the catalog.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `rid` | `RID` | Resource Identifier of the workflow to look up. | *required* |

Returns:

| Name | Type | Description |
|---|---|---|
| Workflow | `Workflow` | The workflow object bound to this catalog, allowing properties like `description` to be read and updated. |

Raises:

| Type | Description |
|---|---|
| `DerivaMLException` | If the RID does not correspond to a workflow in the catalog. |
Examples:
Look up a workflow and read its properties::
>>> workflow = ml.lookup_workflow("2-ABC1")
>>> print(f"Name: {workflow.name}")
>>> print(f"Description: {workflow.description}")
>>> print(f"Type: {workflow.workflow_type}")
Update a workflow's description (persisted to catalog)::
>>> workflow = ml.lookup_workflow("2-ABC1")
>>> workflow.description = "Updated analysis pipeline for RNA sequences"
>>> # The change is immediately written to the catalog
Attempting to update on a read-only catalog raises an error::
>>> snapshot = ml.catalog_snapshot("2023-01-15T10:30:00")
>>> workflow = snapshot.lookup_workflow("2-ABC1")
>>> workflow.description = "New description"
DerivaMLException: Cannot update workflow description on a read-only
catalog snapshot. Use a writable catalog connection instead.
Source code in src/deriva_ml/core/mixins/workflow.py
lookup_workflow_by_url
lookup_workflow_by_url(
url_or_checksum: str,
) -> Workflow
Look up a workflow by URL or checksum and return the full Workflow object.
Searches for a workflow in the catalog that matches the given URL or checksum and returns a Workflow object bound to the catalog. This allows you to both identify a workflow by its source code location and modify its properties (e.g., description).
The URL should be a GitHub URL pointing to the specific version of the workflow source code. The format typically includes the commit hash::

    https://github.com/org/repo/blob/<commit_hash>/path/to/workflow.py

Alternatively, you can search by the Git object hash (checksum) of the workflow file.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `url_or_checksum` | `str` | GitHub URL with commit hash, or Git object hash (checksum) of the workflow file. | *required* |

Returns:

| Name | Type | Description |
|---|---|---|
| Workflow | `Workflow` | The workflow object bound to this catalog, allowing properties like `description` to be read and updated. |

Raises:

| Type | Description |
|---|---|
| `DerivaMLException` | If no workflow with the given URL or checksum is found in the catalog. |
Examples:
Look up a workflow by its GitHub URL::
>>> url = "https://github.com/org/repo/blob/abc123/analysis.py"
>>> workflow = ml.lookup_workflow_by_url(url)
>>> print(f"Found: {workflow.name}")
>>> print(f"Version: {workflow.version}")
Look up by Git object hash (checksum)::
>>> workflow = ml.lookup_workflow_by_url("abc123def456789...")
>>> print(f"Name: {workflow.name}")
>>> print(f"URL: {workflow.url}")
Update the workflow's description after lookup::
>>> workflow = ml.lookup_workflow_by_url(url)
>>> workflow.description = "Updated analysis pipeline"
>>> # The change is persisted to the catalog
Typical GitHub URL formats supported::
# Full blob URL with commit hash
https://github.com/org/repo/blob/abc123def/src/workflow.py
# The URL is matched exactly, so ensure it matches what was
# recorded when the workflow was registered
Source code in src/deriva_ml/core/mixins/workflow.py
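For the checksum form: a Git object hash for a file (what `git hash-object <file>` prints) is a SHA-1 over a `blob <size>\0` header followed by the file contents. A self-contained sketch of that computation, useful for deriving the checksum of a local workflow file; whether DerivaML computes it exactly this way internally is an assumption:

```python
import hashlib

def git_blob_hash(data: bytes) -> str:
    """Compute the Git blob object hash: SHA-1 of 'blob <len>\\0' + contents."""
    header = f"blob {len(data)}\0".encode()
    return hashlib.sha1(header + data).hexdigest()

# Matches: printf 'hello\n' | git hash-object --stdin
print(git_blob_hash(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a
```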
pathBuilder
pathBuilder() -> SchemaWrapper
Returns catalog path builder for queries.
The path builder provides a fluent interface for constructing complex queries against the catalog. This is a core component used by many other methods to interact with the catalog.
Returns:

| Type | Description |
|---|---|
| `SchemaWrapper` | A new instance of the catalog path builder. |
Example

>>> path = ml.pathBuilder.schemas['my_schema'].tables['my_table']
>>> results = path.entities().fetch()
Source code in src/deriva_ml/core/mixins/path_builder.py
prefetch_dataset
prefetch_dataset(
dataset: "DatasetSpec",
materialize: bool = True,
) -> dict[str, Any]
Deprecated: Use cache_dataset() instead.
Source code in src/deriva_ml/core/mixins/dataset.py
remove_visible_column
remove_visible_column(
table: str | Table,
context: str,
column: str | list[str] | int,
) -> list[Any]
Remove a column from the visible-columns list for a specific context.
Convenience method for removing columns without replacing the entire visible-columns annotation. Changes are staged until apply_annotations() is called.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `str \| Table` | Table name or Table object. | *required* |
| `context` | `str` | The context to modify (e.g., "compact", "detailed"). | *required* |
| `column` | `str \| list[str] \| int` | Column to remove. Can be a string (column name to find and remove), a list (foreign key reference `[schema, constraint]` to find and remove), or an integer (index position to remove, 0-indexed). | *required* |

Returns:

| Type | Description |
|---|---|
| `list[Any]` | The updated column list for the context. |

Raises:

| Type | Description |
|---|---|
| `DerivaMLException` | If annotation or context doesn't exist, or column not found. |
Example

>>> ml.remove_visible_column("Image", "compact", "Description")
>>> ml.remove_visible_column("Image", "compact", 0)  # Remove first column
>>> ml.apply_annotations()
Source code in src/deriva_ml/core/mixins/annotation.py
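The three removal forms can be sketched over a plain column list. This illustrates the selection semantics only, with a hypothetical `remove_column` helper; the real method stages the change against the annotation rather than returning a list directly:

```python
def remove_column(columns: list, column) -> list:
    """Return a copy of columns with one entry removed, by index or by value."""
    cols = list(columns)
    if isinstance(column, int):
        del cols[column]        # integer: remove by 0-based index
    else:
        cols.remove(column)     # string name or [schema, constraint] pair
    return cols

cols = ["RID", ["domain", "Image_Subject_fkey"], "Description"]
print(remove_column(cols, "Description"))
print(remove_column(cols, 0))
```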
remove_visible_foreign_key
remove_visible_foreign_key(
table: str | Table,
context: str,
foreign_key: list[str] | int,
) -> list[Any]
Remove a foreign key from the visible-foreign-keys list for a specific context.
Convenience method for removing related tables without replacing the entire visible-foreign-keys annotation. Changes are staged until apply_annotations() is called.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `str \| Table` | Table name or Table object. | *required* |
| `context` | `str` | The context to modify (e.g., "detailed", "*"). | *required* |
| `foreign_key` | `list[str] \| int` | Foreign key to remove. Can be a list (foreign key reference `[schema, constraint]` to find and remove) or an integer (index position to remove, 0-indexed). | *required* |

Returns:

| Type | Description |
|---|---|
| `list[Any]` | The updated foreign key list for the context. |

Raises:

| Type | Description |
|---|---|
| `DerivaMLException` | If annotation or context doesn't exist, or foreign key not found. |
Example

>>> ml.remove_visible_foreign_key("Subject", "detailed", ["domain", "Image_Subject_fkey"])
>>> ml.remove_visible_foreign_key("Subject", "detailed", 0)  # Remove first
>>> ml.apply_annotations()
Source code in src/deriva_ml/core/mixins/annotation.py
reorder_visible_columns
reorder_visible_columns(
table: str | Table,
context: str,
new_order: list[int]
| list[
str | list[str] | dict[str, Any]
],
) -> list[Any]
Reorder columns in the visible-columns list for a specific context.
Convenience method for reordering columns without manually reconstructing the list. Changes are staged until apply_annotations() is called.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `str \| Table` | Table name or Table object. | *required* |
| `context` | `str` | The context to modify (e.g., "compact", "detailed"). | *required* |
| `new_order` | `list[int] \| list[str \| list[str] \| dict[str, Any]]` | The new order specification. Can be a list of indices (`[2, 0, 1, 3]` reorders by current positions) or a list of column specs (`["Name", "RID", ...]` specifies the exact order). | *required* |

Returns:

| Type | Description |
|---|---|
| `list[Any]` | The reordered column list. |

Raises:

| Type | Description |
|---|---|
| `DerivaMLException` | If annotation or context doesn't exist, or invalid order. |
Example

>>> ml.reorder_visible_columns("Image", "compact", [2, 0, 1, 3, 4])
>>> ml.reorder_visible_columns("Image", "compact", ["Filename", "Subject", "RID"])
>>> ml.apply_annotations()
Source code in src/deriva_ml/core/mixins/annotation.py
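The two order forms can be sketched as a pure list operation. The permutation check here is an assumption about what "invalid order" means for the index form; `reorder` is a hypothetical helper, not the staged annotation call:

```python
def reorder(current: list, new_order: list) -> list:
    """Apply either reorder form: a permutation of indices, or explicit specs."""
    if new_order and all(isinstance(x, int) for x in new_order):
        # Index form: must be a permutation of all current positions.
        if sorted(new_order) != list(range(len(current))):
            raise ValueError("index order must cover every current position once")
        return [current[i] for i in new_order]
    # Spec form: taken as the exact new list.
    return list(new_order)

cols = ["RID", "Filename", "Subject"]
print(reorder(cols, [2, 0, 1]))
print(reorder(cols, ["Filename", "Subject", "RID"]))
```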
reorder_visible_foreign_keys
reorder_visible_foreign_keys(
table: str | Table,
context: str,
new_order: list[int]
| list[list[str] | dict[str, Any]],
) -> list[Any]
Reorder foreign keys in the visible-foreign-keys list for a specific context.
Convenience method for reordering related tables without manually reconstructing the list. Changes are staged until apply_annotations() is called.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `str \| Table` | Table name or Table object. | *required* |
| `context` | `str` | The context to modify (e.g., "detailed", "*"). | *required* |
| `new_order` | `list[int] \| list[list[str] \| dict[str, Any]]` | The new order specification. Can be a list of indices (`[2, 0, 1]` reorders by current positions) or a list of foreign key refs (`[["schema", "fkey1"], ...]` specifies the exact order). | *required* |

Returns:

| Type | Description |
|---|---|
| `list[Any]` | The reordered foreign key list. |

Raises:

| Type | Description |
|---|---|
| `DerivaMLException` | If annotation or context doesn't exist, or invalid order. |
Example

>>> ml.reorder_visible_foreign_keys("Subject", "detailed", [2, 0, 1])
>>> ml.apply_annotations()
Source code in src/deriva_ml/core/mixins/annotation.py
resolve_rid
resolve_rid(
rid: RID,
) -> ResolveRidResult
Resolves RID to catalog location.
Looks up a RID and returns information about where it exists in the catalog, including schema, table, and column metadata.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `rid` | `RID` | Resource Identifier to resolve. | *required* |

Returns:

| Name | Type | Description |
|---|---|---|
| ResolveRidResult | `ResolveRidResult` | Named tuple containing: schema (schema name), table (table name), columns (column definitions), and datapath (path builder for accessing the entity). |

Raises:

| Type | Description |
|---|---|
| `DerivaMLException` | If RID doesn't exist in catalog. |
Examples:
>>> result = ml.resolve_rid("1-abc123")
>>> print(f"Found in {result.schema}.{result.table}")
>>> data = result.datapath.entities().fetch()
Source code in src/deriva_ml/core/mixins/rid_resolution.py
resolve_rids
resolve_rids(
rids: set[RID] | list[RID],
candidate_tables: list[Table]
| None = None,
) -> dict[RID, BatchRidResult]
Batch resolve multiple RIDs efficiently.
Resolves multiple RIDs in batched queries, significantly faster than calling resolve_rid() for each RID individually. Instead of N network calls for N RIDs, this makes one query per candidate table.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `rids` | `set[RID] \| list[RID]` | Set or list of RIDs to resolve. | *required* |
| `candidate_tables` | `list[Table] \| None` | Optional list of Table objects to search in. If not provided, searches all tables in the domain and ML schemas. | `None` |

Returns:

| Type | Description |
|---|---|
| `dict[RID, BatchRidResult]` | Mapping from each resolved RID to its BatchRidResult containing table information. |

Raises:

| Type | Description |
|---|---|
| `DerivaMLException` | If any RID cannot be resolved. |
Example

>>> results = ml.resolve_rids(["1-ABC", "2-DEF", "3-GHI"])
>>> for rid, info in results.items():
...     print(f"{rid} is in table {info.table_name}")
Source code in src/deriva_ml/core/mixins/rid_resolution.py
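The batching payoff (one query per candidate table instead of one per RID) can be simulated locally. The fake in-memory "catalog" and call counter below are illustrative assumptions; the point is only the query-count arithmetic:

```python
# Fake catalog: table name -> set of RIDs stored in that table.
catalog = {"Image": {"1-ABC", "2-DEF"}, "Subject": {"3-GHI"}}
calls = {"n": 0}

def query_table(table: str, rids: set) -> dict:
    """Simulate one network round trip resolving many RIDs against one table."""
    calls["n"] += 1
    return {rid: table for rid in rids & catalog[table]}

def resolve_rids_batched(rids: set, candidate_tables: list) -> dict:
    """One query per candidate table, skipping RIDs already resolved."""
    resolved: dict = {}
    for table in candidate_tables:
        resolved.update(query_table(table, rids - resolved.keys()))
    return resolved

result = resolve_rids_batched({"1-ABC", "2-DEF", "3-GHI"}, ["Image", "Subject"])
print(calls["n"])   # 2 round trips for 3 RIDs, not 3
```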
restore_execution
restore_execution(
execution_rid: RID | None = None,
) -> "Execution"
Restores a previous execution.
Given an execution RID, retrieves the execution configuration and restores the local compute environment. This routine has a number of side effects.
- The datasets specified in the configuration are downloaded and placed in the cache-dir. If a version is not specified in the configuration, then a new minor version number is created for the dataset and downloaded.
- If any execution assets are provided in the configuration, they are downloaded and placed in the working directory.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `execution_rid` | `RID \| None` | Resource Identifier (RID) of the execution to restore. | `None` |

Returns:

| Name | Type | Description |
|---|---|---|
| Execution | `'Execution'` | An execution object representing the restored execution environment. |

Raises:

| Type | Description |
|---|---|
| `DerivaMLException` | If execution_rid is not valid or execution cannot be restored. |
Example

>>> execution = ml.restore_execution("1-abc123")
Source code in src/deriva_ml/core/mixins/execution.py
retrieve_rid
retrieve_rid(
rid: RID,
) -> dict[str, Any]
Retrieves complete record for RID.
Fetches all column values for the entity identified by the RID.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `rid` | `RID` | Resource Identifier of the record to retrieve. | *required* |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Dictionary containing all column values for the entity. |

Raises:

| Type | Description |
|---|---|
| `DerivaMLException` | If the RID doesn't exist in the catalog. |
Example

>>> record = ml.retrieve_rid("1-abc123")
>>> print(f"Name: {record['name']}, Created: {record['creation_date']}")
Source code in src/deriva_ml/core/mixins/rid_resolution.py
select_by_workflow
select_by_workflow(
records: list[FeatureRecord],
workflow: str,
) -> FeatureRecord
Select the newest feature record created by a specific workflow.
Filters a list of FeatureRecord instances to only those whose
Execution was created by a matching workflow, then returns the
newest match by RCT. This is useful when multiple model runs or
annotators have labeled the same data and you want to use values
from a particular workflow.
Resolution chain:
The workflow argument is first tried as a Workflow RID. If no
workflow is found with that RID, it is treated as a Workflow_Type
name (e.g., "Training", "Feature_Creation"). The resolution
chain is:
- `workflow` → `Workflow.RID` → all Executions for that workflow
- `workflow` → `Workflow_Type.Name` → all Workflows of that type → all Executions for those workflows
Matching records are then filtered by Execution and the newest
(by RCT) is returned.
Note: Unlike FeatureRecord.select_newest, this method cannot be
passed directly as a selector argument because it requires catalog
access. Call it directly on a list of records instead.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `records` | `list[FeatureRecord]` | List of FeatureRecord instances to select from. Typically all values for a single target object from one feature. | *required* |
| `workflow` | `str` | Either a Workflow RID or a Workflow_Type name (e.g., "Training"). | *required* |

Returns:

| Type | Description |
|---|---|
| `FeatureRecord` | The newest FeatureRecord whose execution matches the workflow. |

Raises:

| Type | Description |
|---|---|
| `DerivaMLException` | If no workflows match the given identifier, no executions exist for the matched workflow(s), or no records in the input list were created by matching executions. |
Examples:
Select the newest label from any Training workflow::
>>> all_values = ml.list_feature_values("Image", "Classification")
>>> from collections import defaultdict
>>> by_image = defaultdict(list)
>>> for v in all_values:
... by_image[v.Image].append(v)
>>> selected = {
... img: ml.select_by_workflow(recs, "Training")
... for img, recs in by_image.items()
... }
Select by a specific workflow RID::
>>> record = ml.select_by_workflow(records, "2-ABC1")
Source code in src/deriva_ml/core/mixins/feature.py
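The filter-then-newest-by-RCT logic can be sketched with stand-in records. The `Rec` dataclass and the `execution_workflow` mapping are hypothetical; in the real method that mapping is resolved from the catalog:

```python
from dataclasses import dataclass

@dataclass
class Rec:
    Execution: str   # RID of the execution that produced this record
    RCT: str         # creation timestamp; ISO strings sort lexicographically

# Hypothetical mapping from Execution RID to its Workflow.
execution_workflow = {"E1": "W-train", "E2": "W-train", "E3": "W-label"}

def select_by_workflow_sketch(records: list[Rec], workflow: str) -> Rec:
    """Keep records from matching executions, then return the newest by RCT."""
    matching = [r for r in records
                if execution_workflow[r.Execution] == workflow]
    if not matching:
        raise ValueError(f"no records created by workflow {workflow}")
    return max(matching, key=lambda r: r.RCT)

recs = [Rec("E1", "2024-01-01"), Rec("E2", "2024-06-01"), Rec("E3", "2024-12-01")]
print(select_by_workflow_sketch(recs, "W-train").Execution)
```

Note that although E3's record is newest overall, it loses to E2 because only W-train executions are considered.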
set_column_display
set_column_display(
table: str | Table,
column_name: str,
annotation: dict[str, Any] | None,
) -> str
Set the column-display annotation on a column.
Controls how a column's values are rendered, including custom formatting and markdown patterns. Changes are staged locally until apply_annotations() is called.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `str \| Table` | Table name or Table object containing the column. | *required* |
| `column_name` | `str` | Name of the column. | *required* |
| `annotation` | `dict[str, Any] \| None` | The column-display annotation value. Set to None to remove. | *required* |

Returns:

| Type | Description |
|---|---|
| `str` | Column identifier (table.column). |
Example

>>> ml.set_column_display("Measurement", "Value", {
...     "*": {"pre_format": {"format": "%.2f"}}
... })
>>> ml.apply_annotations()
Source code in src/deriva_ml/core/mixins/annotation.py
set_display_annotation
set_display_annotation(
table: str | Table,
annotation: dict[str, Any] | None,
column_name: str | None = None,
) -> str
Set the display annotation on a table or column.
The display annotation controls basic naming and display options. Changes are staged locally until apply_annotations() is called.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `str \| Table` | Table name or Table object. | *required* |
| `annotation` | `dict[str, Any] \| None` | The display annotation value. Set to None to remove. | *required* |
| `column_name` | `str \| None` | If provided, sets annotation on the column; otherwise on the table. | `None` |

Returns:

| Type | Description |
|---|---|
| `str` | Target identifier (table name or table.column). |
Example

>>> ml.set_display_annotation("Image", {"name": "Images"})
>>> ml.set_display_annotation("Image", {"name": "File Name"}, column_name="Filename")
>>> ml.apply_annotations()  # Commit changes
Source code in src/deriva_ml/core/mixins/annotation.py
set_table_display
set_table_display(
table: str | Table,
annotation: dict[str, Any] | None,
) -> str
Set the table-display annotation on a table.
Controls table-level display options like row naming patterns, page size, and row ordering. Changes are staged locally until apply_annotations() is called.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `str \| Table` | Table name or Table object. | *required* |
| `annotation` | `dict[str, Any] \| None` | The table-display annotation value. Set to None to remove. | *required* |

Returns:

| Type | Description |
|---|---|
| `str` | Table name. |
Example

>>> ml.set_table_display("Subject", {
...     "row_name": {
...         "row_markdown_pattern": "{{{Name}}} ({{{Species}}})"
...     }
... })
>>> ml.apply_annotations()
Source code in src/deriva_ml/core/mixins/annotation.py
set_visible_columns
set_visible_columns(
table: str | Table,
annotation: dict[str, Any] | None,
) -> str
Set the visible-columns annotation on a table.
Controls which columns appear in different UI contexts and their order. Changes are staged locally until apply_annotations() is called.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `str \| Table` | Table name or Table object. | *required* |
| `annotation` | `dict[str, Any] \| None` | The visible-columns annotation value. Set to None to remove. | *required* |

Returns:

| Type | Description |
|---|---|
| `str` | Table name. |
Example

>>> ml.set_visible_columns("Image", {
...     "compact": ["RID", "Filename", "Subject"],
...     "detailed": ["RID", "Filename", "Subject", "Description"]
... })
>>> ml.apply_annotations()
Source code in src/deriva_ml/core/mixins/annotation.py
set_visible_foreign_keys
set_visible_foreign_keys(
table: str | Table,
annotation: dict[str, Any] | None,
) -> str
Set the visible-foreign-keys annotation on a table.
Controls which related tables (via inbound foreign keys) appear in different UI contexts and their order. Changes are staged locally until apply_annotations() is called.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `str \| Table` | Table name or Table object. | *required* |
| `annotation` | `dict[str, Any] \| None` | The visible-foreign-keys annotation value. Set to None to remove. | *required* |

Returns:

| Type | Description |
|---|---|
| `str` | Table name. |
Example

>>> ml.set_visible_foreign_keys("Subject", {
...     "detailed": [
...         ["domain", "Image_Subject_fkey"],
...         ["domain", "Diagnosis_Subject_fkey"]
...     ]
... })
>>> ml.apply_annotations()
Source code in src/deriva_ml/core/mixins/annotation.py
table_path
table_path(
table: str | Table,
schema: str | None = None,
) -> Path
Returns a local filesystem path for table CSV files.
Generates a standardized path where CSV files should be placed when preparing to upload data to a table. The path follows the project's directory structure conventions.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| table | str \| Table | Name of the table or Table object to get the path for. | required |
| schema | str \| None | Schema name for the path. If None, uses the table's schema or default_schema. | None |

Returns:

| Name | Type | Description |
|---|---|---|
| Path | Path | Filesystem path where the CSV file should be placed. |

Example

>>> path = ml.table_path("experiment_results")
>>> df.to_csv(path)  # Save data for upload
Source code in src/deriva_ml/core/mixins/path_builder.py
user_list
user_list() -> List[Dict[str, str]]
Returns catalog user list.
Retrieves basic information about all users who have access to the catalog, including their identifiers and full names.
Returns:

| Type | Description |
|---|---|
| List[Dict[str, str]] | List of user information dictionaries, each containing 'ID' (user identifier) and 'Full_Name' (user's full name). |
Examples:
>>> users = ml.user_list()
>>> for user in users:
... print(f"{user['Full_Name']} ({user['ID']})")
Source code in src/deriva_ml/core/base.py
validate_schema
validate_schema(
strict: bool = False,
) -> "SchemaValidationReport"
Validate that the catalog's ML schema matches the expected structure.
This method inspects the catalog schema and verifies that it contains all the required tables, columns, vocabulary terms, and relationships that are created by the ML schema initialization routines in create_schema.py.
The validation checks:
- All required ML tables exist (Dataset, Execution, Workflow, etc.)
- All required columns exist with correct types
- All required vocabulary tables exist (Asset_Type, Dataset_Type, etc.)
- All required vocabulary terms are initialized
- All association tables exist for relationships

In strict mode, the validator also reports errors for:
- Extra tables not in the expected schema
- Extra columns not in the expected table definitions

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| strict | bool | If True, extra tables and columns are reported as errors. If False (default), they are reported as informational items. Use strict=True to verify that a clean ML catalog matches exactly; use strict=False to validate a catalog that may have domain extensions. | False |

Returns:

| Type | Description |
|---|---|
| SchemaValidationReport | Validation results. Key attributes: is_valid (True if no errors were found), errors (error-level issues), warnings (warning-level issues), info (informational items), to_text() (human-readable report), to_dict() (JSON-serializable dictionary). |

Example

>>> ml = DerivaML('localhost', 'my_catalog')
>>> report = ml.validate_schema(strict=False)
>>> if report.is_valid:
...     print("Schema is valid!")
... else:
...     print(report.to_text())

Strict validation for a fresh ML catalog:

>>> report = ml.validate_schema(strict=True)
>>> print(f"Found {len(report.errors)} errors, {len(report.warnings)} warnings")

Get the report as a dictionary for JSON/logging:

>>> import json
>>> print(json.dumps(report.to_dict(), indent=2))
Note
This method validates the ML schema (typically 'deriva-ml'), not the domain schema. Domain-specific tables and columns are not checked unless they are part of the ML schema itself.
See Also
- deriva_ml.schema.validation.SchemaValidationReport
- deriva_ml.schema.validation.validate_ml_schema
Source code in src/deriva_ml/core/base.py
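The shape of the returned report can be pictured with a minimal, hypothetical dataclass. This is an illustrative sketch of the attributes listed above, not the real SchemaValidationReport from deriva_ml.schema.validation, which may carry richer per-issue detail:

```python
from dataclasses import dataclass, field

@dataclass
class SchemaValidationReport:
    # Hypothetical minimal shape mirroring the documented attributes.
    errors: list = field(default_factory=list)
    warnings: list = field(default_factory=list)
    info: list = field(default_factory=list)

    @property
    def is_valid(self) -> bool:
        # Only error-level issues invalidate the schema.
        return not self.errors

    def to_dict(self) -> dict:
        return {
            "is_valid": self.is_valid,
            "errors": self.errors,
            "warnings": self.warnings,
            "info": self.info,
        }

    def to_text(self) -> str:
        lines = [f"valid: {self.is_valid}"]
        lines += [f"ERROR: {e}" for e in self.errors]
        lines += [f"WARN:  {w}" for w in self.warnings]
        return "\n".join(lines)

report = SchemaValidationReport(warnings=["extra column Image.Notes"])
print(report.is_valid)  # prints "True": warnings alone do not invalidate
```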
DerivaMLConfig
Bases: BaseModel
Configuration model for DerivaML instances.
This Pydantic model defines all configurable parameters for a DerivaML instance. It can be used directly or via Hydra configuration files.
Attributes:

| Name | Type | Description |
|---|---|---|
| hostname | str | Hostname of the Deriva server (e.g., 'deriva.example.org'). |
| catalog_id | str \| int | Catalog identifier, either a numeric ID or a catalog name. |
| domain_schemas | str \| set[str] \| None | Optional set of domain schema names. If None, auto-detects all non-system schemas. Use this when working with catalogs that have multiple user-defined schemas. |
| default_schema | str \| None | The default schema for table creation operations. If None and there is exactly one domain schema, that schema is used. If there are multiple domain schemas, this must be specified for table creation to work without explicit schema parameters. |
| project_name | str \| None | Project name for organizing outputs. Defaults to default_schema. |
| cache_dir | str \| Path \| None | Directory for caching downloaded datasets. Defaults to working_dir/cache. |
| working_dir | str \| Path \| None | Base directory for computation data. Defaults to ~/deriva-ml. |
| hydra_runtime_output_dir | str \| Path \| None | Hydra's runtime output directory (set automatically). |
| ml_schema | str | Schema name for ML tables. Defaults to 'deriva-ml'. |
| logging_level | Any | Logging level for DerivaML. Defaults to WARNING. |
| deriva_logging_level | Any | Logging level for Deriva libraries. Defaults to WARNING. |
| credential | Any | Authentication credentials. If None, retrieved automatically. |
| s3_bucket | str \| None | S3 bucket URL for dataset bag storage (e.g., 's3://my-bucket'). If provided, enables MINID creation and S3 upload for dataset exports. If None, MINID functionality is disabled regardless of the use_minid setting. |
| use_minid | bool \| None | Whether to use the MINID service for dataset bags. Only effective when s3_bucket is configured. Defaults to True when s3_bucket is set, False otherwise. |
| check_auth | bool | Whether to verify authentication on connection. Defaults to True. |
| clean_execution_dir | bool | Whether to automatically clean execution working directories after a successful upload. Defaults to True. Set to False to retain local copies of execution outputs for debugging or manual inspection. |

Example

>>> config = DerivaMLConfig(
...     hostname='deriva.example.org',
...     catalog_id=1,
...     default_schema='my_domain',
...     logging_level=logging.INFO
... )
Source code in src/deriva_ml/core/config.py
compute_workdir
staticmethod
compute_workdir(
working_dir: str | Path | None,
catalog_id: str | int | None = None,
hostname: str | None = None,
) -> Path
Compute the effective working directory path.
Creates a standardized working directory path. If a base directory is provided, appends the current username to prevent conflicts between users. If no directory is provided, uses ~/.deriva-ml. The hostname and catalog_id are appended to separate data from different servers and catalogs.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| working_dir | str \| Path \| None | Base working directory path, or None for the default. | required |
| catalog_id | str \| int \| None | Catalog identifier to include in the path. If None, no catalog subdirectory is created. | None |
| hostname | str \| None | Server hostname to include in the path. If None, no hostname subdirectory is created. | None |

Returns:

| Name | Type | Description |
|---|---|---|
| Path | Path | Absolute path to the working directory. |

Example

>>> DerivaMLConfig.compute_workdir('/shared/data', '52', 'ml.example.org')
PosixPath('/shared/data/username/deriva-ml/ml.example.org/52')
>>> DerivaMLConfig.compute_workdir(None, 1, 'localhost')
PosixPath('/home/username/.deriva-ml/localhost/1')
Source code in src/deriva_ml/core/config.py
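The path-construction rules above can be sketched as a standalone function. This is an illustrative re-implementation inferred from the description and examples, not the library source; the exact layering is an assumption:

```python
import getpass
from pathlib import Path

def compute_workdir(working_dir=None, catalog_id=None, hostname=None) -> Path:
    # A user-supplied base dir gets <username>/deriva-ml appended to
    # avoid conflicts between users; the default is a hidden directory
    # under the user's home.
    if working_dir is not None:
        base = Path(working_dir) / getpass.getuser() / "deriva-ml"
    else:
        base = Path.home() / ".deriva-ml"
    # Hostname and catalog id separate data from different servers/catalogs.
    if hostname is not None:
        base = base / hostname
    if catalog_id is not None:
        base = base / str(catalog_id)
    return base.absolute()
```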
init_working_dir
init_working_dir() -> DerivaMLConfig
Initialize working directory and resolve use_minid after model validation.
Sets up the working directory path, computing a default if not specified. Also captures Hydra's runtime output directory for logging and outputs.
Resolves the use_minid flag based on the s3_bucket configuration:
- If use_minid is explicitly set, that value is used (but it only takes effect if s3_bucket is set)
- If use_minid is None (auto), it is set to True if s3_bucket is configured, False otherwise
This validator runs after all field validation and ensures the working directory is available for Hydra configuration resolution.
Returns:

| Name | Type | Description |
|---|---|---|
| Self | DerivaMLConfig | The configuration instance with initialized paths. |
Source code in src/deriva_ml/core/config.py
DerivaMLException
Bases: Exception
Base exception class for all DerivaML errors.
This is the root exception for all DerivaML-specific errors. Catching this exception will catch any error raised by the DerivaML library.
Attributes:

| Name | Type | Description |
|---|---|---|
| _msg | | The error message stored for later access. |

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| msg | str | Descriptive error message. Defaults to an empty string. | '' |

Example

>>> raise DerivaMLException("Failed to connect to catalog")
DerivaMLException: Failed to connect to catalog
Source code in src/deriva_ml/core/exceptions.py
DerivaMLInvalidTerm
Bases: DerivaMLNotFoundError
Exception raised when a vocabulary term is not found or invalid.
Raised when attempting to look up or use a term that doesn't exist in a controlled vocabulary table, or when a term name/synonym cannot be resolved.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| vocabulary | str | Name of the vocabulary table being searched. | required |
| term | str | The term name that was not found. | required |
| msg | str | Additional context about the error. Defaults to "Term doesn't exist". | "Term doesn't exist" |

Example

>>> raise DerivaMLInvalidTerm("Diagnosis", "unknown_condition")
DerivaMLInvalidTerm: Invalid term unknown_condition in vocabulary Diagnosis: Term doesn't exist.
Source code in src/deriva_ml/core/exceptions.py
DerivaMLTableTypeError
Bases: DerivaMLDataError
Exception raised when a RID or table is not of the expected type.
Raised when an operation requires a specific table type (e.g., Dataset, Execution) but receives a RID or table reference of a different type.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| table_type | str | The expected table type (e.g., "Dataset", "Execution"). | required |
| table | str | The actual table name or RID that was provided. | required |

Example

>>> raise DerivaMLTableTypeError("Dataset", "1-ABC123")
DerivaMLTableTypeError: Table 1-ABC123 is not of type Dataset.
Source code in src/deriva_ml/core/exceptions.py
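The inheritance relationships documented above (DerivaMLException at the root, with DerivaMLNotFoundError and DerivaMLDataError branches) can be sketched as follows. This is an illustrative outline with simplified message formatting, not the code in deriva_ml.core.exceptions:

```python
class DerivaMLException(Exception):
    # Root of the hierarchy: catching this catches any DerivaML error.
    def __init__(self, msg: str = ""):
        super().__init__(msg)
        self._msg = msg

class DerivaMLDataError(DerivaMLException):
    pass

class DerivaMLNotFoundError(DerivaMLException):
    pass

class DerivaMLInvalidTerm(DerivaMLNotFoundError):
    def __init__(self, vocabulary: str, term: str, msg: str = "Term doesn't exist"):
        super().__init__(f"Invalid term {term} in vocabulary {vocabulary}: {msg}.")

class DerivaMLTableTypeError(DerivaMLDataError):
    def __init__(self, table_type: str, table: str):
        super().__init__(f"Table {table} is not of type {table_type}.")

# Catching the root type catches every DerivaML error:
try:
    raise DerivaMLInvalidTerm("Diagnosis", "unknown_condition")
except DerivaMLException as e:
    print(e)  # prints "Invalid term unknown_condition in vocabulary Diagnosis: Term doesn't exist."
```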
ExecAssetType
Bases: BaseStrEnum
Execution asset type identifiers.
Defines the types of assets that can be produced or consumed during an execution. These types are used to categorize files associated with workflow runs.
Attributes:

| Name | Type | Description |
|---|---|---|
| input_file | str | Input file consumed by the execution. |
| output_file | str | Output file produced by the execution. |
| notebook_output | str | Jupyter notebook output from the execution. |
| model_file | str | Machine learning model file (e.g., .pkl, .h5, .pt). |
Source code in src/deriva_ml/core/enums.py
ExecMetadataType
Bases: BaseStrEnum
Execution metadata type identifiers.
Defines the types of metadata that can be associated with an execution.
Attributes:

| Name | Type | Description |
|---|---|---|
| execution_config | str | General execution configuration data. |
| runtime_env | str | Runtime environment information. |
| hydra_config | str | Hydra YAML configuration files (config.yaml, overrides.yaml). |
| deriva_config | str | DerivaML execution configuration (configuration.json). |
Source code in src/deriva_ml/core/enums.py
FileSpec
Bases: BaseModel
Specification for a file to be added to the Deriva catalog.
Represents file metadata required for creating entries in the File table. Handles URL normalization, ensuring local file paths are converted to tag URIs that uniquely identify the file's origin.
Attributes:

| Name | Type | Description |
|---|---|---|
| url | str | File location as a URL or local path. Local paths are converted to tag URIs. |
| md5 | str | MD5 checksum for integrity verification. |
| length | int | File size in bytes. |
| description | str \| None | Optional description of the file's contents or purpose. |
| file_types | list[str] \| None | List of file type classifications from the Asset_Type vocabulary. |
Note
The 'File' type is automatically added to file_types if not present when using create_filespecs().
Example
>>> spec = FileSpec(
...     url="/data/results.csv",
...     md5="d41d8cd98f00b204e9800998ecf8427e",
...     length=1024,
...     description="Analysis results",
...     file_types=["CSV", "Data"]
... )
Source code in src/deriva_ml/core/filespec.py
create_filespecs
classmethod
create_filespecs(
path: Path | str,
description: str,
file_types: list[str]
| Callable[[Path], list[str]]
| None = None,
) -> Generator[FileSpec, None, None]
Generate FileSpec objects for a file or directory.
Creates FileSpec objects with computed MD5 checksums for each file found. For directories, recursively processes all files. The 'File' type is automatically prepended to file_types if not already present.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | Path \| str | Path to a file or directory. If a directory, all files are processed recursively. | required |
| description | str | Description to apply to all generated FileSpecs. | required |
| file_types | list[str] \| Callable[[Path], list[str]] \| None | Either a static list of file types, or a callable that takes a Path and returns a list of types for that specific file. Allows dynamic type assignment based on file extension, content, etc. | None |

Yields:

| Name | Type | Description |
|---|---|---|
| FileSpec | FileSpec | A specification for each file with computed checksums and metadata. |

Example

Static file types:

>>> specs = FileSpec.create_filespecs("/data/images", "Images", ["Image"])

Dynamic file types based on extension:

>>> def get_types(path):
...     ext = path.suffix.lower()
...     return {".png": ["PNG", "Image"], ".jpg": ["JPEG", "Image"]}.get(ext, [])
>>> specs = FileSpec.create_filespecs("/data", "Mixed files", get_types)
Source code in src/deriva_ml/core/filespec.py
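The behavior described above — recursive directory walk, MD5 computation, and automatic prepending of the 'File' type — can be sketched with plain dicts. The function name iter_filespecs is a hypothetical stand-in for the real classmethod, which yields validated FileSpec models:

```python
import hashlib
from pathlib import Path

def iter_filespecs(path, description, file_types=None):
    # Walk a single file or a directory tree and yield one record per file.
    root = Path(path)
    files = [root] if root.is_file() else [p for p in root.rglob("*") if p.is_file()]
    for f in files:
        data = f.read_bytes()
        # file_types may be a static list or a per-file callable.
        types = file_types(f) if callable(file_types) else list(file_types or [])
        if "File" not in types:
            types.insert(0, "File")  # the 'File' type is always present
        yield {
            "url": f.as_posix(),
            "md5": hashlib.md5(data).hexdigest(),
            "length": len(data),
            "description": description,
            "file_types": types,
        }
```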
read_filespec
staticmethod
read_filespec(
path: Path | str,
) -> Generator[FileSpec, None, None]
Read FileSpec objects from a JSON Lines file.
Parses a JSONL file where each line is a JSON object representing a FileSpec. Empty lines are skipped. This is useful for batch processing pre-computed file specifications.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | Path \| str | Path to the .jsonl file containing FileSpec data. | required |

Yields:

| Name | Type | Description |
|---|---|---|
| FileSpec | FileSpec | Parsed FileSpec object for each valid line. |

Example

>>> for spec in FileSpec.read_filespec("files.jsonl"):
...     print(f"{spec.url}: {spec.md5}")
Source code in src/deriva_ml/core/filespec.py
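A minimal sketch of the JSON Lines reading behavior described above, yielding plain dicts rather than validated FileSpec models:

```python
import json
from pathlib import Path

def read_filespec(path):
    # One JSON object per line; blank lines are skipped.
    for line in Path(path).read_text().splitlines():
        if line.strip():
            yield json.loads(line)
```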
validate_file_url
classmethod
validate_file_url(url: str) -> str
Examine the provided URL. If it's a local path, convert it into a tag URL.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| url | str | The URL to validate and potentially convert. | required |

Returns:

| Type | Description |
|---|---|
| str | The validated/converted URL. |

Raises:

| Type | Description |
|---|---|
| ValidationError | If the URL is not a file URL. |
Source code in src/deriva_ml/core/filespec.py
FileUploadState
Bases: BaseModel
Tracks the state and result of a file upload operation.
Attributes:

| Name | Type | Description |
|---|---|---|
| state | UploadState | Current state of the upload (success, failed, etc.). |
| status | str | Detailed status message. |
| result | Any | Upload result data, if any. |
Source code in src/deriva_ml/core/ermrest.py
LoggerMixin
Mixin class that provides a _logger attribute.
Classes that inherit from this mixin get a _logger property that returns a child logger under the deriva_ml namespace, named after the class.
Example
>>> class MyProcessor(LoggerMixin):
...     def process(self):
...         self._logger.info("Processing started")
...
>>> MyProcessor().process()  # logs to 'deriva_ml.MyProcessor'
Source code in src/deriva_ml/core/logging_config.py
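A minimal sketch of how such a mixin can be implemented (illustrative, not the library source):

```python
import logging

class LoggerMixin:
    # Expose a child logger under the 'deriva_ml' namespace,
    # named after the concrete class that inherits the mixin.
    @property
    def _logger(self) -> logging.Logger:
        return logging.getLogger(f"deriva_ml.{type(self).__name__}")

class MyProcessor(LoggerMixin):
    def process(self):
        self._logger.info("Processing started")

print(MyProcessor()._logger.name)  # prints "deriva_ml.MyProcessor"
```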
MLAsset
Bases: BaseStrEnum
Asset type identifiers.
Defines the types of assets that can be associated with executions.
Attributes:

| Name | Type | Description |
|---|---|---|
| execution_metadata | str | Metadata about an execution. |
| execution_asset | str | Asset produced by an execution. |
Source code in src/deriva_ml/core/enums.py
MLVocab
Bases: BaseStrEnum
Controlled vocabulary table identifiers.
Defines the names of controlled vocabulary tables used in DerivaML. These tables store standardized terms with descriptions and synonyms for consistent data classification across the catalog.
Attributes:

| Name | Type | Description |
|---|---|---|
| dataset_type | str | Dataset classification vocabulary (e.g., "Training", "Test"). |
| workflow_type | str | Workflow classification vocabulary (e.g., "Python", "Notebook"). |
| asset_type | str | Asset/file type classification vocabulary (e.g., "Image", "CSV"). |
| asset_role | str | Asset role vocabulary for execution relationships (e.g., "Input", "Output"). |
| feature_name | str | Feature name vocabulary for ML feature definitions. |
Source code in src/deriva_ml/core/enums.py
UploadState
Bases: Enum
File upload operation states.
Represents the various states a file upload operation can be in, from initiation to completion.
Attributes:

| Name | Type | Description |
|---|---|---|
| success | int | Upload completed successfully. |
| failed | int | Upload failed. |
| pending | int | Upload is queued. |
| running | int | Upload is in progress. |
| paused | int | Upload is temporarily paused. |
| aborted | int | Upload was aborted. |
| cancelled | int | Upload was cancelled. |
| timeout | int | Upload timed out. |
Source code in src/deriva_ml/core/enums.py
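As a rough picture, such a state enum might be defined like this. Only the member names come from the table above; the concrete integer values are assumptions, not the values in deriva_ml.core.enums:

```python
from enum import Enum

class UploadState(Enum):
    # Member names match the documented attributes; values are illustrative.
    success = 0
    failed = 1
    pending = 2
    running = 3
    paused = 4
    aborted = 5
    cancelled = 6
    timeout = 7

# Typical use: branch on the state recorded in a FileUploadState result.
state = UploadState.success
print(state.name)  # prints "success"
```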
configure_logging
configure_logging(
level: int = logging.WARNING,
deriva_level: int | None = None,
format_string: str = DEFAULT_FORMAT,
handler: Handler | None = None,
) -> logging.Logger
Configure logging for DerivaML and related libraries.
This function sets up logging levels for DerivaML, related libraries (deriva-py, bdbag, bagit), and Hydra loggers. It is designed to:
- Configure only specific logger namespaces, not the root logger
- Respect Hydra's logging configuration when running under Hydra
- Allow deriva-py libraries to have a separate logging level
The logging level hierarchy:
- deriva_ml logger: uses level
- Hydra loggers: follow level (the deriva_ml level)
- Deriva/bdbag/bagit loggers: use deriva_level (defaults to level)

When running under Hydra:
- Only sets log levels on specific loggers
- Does NOT add handlers (Hydra has already configured them)
- Does NOT call basicConfig()

When running standalone (no Hydra):
- Sets log levels on specific loggers
- Adds a StreamHandler to the deriva_ml logger if none exists
- Still does NOT touch the root logger or call basicConfig()
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| level | int | Log level for the deriva_ml and Hydra loggers. Defaults to WARNING. | WARNING |
| deriva_level | int \| None | Log level for deriva-py libraries (deriva, bagit, bdbag). If None, uses the same value as level. | None |
| format_string | str | Format string for log messages (used only when adding handlers outside a Hydra context). | DEFAULT_FORMAT |
| handler | Handler \| None | Optional handler to add to the deriva_ml logger. If None and not running under Hydra, uses a StreamHandler with format_string. | None |
|
Returns:

| Type | Description |
|---|---|
| Logger | The configured deriva_ml logger. |

Example

>>> import logging

Same level for everything:

>>> configure_logging(level=logging.DEBUG)

Verbose DerivaML, quieter deriva-py libraries:

>>> configure_logging(
...     level=logging.INFO,
...     deriva_level=logging.WARNING,
... )
Source code in src/deriva_ml/core/logging_config.py
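The namespace-scoped behavior described above can be sketched in a few lines. This is a simplified stand-in that ignores Hydra detection and handler formatting; it only shows how levels are set on named loggers without touching the root logger:

```python
import logging

def configure_logging(level=logging.WARNING, deriva_level=None):
    # Scope configuration to specific logger namespaces, never the root.
    deriva_level = level if deriva_level is None else deriva_level
    logger = logging.getLogger("deriva_ml")
    logger.setLevel(level)
    # deriva-py and bag-related libraries get their own, possibly quieter, level.
    for name in ("deriva", "bdbag", "bagit"):
        logging.getLogger(name).setLevel(deriva_level)
    # Standalone mode: attach a handler only if none exists.
    if not logger.handlers:
        logger.addHandler(logging.StreamHandler())
    return logger

log = configure_logging(level=logging.INFO, deriva_level=logging.ERROR)
```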
get_logger
get_logger(
name: str | None = None,
) -> logging.Logger
Get a DerivaML logger.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str \| None | Optional sub-logger name. If provided, returns a child logger under the deriva_ml namespace (e.g., 'deriva_ml.dataset'). If None, returns the main deriva_ml logger. | None |

Returns:

| Type | Description |
|---|---|
| Logger | The configured logger instance. |

Example

>>> logger = get_logger()  # Main deriva_ml logger
>>> dataset_logger = get_logger("dataset")  # deriva_ml.dataset
Source code in src/deriva_ml/core/logging_config.py
is_hydra_initialized
is_hydra_initialized() -> bool
Check if running within an initialized Hydra context.
This is used to determine whether Hydra is managing logging configuration. When Hydra is initialized, we avoid adding handlers or calling basicConfig since Hydra has already configured logging via dictConfig.
Returns:

| Type | Description |
|---|---|
| bool | True if Hydra's GlobalHydra is initialized, False otherwise. |

Example

>>> if is_hydra_initialized():
...     # Hydra is managing logging
...     pass
Source code in src/deriva_ml/core/logging_config.py