Feature Classes
Feature management for ML experiments. Features represent measurable properties or characteristics that can be attached to domain entities and tracked across executions.
Feature implementation for deriva-ml.
This module provides classes for defining and managing features in deriva-ml. Features represent measurable properties or characteristics associated with records in a target table (e.g., a diagnostic label on an Image row).
Exported classes
- Feature: Encapsulates a feature's schema: target table, vocabulary columns, asset columns, and value columns. Obtained via DerivaML.create_feature or DerivaML.lookup_feature. Not constructed directly.
- FeatureRecord: Pydantic base class for dynamically generated feature record models. Subclasses are created by Feature.feature_record_class().
Selector methods (static and class methods on FeatureRecord):
- FeatureRecord.select_newest(records): Returns the record with the most recent RCT (Row Creation Time). Useful when multiple annotators have labelled the same object.
- FeatureRecord.select_first(records): Returns the record with the earliest RCT. Useful to preserve the original annotation.
- FeatureRecord.select_latest(records): Alias for select_newest.
- FeatureRecord.select_by_execution(execution_rid): Returns a selector that picks the newest record from a specific execution run.
- FeatureRecord.select_by_workflow(workflow, *, container): Returns a selector that picks the newest record from any execution of the named workflow. Resolves the execution list eagerly at construction time.
- FeatureRecord.select_majority_vote(column): Returns a selector that picks the most common value for a column (consensus labeling).
Typical usage
    feature = ml.lookup_feature("Image", "Diagnosis")  # doctest: +SKIP
    DiagnosisRecord = feature.feature_record_class()  # doctest: +SKIP
    record = DiagnosisRecord(Diagnosis="benign", Confidence=0.97)  # doctest: +SKIP
Feature
Manages feature definitions and their relationships in the catalog.
A Feature represents a measurable property or characteristic that can be associated with records in a table. Features can include asset references, controlled vocabulary terms, and custom metadata fields.
Attributes:
| Name | Type | Description |
|---|---|---|
| feature_table | Table | Table containing the feature implementation. |
| target_table | Table | Table that the feature is associated with. |
| feature_name | str | Name of the feature (from Feature_Name column default). |
| feature_columns | set[Column] | All columns specific to this feature. |
| asset_columns | set[Column] | Columns referencing asset tables. |
| term_columns | set[Column] | Columns referencing vocabulary tables. |
| value_columns | set[Column] | Columns containing direct values (not FK references). |
Example
    feature = ml.lookup_feature("Image", "Diagnosis")  # doctest: +SKIP
    print(f"Feature {feature.feature_name} on {feature.target_table.name}")  # doctest: +SKIP
    print("Asset columns:", [c.name for c in feature.asset_columns])  # doctest: +SKIP
Source code in src/deriva_ml/feature.py
__init__
__init__(
atable: FindAssociationResult,
model: DerivaModel,
) -> None
Initialize a Feature from an association table result.
Classifies the feature table's columns into three disjoint sets:
asset_columns (FK to an asset table), term_columns (FK to a
vocabulary table), and value_columns (everything else, that is,
columns that are not FK references). The association FKs linking back
to the target table and to the feature name vocabulary are excluded
before classification.
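The classification described above can be sketched with plain sets. This is only an illustration under stated assumptions: `Col`, its `fk_target_kind` field, and `classify` are hypothetical stand-ins invented for this example, not the real ERMrest `Column` class or the constructor's actual code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Col:
    """Hypothetical stand-in for an ERMrest column (not the real Column class)."""
    name: str
    # "asset", "vocabulary", or None when the column is not an FK to either kind.
    fk_target_kind: Optional[str] = None


def classify(columns: set[Col], structural: set[str]) -> tuple[set[Col], set[Col], set[Col]]:
    """Split feature-specific columns into (asset, term, value) sets."""
    # Drop the structural association columns before classifying.
    candidates = {c for c in columns if c.name not in structural}
    asset = {c for c in candidates if c.fk_target_kind == "asset"}
    term = {c for c in candidates if c.fk_target_kind == "vocabulary"}
    value = candidates - asset - term  # whatever is left holds direct values
    return asset, term, value


cols = {
    Col("Image"),                       # FK back to the target table (structural)
    Col("Feature_Name", "vocabulary"),  # feature name vocabulary FK (structural)
    Col("Execution"),                   # structural
    Col("Diagnosis", "vocabulary"),
    Col("Heatmap", "asset"),
    Col("Confidence"),
}
asset, term, value = classify(cols, structural={"Image", "Feature_Name", "Execution"})
```

Because each candidate column lands in exactly one set, the three results are disjoint and their union is the full set of feature-specific columns.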
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| atable | FindAssociationResult | Result from … | required |
| model | DerivaModel | | required |
Note
This constructor is not part of the public API. Obtain Feature
instances via DerivaML.create_feature or
DerivaML.lookup_feature.
Source code in src/deriva_ml/feature.py
feature_record_class
feature_record_class() -> type[
FeatureRecord
]
Create a dynamically generated Pydantic model class for this feature.
Builds a FeatureRecord subclass with fields derived from the feature
table's columns. Column types are mapped as follows:
- Term columns (FK to vocabulary): str (vocabulary term name)
- Asset columns (FK to asset table): str | Path (file path)
- Value columns (direct data): typed per the database column type (int, float, bool, or str)
All feature-specific fields are Optional with a default of None
to allow partial construction when building records for insertion.
The Feature_Name field defaults to this feature's name.
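To make the idea of generating a record class from column metadata concrete, here is a rough sketch using only the standard library. It is not how `feature_record_class()` is implemented: the real method builds a Pydantic model, whereas this stand-in uses `dataclasses.make_dataclass` (which does no validation), and `COLUMNS`, `KIND_TO_TYPE`, and `build_record_class` are hypothetical names invented for the example.

```python
from dataclasses import field, make_dataclass
from pathlib import Path
from typing import Optional, Union

# Hypothetical column metadata: column name -> kind of column.
COLUMNS = {
    "Diagnosis": "term",     # FK to a vocabulary table
    "Heatmap": "asset",      # FK to an asset table
    "Confidence": "float",   # direct value column
}

# Type mapping that mirrors the list above.
KIND_TO_TYPE = {
    "term": str,
    "asset": Union[str, Path],
    "int": int,
    "float": float,
    "bool": bool,
    "str": str,
}


def build_record_class(feature_name: str, columns: dict[str, str]) -> type:
    """Build a record class whose feature fields are optional and default to None."""
    fields = [("Feature_Name", str, field(default=feature_name))]
    for name, kind in columns.items():
        fields.append((name, Optional[KIND_TO_TYPE[kind]], field(default=None)))
    return make_dataclass(f"{feature_name}Record", fields)


DiagnosisRecord = build_record_class("Diagnosis", COLUMNS)
rec = DiagnosisRecord(Diagnosis="benign")  # partial construction is allowed
```

Leaving every feature-specific field optional is what lets a caller set only the columns it has values for before insertion.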
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| self | | The … | required |
Returns:
| Type | Description |
|---|---|
| type[FeatureRecord] | A subclass of … schema. The class's … |
Raises:
| Type | Description |
|---|---|
| DerivaMLException | If the feature table schema cannot be read. |
Example
    feature = ml.lookup_feature("Image", "Diagnosis")  # doctest: +SKIP
    DiagnosisRecord = feature.feature_record_class()  # doctest: +SKIP
    rec = DiagnosisRecord(Diagnosis="benign")  # doctest: +SKIP
Source code in src/deriva_ml/feature.py
FeatureRecord
Bases: BaseModel
Base class for dynamically generated feature record models.
This class serves as the base for pydantic models that represent feature
records. Each feature record contains the values and metadata associated
with a feature instance. Subclasses are created dynamically by
Feature.feature_record_class() with fields corresponding to the
feature's vocabulary terms, asset references, and metadata columns.
Feature records are returned by feature_values(). They can also be
constructed manually and passed to Execution.add_features() to insert
new values into the catalog.
Handling multiple values per target:
When the same target object (e.g., an Image) has multiple feature values
— for example, labels from different annotators or model runs — use
a selector function to choose one. Pass it to feature_values.
A selector receives a list of FeatureRecord instances for the same target
and returns the selected one:
    # Built-in: pick the most recently created record
    for rec in ml.feature_values("Image", selector=FeatureRecord.select_newest):
        ...
    # Custom: pick the record with highest confidence
    # (feature fields default to None, so treat a missing Confidence as 0)
    def select_best(records):
        return max(records, key=lambda r: getattr(r, "Confidence", None) or 0)
    for rec in ml.feature_values("Image", selector=select_best):
        ...
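The selector contract can be illustrated without a catalog. The sketch below assumes that records are grouped by target RID and that the selector is called once per group; how `feature_values` actually does this is not shown on this page, so `Rec`, `pick_per_target`, and `newest` are hypothetical names used only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional


@dataclass
class Rec:
    """Hypothetical stand-in for a generated FeatureRecord subclass."""
    Image: str                 # RID of the target row
    Diagnosis: str
    RCT: Optional[str] = None  # ISO 8601 creation timestamp, if known


Selector = Callable[[list[Rec]], Optional[Rec]]


def pick_per_target(records: Iterable[Rec], selector: Selector) -> Iterator[Rec]:
    """Group records by target RID and yield the selector's choice per group."""
    groups: dict[str, list[Rec]] = defaultdict(list)
    for rec in records:
        groups[rec.Image].append(rec)
    for group in groups.values():
        chosen = selector(group)
        if chosen is not None:  # a None result means "skip this target"
            yield chosen


def newest(group: list[Rec]) -> Rec:
    # Treat a missing RCT as the empty string so it sorts before any timestamp.
    return max(group, key=lambda r: r.RCT or "")


records = [
    Rec("IMG-1", "benign", "2024-01-05T10:00:00+00:00"),
    Rec("IMG-1", "malignant", "2024-02-01T09:30:00+00:00"),
    Rec("IMG-2", "benign", None),
]
picked = list(pick_per_target(records, newest))
```

Any callable with the same shape (a list of records in, one record or None out) can be swapped in for `newest`.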
Attributes:
| Name | Type | Description |
|---|---|---|
| Execution | Optional[str] | RID of the execution that created this feature record. Links to the … |
| Feature_Name | str | Name of the feature this record belongs to. |
| RCT | Optional[str] | Row Creation Time, an ISO 8601 timestamp string (e.g., … |
| feature | ClassVar[Optional[Feature]] | Reference to the Feature definition object. Set automatically by … |
Source code in src/deriva_ml/feature.py
asset_columns
classmethod
asset_columns() -> set[Column]
Return columns that reference asset tables.
Asset columns are FK columns whose referent table is classified as an
asset table (e.g., Image, Scan). In a generated
FeatureRecord subclass these fields accept str | Path values.
Returns:
| Type | Description |
|---|---|
| set[Column] | set[Column]: ERMrest … asset tables. A subset of … |
Note
Only available on a class returned by Feature.feature_record_class().
Calling this on the FeatureRecord base class (where feature
is None) raises AttributeError.
Source code in src/deriva_ml/feature.py
feature_columns
classmethod
feature_columns() -> set[Column]
Return all columns specific to this feature.
Returns the full set of feature-specific columns — the union of
asset_columns, term_columns, and value_columns. System
columns (RID, RCT, RMT, RCB, RMB) and structural
association columns (Feature_Name, the target-table FK, and
Execution) are excluded.
Returns:
| Type | Description |
|---|---|
| set[Column] | set[Column]: Feature-specific ERMrest … to … |
Note
Only available on a class returned by Feature.feature_record_class().
Calling this on the FeatureRecord base class (where feature
is None) raises AttributeError.
Source code in src/deriva_ml/feature.py
select_by_execution
staticmethod
select_by_execution(execution_rid: str)
Return a selector that picks the newest record from a specific execution.
Creates a selector function that filters records to those produced by the given execution, then returns the newest match by RCT. This is useful when multiple executions have produced values for the same feature and you want results from a specific run.
Unlike select_by_workflow (a factory that resolves the workflow's
execution set from a container), this selector filters on a known
Execution RID with no container dependency and can be passed
directly as the selector argument to feature_values:
    for rec in ml.feature_values(
        "Image",
        feature_name="FooBar",
        selector=FeatureRecord.select_by_execution("3WY2"),
    ):
        ...
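The filter-then-pick-newest behaviour can be sketched as a closure. This is a minimal illustration rather than the library's code: `Rec` and `by_execution` are hypothetical, and `LookupError` stands in for the `DerivaMLException` that the real selector is documented to raise.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rec:
    """Hypothetical stand-in for a generated FeatureRecord subclass."""
    Label: str
    Execution: Optional[str] = None
    RCT: Optional[str] = None


def by_execution(execution_rid: str) -> Callable[[list[Rec]], Rec]:
    """Return a selector that keeps only records from one execution."""

    def selector(records: list[Rec]) -> Rec:
        matches = [r for r in records if r.Execution == execution_rid]
        if not matches:
            # Stand-in for DerivaMLException in the real library.
            raise LookupError(f"no record from execution {execution_rid!r}")
        # Newest match by RCT; a missing RCT sorts before any timestamp.
        return max(matches, key=lambda r: r.RCT or "")

    return selector


group = [
    Rec("cat", Execution="3WY2", RCT="2024-01-01T00:00:00+00:00"),
    Rec("dog", Execution="3WY2", RCT="2024-06-01T00:00:00+00:00"),
    Rec("bird", Execution="9ABC", RCT="2024-12-01T00:00:00+00:00"),
]
chosen = by_execution("3WY2")(group)
```

Note that the record from execution `9ABC` is newer overall but is ignored, because filtering happens before the recency comparison.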
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| execution_rid | str | RID of the execution to filter by. | required |
Returns:
| Type | Description |
|---|---|
| | A selector function … suitable for use as the … |
Raises:
| Type | Description |
|---|---|
| DerivaMLException | If no records in the group match the given execution RID. |
Examples:
Select values from a specific execution:
    >>> for rec in ml.feature_values(  # doctest: +SKIP
    ...     "Image",
    ...     feature_name="Classification",
    ...     selector=FeatureRecord.select_by_execution("3WY2"),
    ... ):
    ...     print(rec)
Source code in src/deriva_ml/feature.py
select_by_workflow
classmethod
select_by_workflow(
workflow: str, *, container
) -> Callable[
[list[FeatureRecord]],
FeatureRecord | None,
]
Return a selector that picks the newest record from a specific workflow.
Creates a selector function that filters records to those produced by
executions of the given workflow, then returns the newest match by RCT.
This is the recommended replacement for the retired
DerivaML.select_by_workflow(records, workflow) method.
Unlike select_by_execution, which requires knowing a specific
execution RID, this selector works at the workflow level — it accepts
any record produced by any execution of the named workflow.
Eager resolution: the workflow's execution list is resolved once
at factory-construction time by calling
container.list_workflow_executions(workflow). Unknown-workflow
errors therefore surface immediately (at factory-call time), not
lazily during iteration.
None return semantics: when no record in a group matches the
workflow, the selector returns None. feature_values treats
None as "feature absent for this target RID" and omits the target
from the iterator. This is distinct from select_by_execution, which
raises on no-match.
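The eager-resolution and None-return behaviour described above can be sketched as follows. Several things here are assumptions made for illustration: `FakeContainer` is a hypothetical in-memory container, `list_workflow_executions` is assumed to return execution RIDs, and `KeyError` stands in for the library's unknown-workflow error.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rec:
    """Hypothetical stand-in for a generated FeatureRecord subclass."""
    Label: str
    Execution: Optional[str] = None
    RCT: Optional[str] = None


class FakeContainer:
    """Hypothetical container mapping workflow names to execution RIDs."""

    def __init__(self, workflows: dict[str, list[str]]) -> None:
        self._workflows = workflows

    def list_workflow_executions(self, workflow: str) -> list[str]:
        # Assumed to return execution RIDs; raises for an unknown workflow.
        if workflow not in self._workflows:
            raise KeyError(f"unknown workflow {workflow!r}")
        return self._workflows[workflow]


def by_workflow(workflow: str, *, container) -> Callable[[list[Rec]], Optional[Rec]]:
    # Eager resolution: the lookup happens here, when the factory is called.
    execution_rids = set(container.list_workflow_executions(workflow))

    def selector(records: list[Rec]) -> Optional[Rec]:
        matches = [r for r in records if r.Execution in execution_rids]
        if not matches:
            return None  # "feature absent for this target"
        return max(matches, key=lambda r: r.RCT or "")

    return selector


container = FakeContainer({"Training_v2": ["EX-1", "EX-2"]})
selector = by_workflow("Training_v2", container=container)
hit = selector([
    Rec("a", "EX-1", "2024-01-01T00:00:00+00:00"),
    Rec("b", "EX-2", "2024-03-01T00:00:00+00:00"),
    Rec("c", "EX-9", "2024-09-01T00:00:00+00:00"),
])
miss = selector([Rec("c", "EX-9", "2024-09-01T00:00:00+00:00")])
```

Resolving the execution set once in the factory means an unknown workflow fails at the call to `by_workflow`, before any records are iterated.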
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| workflow | str | Name (or RID) of the workflow to filter by. Must be a workflow known to … | required |
| container | | Required keyword-only argument. An object that implements … | required |
Returns:
| Type | Description |
|---|---|
| Callable[[list[FeatureRecord]], FeatureRecord \| None] | A selector callable … suitable for use as the … when no record in the group matches the workflow; returns the newest matching record (by RCT) otherwise. |
Raises:
| Type | Description |
|---|---|
| DerivaMLException | If … |
| TypeError | If … |
Example
Select Glaucoma labels produced by a specific training workflow:
    >>> selector = FeatureRecord.select_by_workflow(  # doctest: +SKIP
    ...     "Glaucoma_Training_v2", container=ml
    ... )
    >>> for rec in ml.feature_values(  # doctest: +SKIP
    ...     "Image", "Glaucoma", selector=selector
    ... ):
    ...     print(f"{rec.Image}: {rec.Glaucoma}")
Works identically on a downloaded bag (offline):
    >>> selector = FeatureRecord.select_by_workflow(  # doctest: +SKIP
    ...     "Glaucoma_Training_v2", container=bag
    ... )
    >>> labels = list(bag.feature_values("Image", "Glaucoma", selector=selector))  # doctest: +SKIP
Source code in src/deriva_ml/feature.py
select_first
staticmethod
select_first(
records: list[FeatureRecord],
) -> FeatureRecord
Select the feature record with the earliest creation time.
Uses the RCT (Row Creation Time) field. Records with None RCT
are treated as older than any timestamped record, because a missing
RCT is compared as an empty string, which sorts before any ISO 8601
timestamp.
Useful when you want to preserve the original annotation and ignore later revisions.
This method is designed to be passed directly as the selector
argument to feature_values:
    for rec in ml.feature_values("Image", selector=FeatureRecord.select_first):
        ...
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| records | list[FeatureRecord] | List of FeatureRecord instances for the same target object. Must be non-empty. | required |
Returns:
| Type | Description |
|---|---|
| FeatureRecord | The FeatureRecord with the earliest RCT value. |
Source code in src/deriva_ml/feature.py
select_latest
staticmethod
select_latest(
records: list[FeatureRecord],
) -> FeatureRecord
Select the most recently created feature record.
Alias for select_newest. Included for API symmetry with
select_first.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| records | list[FeatureRecord] | List of FeatureRecord instances for the same target object. Must be non-empty. | required |
Returns:
| Type | Description |
|---|---|
| FeatureRecord | The FeatureRecord with the latest RCT value. |
Source code in src/deriva_ml/feature.py
select_majority_vote
classmethod
select_majority_vote(
column: str | None = None,
)
Return a selector that picks the most common value for a column.
Creates a selector function that counts the values of the specified column across all records, picks the most frequent one, and breaks ties by most recent RCT.
For single-term features, the column can be auto-detected from the feature's metadata. For multi-term features, the column must be specified explicitly.
This is useful for consensus labeling, where multiple annotators have labeled the same record and you want the majority opinion:
    selector = RecordClass.select_majority_vote()
    for rec in ml.feature_values(
        "Image",
        feature_name="Diagnosis",
        selector=selector,
    ):
        ...
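One way to read the counting and tie-breaking rule is sketched below. It is an interpretation, not the library's implementation: among the records whose value is tied for most frequent, it returns the one with the most recent RCT. `Rec` and `majority_vote` are hypothetical names, and the column must be named explicitly because this sketch has no feature metadata to auto-detect from.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rec:
    """Hypothetical stand-in for a generated FeatureRecord subclass."""
    Diagnosis: str
    RCT: Optional[str] = None


def majority_vote(column: str) -> Callable[[list[Rec]], Rec]:
    """Return a selector that picks a record carrying the most common value."""

    def selector(records: list[Rec]) -> Rec:
        counts = Counter(getattr(r, column) for r in records)
        top = max(counts.values())
        # Every value that reaches the top count is a candidate winner.
        winners = {value for value, n in counts.items() if n == top}
        candidates = [r for r in records if getattr(r, column) in winners]
        # Ties are broken by recency; a missing RCT sorts before any timestamp.
        return max(candidates, key=lambda r: r.RCT or "")

    return selector


select = majority_vote("Diagnosis")
clear = select([
    Rec("benign", "2024-01-01T00:00:00+00:00"),
    Rec("benign", "2024-02-01T00:00:00+00:00"),
    Rec("malignant", "2024-03-01T00:00:00+00:00"),
])
tied = select([
    Rec("benign", "2024-01-01T00:00:00+00:00"),
    Rec("malignant", "2024-03-01T00:00:00+00:00"),
])
```

In the first group "benign" wins two to one, so the newer of the two "benign" records is returned; in the second group the counts tie and recency decides.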
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| column | str \| None | Name of the column to count values for. If None, auto-detects the feature's single term column from feature metadata. | None |
Returns:
| Type | Description |
|---|---|
| | A selector function … |
Raises:
| Type | Description |
|---|---|
| DerivaMLException | If column is None and the feature has no term columns or multiple term columns. |
Source code in src/deriva_ml/feature.py
select_newest
staticmethod
select_newest(
records: list[FeatureRecord],
) -> FeatureRecord
Select the feature record with the most recent creation time.
Uses the RCT (Row Creation Time) field to determine recency. RCT is
an ISO 8601 timestamp string, so lexicographic comparison correctly
identifies the most recent record. Records with None RCT are
treated as older than any timestamped record.
This method is designed to be passed directly as the selector
argument to feature_values:
    for rec in ml.feature_values("Image", selector=FeatureRecord.select_newest):
        ...
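The claim that string comparison orders ISO 8601 timestamps correctly can be checked with the standard library. The sketch below assumes all timestamps share the same format and UTC offset, which is the condition under which lexicographic order matches chronological order; the sample values and the helper names `newest` and `first` are made up for illustration.

```python
from datetime import datetime
from typing import Optional

stamps = [
    "2024-03-09T08:15:00+00:00",
    "2024-11-02T17:40:00+00:00",
    "2023-12-31T23:59:59+00:00",
]

# Plain string comparison and real datetime comparison agree on the maximum.
by_string = max(stamps)
by_datetime = max(stamps, key=datetime.fromisoformat)


def newest(rcts: list[Optional[str]]) -> Optional[str]:
    # A missing RCT is compared as "", which sorts before any timestamp.
    return max(rcts, key=lambda s: s or "")


def first(rcts: list[Optional[str]]) -> Optional[str]:
    # The same key makes a missing RCT the earliest value.
    return min(rcts, key=lambda s: s or "")


mixed = [None, "2023-12-31T23:59:59+00:00"]
```

With the `mixed` list, `newest` returns the timestamped value while `first` returns None, which matches the "older than any timestamped record" rule.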
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| records | list[FeatureRecord] | List of FeatureRecord instances for the same target object. Must be non-empty. | required |
Returns:
| Type | Description |
|---|---|
| FeatureRecord | The FeatureRecord with the latest RCT value. |
Source code in src/deriva_ml/feature.py
term_columns
classmethod
term_columns() -> set[Column]
Return columns that reference controlled vocabulary terms.
Term columns are FK columns whose referent table is classified as a
vocabulary table. In a generated FeatureRecord subclass these
fields accept str values (the term name, not the RID).
Returns:
| Type | Description |
|---|---|
| set[Column] | set[Column]: ERMrest … vocabulary tables. A subset of … |
Note
Only available on a class returned by Feature.feature_record_class().
Calling this on the FeatureRecord base class (where feature
is None) raises AttributeError.
Source code in src/deriva_ml/feature.py
value_columns
classmethod
value_columns() -> set[Column]
Return columns that contain direct (non-FK) values.
Value columns hold scalar data — integers, floats, booleans, or text
— rather than FK references to other tables. In a generated
FeatureRecord subclass these fields are typed according to the
ERMrest column type (int, float, bool, or str).
Returns:
| Type | Description |
|---|---|
| set[Column] | set[Column]: ERMrest … values. Computed as feature_columns() - asset_columns() - term_columns(). |
Note
Only available on a class returned by Feature.feature_record_class().
Calling this on the FeatureRecord base class (where feature
is None) raises AttributeError.
Source code in src/deriva_ml/feature.py