
Feature Classes

Feature management for ML experiments. Features represent measurable properties or characteristics that can be attached to domain entities and tracked across executions.

Feature implementation for deriva-ml.

This module provides classes for defining and managing features in deriva-ml. Features represent measurable properties or characteristics associated with records in a target table (e.g., a diagnostic label on an Image row).

Exported classes

  • Feature: Encapsulates a feature's schema: target table, vocabulary columns, asset columns, and value columns. Obtained via DerivaML.create_feature or DerivaML.lookup_feature. Not constructed directly.
  • FeatureRecord: Pydantic base class for dynamically generated feature record models. Subclasses are created by Feature.feature_record_class().

Selector classmethod suite (FeatureRecord class methods):

  • FeatureRecord.select_newest(records): Returns the record with the most recent RCT (Row Creation Time). Useful when multiple annotators have labelled the same object.
  • FeatureRecord.select_first(records): Returns the record with the earliest RCT. Useful to preserve the original annotation.
  • FeatureRecord.select_latest(records): Alias for select_newest.
  • FeatureRecord.select_by_execution(execution_rid): Returns a selector that picks the newest record from a specific execution run.
  • FeatureRecord.select_by_workflow(workflow, *, container): Returns a selector that picks the newest record from any execution of the named workflow. Resolves the execution list eagerly at construction time.
  • FeatureRecord.select_majority_vote(column): Returns a selector that picks the most common value for a column (consensus labeling).
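The suite mixes two shapes: plain selectors that take the record group directly (such as select_newest) and factories that take a parameter and return a selector (such as select_by_execution). The sketch below illustrates the difference with a hypothetical `Rec` dataclass standing in for FeatureRecord; `LookupError` stands in for DerivaMLException, and none of these names are the deriva-ml implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rec:
    """Hypothetical stand-in for a FeatureRecord (not the deriva-ml class)."""
    Execution: Optional[str]
    RCT: Optional[str]
    Diagnosis: str


def select_newest(records: list[Rec]) -> Rec:
    # Plain selector: receives the group and returns one record.
    # A missing RCT sorts as "", i.e. older than any timestamp.
    return max(records, key=lambda r: r.RCT or "")


def select_by_execution(execution_rid: str) -> Callable[[list[Rec]], Rec]:
    # Selector factory: captures execution_rid and returns a plain selector.
    def _selector(records: list[Rec]) -> Rec:
        matches = [r for r in records if r.Execution == execution_rid]
        if not matches:
            # Stand-in for the DerivaMLException raised by the real method.
            raise LookupError(f"No feature records match execution '{execution_rid}'.")
        return select_newest(matches)

    return _selector


group = [
    Rec("3WY2", "2024-06-15T10:30:00.000000+00:00", "benign"),
    Rec("3WY2", "2024-06-16T08:00:00.000000+00:00", "malignant"),
    Rec("4AB0", "2024-06-17T09:00:00.000000+00:00", "benign"),
]

newest_overall = select_newest(group)                 # record from execution 4AB0
newest_from_run = select_by_execution("3WY2")(group)  # newest record within 3WY2
print(newest_overall.Execution, newest_from_run.Diagnosis)
```

Because a factory returns an ordinary callable, its result can be used anywhere a plain selector is accepted.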

Typical usage

feature = ml.lookup_feature("Image", "Diagnosis")  # doctest: +SKIP
DiagnosisRecord = feature.feature_record_class()  # doctest: +SKIP
record = DiagnosisRecord(Diagnosis="benign", Confidence=0.97)  # doctest: +SKIP

Feature

Manages feature definitions and their relationships in the catalog.

A Feature represents a measurable property or characteristic that can be associated with records in a table. Features can include asset references, controlled vocabulary terms, and custom metadata fields.

Attributes:

| Name | Type | Description |
|---|---|---|
| feature_table | Table | Table containing the feature implementation. |
| target_table | Table | Table that the feature is associated with. |
| feature_name | str | Name of the feature (from Feature_Name column default). |
| feature_columns | set[Column] | All columns specific to this feature. |
| asset_columns | set[Column] | Columns referencing asset tables. |
| term_columns | set[Column] | Columns referencing vocabulary tables. |
| value_columns | set[Column] | Columns containing direct values (not FK references). |

Example

feature = ml.lookup_feature("Image", "Diagnosis")  # doctest: +SKIP
print(f"Feature {feature.feature_name} on {feature.target_table.name}")  # doctest: +SKIP
print("Asset columns:", [c.name for c in feature.asset_columns])  # doctest: +SKIP

Source code in src/deriva_ml/feature.py
class Feature:
    """Manages feature definitions and their relationships in the catalog.

    A Feature represents a measurable property or characteristic that can be associated with records in a table.
    Features can include asset references, controlled vocabulary terms, and custom metadata fields.

    Attributes:
        feature_table (Table): Table containing the feature implementation.
        target_table (Table): Table that the feature is associated with.
        feature_name (str): Name of the feature (from Feature_Name column default).
        feature_columns (set[Column]): All columns specific to this feature.
        asset_columns (set[Column]): Columns referencing asset tables.
        term_columns (set[Column]): Columns referencing vocabulary tables.
        value_columns (set[Column]): Columns containing direct values (not FK references).

    Example:
        >>> feature = ml.lookup_feature("Image", "Diagnosis")  # doctest: +SKIP
        >>> print(f"Feature {feature.feature_name} on {feature.target_table.name}")  # doctest: +SKIP
        >>> print("Asset columns:", [c.name for c in feature.asset_columns])  # doctest: +SKIP
    """

    def __init__(self, atable: FindAssociationResult, model: "DerivaModel") -> None:
        """Initialize a Feature from an association table result.

        Classifies the feature table's columns into three disjoint sets:
        ``asset_columns`` (FK to an asset table), ``term_columns`` (FK to a
        vocabulary table), and ``value_columns`` (everything else). The
        association FKs linking back to the target table, the feature name
        vocabulary, and the execution are excluded before classification.

        Args:
            atable: Result from ``deriva.core.ermrest_model.FindAssociationResult``
                describing the feature association table. Provides the feature
                table, the self-FK back to the target, and the set of other FKs.
            model: ``DerivaModel`` instance used to classify FK targets as
                asset or vocabulary tables.

        Note:
            This constructor is not part of the public API. Obtain ``Feature``
            instances via ``DerivaML.create_feature`` or
            ``DerivaML.lookup_feature``.
        """
        self.feature_table = atable.table
        self.target_table = atable.self_fkey.pk_table
        self.feature_name = atable.table.columns["Feature_Name"].default
        self._model = model

        skip_columns = {
            "RID",
            "RMB",
            "RCB",
            "RCT",
            "RMT",
            "Feature_Name",
            self.target_table.name,
            "Execution",
        }
        self.feature_columns = {c for c in self.feature_table.columns if c.name not in skip_columns}

        # Exclude the FKs that are structural parts of the association table
        # itself — the self-FK pointing back to the target table (e.g., Image)
        # and the other-FKs pointing to Feature_Name and Execution — before
        # classifying the remaining FKs as asset, term, or value columns. Without
        # this subtraction, those structural FKs would be misclassified as feature
        # columns and create spurious fields in the generated FeatureRecord class.
        assoc_fkeys = {atable.self_fkey} | atable.other_fkeys

        # Determine the role of each column in the feature outside the FK columns.
        self.asset_columns = {
            fk.foreign_key_columns[0]
            for fk in self.feature_table.foreign_keys
            if fk not in assoc_fkeys and self._model.is_asset(fk.pk_table)
        }

        self.term_columns = {
            fk.foreign_key_columns[0]
            for fk in self.feature_table.foreign_keys
            if fk not in assoc_fkeys and self._model.is_vocabulary(fk.pk_table)
        }

        self.value_columns = self.feature_columns - (self.asset_columns | self.term_columns)

    def feature_record_class(self) -> type[FeatureRecord]:
        """Create a dynamically generated Pydantic model class for this feature.

        Builds a ``FeatureRecord`` subclass with fields derived from the feature
        table's columns. Column types are mapped as follows:

        - Term columns (FK to vocabulary): ``str`` (vocabulary term name)
        - Asset columns (FK to asset table): ``str | Path`` (file path)
        - Value columns (direct data): typed per the database column type
          (``int``, ``float``, ``bool``, or ``str``)

        Nullable feature-specific fields are typed ``Optional``. Each field
        defaults to the column's default value, or ``None`` when the column
        has no default, to allow partial construction when building records
        for insertion. The ``Feature_Name`` field defaults to this feature's
        name.

        Args:
            self: The ``Feature`` instance whose schema drives field generation.

        Returns:
            A subclass of ``FeatureRecord`` whose fields match this feature's
            schema. The class's ``feature`` ClassVar is set to ``self``.

        Raises:
            DerivaMLException: If the feature table schema cannot be read.

        Example:
            >>> feature = ml.lookup_feature("Image", "Diagnosis")  # doctest: +SKIP
            >>> DiagnosisRecord = feature.feature_record_class()  # doctest: +SKIP
            >>> rec = DiagnosisRecord(Diagnosis="benign")  # doctest: +SKIP
        """

        def map_type(c: Column) -> UnionType | Type[str] | Type[int] | Type[float] | Type[bool]:
            """Maps a Deriva column type to a Python/pydantic type.

            Converts ERMrest column types to appropriate Python types for use in pydantic models.
            Special handling is provided for asset columns which can accept either strings or Path objects.

            Args:
                c: ERMrest column to map to a Python type.

            Returns:
                UnionType | Type[str] | Type[int] | Type[float] | Type[bool]: Appropriate Python type for the column:
                    - str | Path for asset columns
                    - str for text columns
                    - int for integer columns
                    - float for floating point columns
                    - bool for boolean columns
                    - str for all other types

            Example:
                >>> col = Column(name="score", type="float4")
                >>> typ = map_type(col)  # Returns float
            """
            if c.name in {c.name for c in self.asset_columns}:
                return str | Path

            match c.type.typename:
                case "text":
                    return str
                case "int2" | "int4" | "int8":
                    return int
                case "float4" | "float8":
                    return float
                case "boolean":
                    return bool
                case _:
                    return str

        featureclass_name = f"{self.target_table.name}Feature{self.feature_name}"

        # Create feature class. To do this, we must determine the python type for each column and also if the
        # column is optional or not based on its nullability.
        feature_columns = {
            c.name: (
                Optional[map_type(c)] if c.nullok else map_type(c),
                c.default or None,
            )
            for c in self.feature_columns
        } | {
            "Feature_Name": (
                str,
                self.feature_name,
            ),  # Set default value for Feature_Name
            self.target_table.name: (str, ...),
        }
        docstring = (
            f"Class to capture fields in a feature {self.feature_name} on table {self.target_table}. "
            "Feature columns include:\n"
        )
        docstring += "\n".join([f"    {c.name}" for c in self.feature_columns])

        model = create_model(
            featureclass_name,
            __base__=FeatureRecord,
            __doc__=docstring,
            **feature_columns,
        )
        model.feature = self  # Set value of class variable within the feature class definition.

        return model

    def __repr__(self) -> str:
        return (
            f"Feature(target_table={self.target_table.name}, feature_name={self.feature_name}, "
            f"feature_table={self.feature_table.name})"
        )

__init__

__init__(
    atable: FindAssociationResult,
    model: DerivaModel,
) -> None

Initialize a Feature from an association table result.

Classifies the feature table's columns into three disjoint sets: asset_columns (FK to an asset table), term_columns (FK to a vocabulary table), and value_columns (everything else). The association FKs linking back to the target table, the feature name vocabulary, and the execution are excluded before classification.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| atable | FindAssociationResult | Result from deriva.core.ermrest_model.FindAssociationResult describing the feature association table. Provides the feature table, the self-FK back to the target, and the set of other FKs. | required |
| model | DerivaModel | DerivaModel instance used to classify FK targets as asset or vocabulary tables. | required |

Note

This constructor is not part of the public API. Obtain Feature instances via DerivaML.create_feature or DerivaML.lookup_feature.
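The classification can be sketched with plain sets. Everything in the block below is hypothetical: the `FK` dataclass, the table and column names, and the `asset_tables` / `vocabulary_tables` sets that stand in for the model's asset and vocabulary checks. It only mirrors the set arithmetic described above and is not the deriva-ml implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FK:
    """Hypothetical stand-in for a foreign key: one column pointing at a table."""
    column: str
    pk_table: str


# Hypothetical feature table for a "Diagnosis" feature on target table "Image".
all_columns = {
    "RID", "RCT", "RMT", "RCB", "RMB", "Feature_Name", "Image", "Execution",
    "Diagnosis", "Heatmap", "Confidence",
}
foreign_keys = {
    FK("Image", "Image"),
    FK("Feature_Name", "Feature_Name"),
    FK("Execution", "Execution"),
    FK("Diagnosis", "Diagnosis_Type"),
    FK("Heatmap", "Heatmap_Asset"),
}

# Structural FKs of the association itself; these never become feature fields.
assoc_fkeys = {fk for fk in foreign_keys if fk.column in {"Image", "Feature_Name", "Execution"}}

# Stand-ins for the model's asset and vocabulary table checks.
asset_tables = {"Heatmap_Asset"}
vocabulary_tables = {"Diagnosis_Type", "Feature_Name"}

# System columns and association columns are skipped before classification.
skip_columns = {"RID", "RMB", "RCB", "RCT", "RMT", "Feature_Name", "Image", "Execution"}
feature_columns = all_columns - skip_columns

asset_columns = {fk.column for fk in foreign_keys - assoc_fkeys if fk.pk_table in asset_tables}
term_columns = {fk.column for fk in foreign_keys - assoc_fkeys if fk.pk_table in vocabulary_tables}
value_columns = feature_columns - (asset_columns | term_columns)

print(sorted(asset_columns), sorted(term_columns), sorted(value_columns))
```

Note that Feature_Name points at a vocabulary table in this sketch, so without the `assoc_fkeys` subtraction it would be classified as a term column.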

Source code in src/deriva_ml/feature.py
def __init__(self, atable: FindAssociationResult, model: "DerivaModel") -> None:
    """Initialize a Feature from an association table result.

    Classifies the feature table's columns into three disjoint sets:
    ``asset_columns`` (FK to an asset table), ``term_columns`` (FK to a
    vocabulary table), and ``value_columns`` (everything else). The
    association FKs linking back to the target table, the feature name
    vocabulary, and the execution are excluded before classification.

    Args:
        atable: Result from ``deriva.core.ermrest_model.FindAssociationResult``
            describing the feature association table. Provides the feature
            table, the self-FK back to the target, and the set of other FKs.
        model: ``DerivaModel`` instance used to classify FK targets as
            asset or vocabulary tables.

    Note:
        This constructor is not part of the public API. Obtain ``Feature``
        instances via ``DerivaML.create_feature`` or
        ``DerivaML.lookup_feature``.
    """
    self.feature_table = atable.table
    self.target_table = atable.self_fkey.pk_table
    self.feature_name = atable.table.columns["Feature_Name"].default
    self._model = model

    skip_columns = {
        "RID",
        "RMB",
        "RCB",
        "RCT",
        "RMT",
        "Feature_Name",
        self.target_table.name,
        "Execution",
    }
    self.feature_columns = {c for c in self.feature_table.columns if c.name not in skip_columns}

    # Exclude the FKs that are structural parts of the association table
    # itself — the self-FK pointing back to the target table (e.g., Image)
    # and the other-FKs pointing to Feature_Name and Execution — before
    # classifying the remaining FKs as asset, term, or value columns. Without
    # this subtraction, those structural FKs would be misclassified as feature
    # columns and create spurious fields in the generated FeatureRecord class.
    assoc_fkeys = {atable.self_fkey} | atable.other_fkeys

    # Determine the role of each column in the feature outside the FK columns.
    self.asset_columns = {
        fk.foreign_key_columns[0]
        for fk in self.feature_table.foreign_keys
        if fk not in assoc_fkeys and self._model.is_asset(fk.pk_table)
    }

    self.term_columns = {
        fk.foreign_key_columns[0]
        for fk in self.feature_table.foreign_keys
        if fk not in assoc_fkeys and self._model.is_vocabulary(fk.pk_table)
    }

    self.value_columns = self.feature_columns - (self.asset_columns | self.term_columns)

feature_record_class

feature_record_class() -> type[FeatureRecord]

Create a dynamically generated Pydantic model class for this feature.

Builds a FeatureRecord subclass with fields derived from the feature table's columns. Column types are mapped as follows:

  • Term columns (FK to vocabulary): str (vocabulary term name)
  • Asset columns (FK to asset table): str | Path (file path)
  • Value columns (direct data): typed per the database column type (int, float, bool, or str)

Nullable feature-specific fields are typed Optional. Each field defaults to the column's default value, or None when the column has no default, to allow partial construction when building records for insertion. The Feature_Name field defaults to this feature's name.
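The column-to-field mapping can be sketched as a small lookup that produces the `(annotation, default)` tuples pydantic's `create_model` accepts as field definitions. The helper name `field_spec` and its parameters are hypothetical, and `Union[str, Path]` is used in place of `str | Path` so the sketch also runs on older Python versions.

```python
from pathlib import Path
from typing import Optional, Union

# ERMrest typenames handled explicitly; anything else falls back to str.
TYPE_MAP = {
    "text": str,
    "int2": int, "int4": int, "int8": int,
    "float4": float, "float8": float,
    "boolean": bool,
}


def field_spec(typename, nullok, default=None, is_asset=False):
    """Return an (annotation, default) tuple for one feature column (hypothetical helper)."""
    # Asset columns accept either a string or a Path, regardless of column type.
    py_type = Union[str, Path] if is_asset else TYPE_MAP.get(typename, str)
    # Only nullable columns are wrapped in Optional.
    annotation = Optional[py_type] if nullok else py_type
    # "default or None" maps every falsy default (0, False, "") to None.
    return annotation, (default or None)


confidence = field_spec("float4", nullok=True)
count = field_spec("int8", nullok=False, default=3)
payload = field_spec("jsonb", nullok=True)             # unknown typename falls back to str
heatmap = field_spec("text", nullok=True, is_asset=True)
zero_default = field_spec("int4", nullok=False, default=0)
print(confidence, count, payload, heatmap, zero_default)
```

One consequence of the `default or None` expression is that a column whose default is `0` or `False` ends up with a field default of `None`, which may or may not be intended.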

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| self | | The Feature instance whose schema drives field generation. | required |

Returns:

| Type | Description |
|---|---|
| type[FeatureRecord] | A subclass of FeatureRecord whose fields match this feature's schema. The class's feature ClassVar is set to self. |

Raises:

| Type | Description |
|---|---|
| DerivaMLException | If the feature table schema cannot be read. |

Example

feature = ml.lookup_feature("Image", "Diagnosis")  # doctest: +SKIP
DiagnosisRecord = feature.feature_record_class()  # doctest: +SKIP
rec = DiagnosisRecord(Diagnosis="benign")  # doctest: +SKIP

Source code in src/deriva_ml/feature.py
def feature_record_class(self) -> type[FeatureRecord]:
    """Create a dynamically generated Pydantic model class for this feature.

    Builds a ``FeatureRecord`` subclass with fields derived from the feature
    table's columns. Column types are mapped as follows:

    - Term columns (FK to vocabulary): ``str`` (vocabulary term name)
    - Asset columns (FK to asset table): ``str | Path`` (file path)
    - Value columns (direct data): typed per the database column type
      (``int``, ``float``, ``bool``, or ``str``)

    Nullable feature-specific fields are typed ``Optional``. Each field
    defaults to the column's default value, or ``None`` when the column
    has no default, to allow partial construction when building records
    for insertion. The ``Feature_Name`` field defaults to this feature's
    name.

    Args:
        self: The ``Feature`` instance whose schema drives field generation.

    Returns:
        A subclass of ``FeatureRecord`` whose fields match this feature's
        schema. The class's ``feature`` ClassVar is set to ``self``.

    Raises:
        DerivaMLException: If the feature table schema cannot be read.

    Example:
        >>> feature = ml.lookup_feature("Image", "Diagnosis")  # doctest: +SKIP
        >>> DiagnosisRecord = feature.feature_record_class()  # doctest: +SKIP
        >>> rec = DiagnosisRecord(Diagnosis="benign")  # doctest: +SKIP
    """

    def map_type(c: Column) -> UnionType | Type[str] | Type[int] | Type[float] | Type[bool]:
        """Maps a Deriva column type to a Python/pydantic type.

        Converts ERMrest column types to appropriate Python types for use in pydantic models.
        Special handling is provided for asset columns which can accept either strings or Path objects.

        Args:
            c: ERMrest column to map to a Python type.

        Returns:
            UnionType | Type[str] | Type[int] | Type[float] | Type[bool]: Appropriate Python type for the column:
                - str | Path for asset columns
                - str for text columns
                - int for integer columns
                - float for floating point columns
                - bool for boolean columns
                - str for all other types

        Example:
            >>> col = Column(name="score", type="float4")
            >>> typ = map_type(col)  # Returns float
        """
        if c.name in {c.name for c in self.asset_columns}:
            return str | Path

        match c.type.typename:
            case "text":
                return str
            case "int2" | "int4" | "int8":
                return int
            case "float4" | "float8":
                return float
            case "boolean":
                return bool
            case _:
                return str

    featureclass_name = f"{self.target_table.name}Feature{self.feature_name}"

    # Create feature class. To do this, we must determine the python type for each column and also if the
    # column is optional or not based on its nullability.
    feature_columns = {
        c.name: (
            Optional[map_type(c)] if c.nullok else map_type(c),
            c.default or None,
        )
        for c in self.feature_columns
    } | {
        "Feature_Name": (
            str,
            self.feature_name,
        ),  # Set default value for Feature_Name
        self.target_table.name: (str, ...),
    }
    docstring = (
        f"Class to capture fields in a feature {self.feature_name} on table {self.target_table}. "
        "Feature columns include:\n"
    )
    docstring += "\n".join([f"    {c.name}" for c in self.feature_columns])

    model = create_model(
        featureclass_name,
        __base__=FeatureRecord,
        __doc__=docstring,
        **feature_columns,
    )
    model.feature = self  # Set value of class variable within the feature class definition.

    return model

FeatureRecord

Bases: BaseModel

Base class for dynamically generated feature record models.

This class serves as the base for pydantic models that represent feature records. Each feature record contains the values and metadata associated with a feature instance. Subclasses are created dynamically by Feature.feature_record_class() with fields corresponding to the feature's vocabulary terms, asset references, and metadata columns.

Feature records are returned by feature_values(). They can also be constructed manually and passed to Execution.add_features() to insert new values into the catalog.

Handling multiple values per target:

When the same target object (e.g., an Image) has multiple feature values, for example labels from different annotators or model runs, use a selector function to choose one. Pass it to feature_values. A selector receives a list of FeatureRecord instances for the same target and returns the selected one:

# Built-in: pick the most recently created record
for rec in ml.feature_values("Image", selector=FeatureRecord.select_newest):
    ...

# Custom: pick the record with highest confidence
def select_best(records):
    return max(records, key=lambda r: getattr(r, "Confidence", 0))

for rec in ml.feature_values("Image", selector=select_best):
    ...
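How a selector is applied per target can be sketched without a catalog. The `apply_selector` helper below is hypothetical and is not how feature_values is implemented; it only illustrates the contract described here: records are grouped by target, the selector sees one group at a time, and a `None` result drops that target. `SimpleNamespace` objects stand in for feature records.

```python
from collections import defaultdict
from types import SimpleNamespace


def apply_selector(records, target_column, selector):
    """Group records by target and yield the selector's pick for each group."""
    groups = defaultdict(list)
    for rec in records:
        groups[getattr(rec, target_column)].append(rec)
    for group in groups.values():
        chosen = selector(group)
        if chosen is not None:  # None means "feature absent for this target"
            yield chosen


def select_best(group):
    # Custom selector: pick the record with the highest confidence.
    return max(group, key=lambda r: getattr(r, "Confidence", 0))


records = [
    SimpleNamespace(Image="1-A", Diagnosis="benign", Confidence=0.60),
    SimpleNamespace(Image="1-A", Diagnosis="malignant", Confidence=0.91),
    SimpleNamespace(Image="1-B", Diagnosis="benign", Confidence=0.75),
]

chosen = list(apply_selector(records, "Image", select_best))
none_chosen = list(apply_selector(records, "Image", lambda group: None))
print([(r.Image, r.Diagnosis) for r in chosen], none_chosen)
```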

Attributes:

| Name | Type | Description |
|---|---|---|
| Execution | Optional[str] | RID of the execution that created this feature record. Links to the Execution table for provenance tracking; use this to trace which workflow run produced this value. |
| Feature_Name | str | Name of the feature this record belongs to. |
| RCT | Optional[str] | Row Creation Time, an ISO 8601 timestamp string (e.g., "2024-06-15T10:30:00.000000+00:00"). Populated automatically when reading from the catalog or a dataset bag. Used by select_newest to determine recency. |
| feature | ClassVar[Optional[Feature]] | Reference to the Feature definition object. Set automatically by feature_record_class() when the dynamic subclass is created. None on the base class. Provides access to the feature's column metadata, target table, and vocabulary/asset column sets. |
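Since RCT is stored as a string, recency is decided by string comparison. The sketch below, which uses made-up timestamps, shows why that works when all values share one format and UTC offset, how a missing RCT is ordered by the `or ""` fallback, and where string ordering and true chronological ordering can diverge.

```python
from datetime import datetime

rcts = [
    "2024-06-15T10:30:00.000000+00:00",
    "2024-06-15T09:59:59.999999+00:00",
    None,  # a record without an RCT
]

# Same format and offset: string order matches chronological order.
# The `or ""` fallback makes a missing RCT sort before every timestamp.
newest_by_string = max(rcts, key=lambda t: t or "")
newest_by_datetime = max((t for t in rcts if t), key=datetime.fromisoformat)

# Mixed offsets: 11:00+02:00 is 09:00 UTC, which is earlier than 10:30 UTC,
# yet it compares greater as a string.
a = "2024-06-15T10:30:00.000000+00:00"
b = "2024-06-15T11:00:00.000000+02:00"
string_says_b_newer = b > a
datetime_says_a_newer = datetime.fromisoformat(a) > datetime.fromisoformat(b)

print(newest_by_string == newest_by_datetime, string_says_b_newer, datetime_says_a_newer)
```

A custom selector that parses RCT with `datetime.fromisoformat` is one option if records with differing offsets are ever expected.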

Source code in src/deriva_ml/feature.py
class FeatureRecord(BaseModel):
    """Base class for dynamically generated feature record models.

    This class serves as the base for pydantic models that represent feature
    records. Each feature record contains the values and metadata associated
    with a feature instance. Subclasses are created dynamically by
    ``Feature.feature_record_class()`` with fields corresponding to the
    feature's vocabulary terms, asset references, and metadata columns.

    Feature records are returned by ``feature_values()``. They can also be
    constructed manually and passed to ``Execution.add_features()`` to insert
    new values into the catalog.

    **Handling multiple values per target:**

    When the same target object (e.g., an Image) has multiple feature values
    — for example, labels from different annotators or model runs — use
    a ``selector`` function to choose one. Pass it to ``feature_values``.
    A selector receives a list of FeatureRecord instances for the same target
    and returns the selected one::

        # Built-in: pick the most recently created record
        for rec in ml.feature_values("Image", selector=FeatureRecord.select_newest):
            ...

        # Custom: pick the record with highest confidence
        def select_best(records):
            return max(records, key=lambda r: getattr(r, "Confidence", 0))

        for rec in ml.feature_values("Image", selector=select_best):
            ...

    Attributes:
        Execution (Optional[str]): RID of the execution that created this
            feature record. Links to the ``Execution`` table for provenance
            tracking — use this to trace which workflow run produced this value.
        Feature_Name (str): Name of the feature this record belongs to.
        RCT (Optional[str]): Row Creation Time — an ISO 8601 timestamp string
            (e.g., ``"2024-06-15T10:30:00.000000+00:00"``). Populated
            automatically when reading from the catalog or a dataset bag.
            Used by ``select_newest`` to determine recency.
        feature (ClassVar[Optional[Feature]]): Reference to the Feature
            definition object. Set automatically by ``feature_record_class()``
            when the dynamic subclass is created. ``None`` on the base class.
            Provides access to the feature's column metadata, target table,
            and vocabulary/asset column sets.
    """

    # model_dump of this feature should be compatible with feature table columns.
    Execution: Optional[str] = None
    Feature_Name: str
    RCT: Optional[str] = None
    feature: ClassVar[Optional["Feature"]] = None

    class Config:
        arbitrary_types_allowed = True
        extra = "forbid"

    @staticmethod
    def select_newest(records: list["FeatureRecord"]) -> "FeatureRecord":
        """Select the feature record with the most recent creation time.

        Uses the RCT (Row Creation Time) field to determine recency. RCT is
        an ISO 8601 timestamp string, so lexicographic comparison identifies
        the most recent record as long as all timestamps share the same
        format and UTC offset. Records with ``None`` RCT are treated as
        older than any timestamped record.

        This method is designed to be passed directly as the ``selector``
        argument to ``feature_values``::

            for rec in ml.feature_values("Image", selector=FeatureRecord.select_newest):
                ...

        Args:
            records: List of FeatureRecord instances for the same target
                object. Must be non-empty.

        Returns:
            The FeatureRecord with the latest RCT value.
        """
        return max(records, key=lambda r: r.RCT or "")

    @staticmethod
    def select_by_execution(execution_rid: str):
        """Return a selector that picks the newest record from a specific execution.

        Creates a selector function that filters records to those produced by
        the given execution, then returns the newest match by RCT. This is
        useful when multiple executions have produced values for the same
        feature and you want results from a specific run.

        Unlike ``select_by_workflow`` (a factory that resolves the workflow's
        execution set from a container), this selector filters on a known
        ``Execution`` RID with no container dependency and can be passed
        directly as the ``selector`` argument to ``feature_values``::

            for rec in ml.feature_values(
                "Image",
                feature_name="FooBar",
                selector=FeatureRecord.select_by_execution("3WY2"),
            ):
                ...

        Args:
            execution_rid: RID of the execution to filter by.

        Returns:
            A selector function ``(list[FeatureRecord]) -> FeatureRecord``
            suitable for use as the ``selector=`` argument to ``feature_values``.

        Raises:
            DerivaMLException: If no records in the group match the
                given execution RID.

        Examples:
            Select values from a specific execution::

                >>> for rec in ml.feature_values(  # doctest: +SKIP
                ...     "Image",
                ...     feature_name="Classification",
                ...     selector=FeatureRecord.select_by_execution("3WY2"),
                ... ):
                ...     print(rec)
        """

        def _selector(records: list["FeatureRecord"]) -> "FeatureRecord":
            filtered = [r for r in records if r.Execution == execution_rid]
            if not filtered:
                from deriva_ml.core.exceptions import DerivaMLException

                raise DerivaMLException(
                    f"No feature records match execution '{execution_rid}'."
                )
            return FeatureRecord.select_newest(filtered)

        return _selector

    @classmethod
    def select_by_workflow(
        cls, workflow: str, *, container
    ) -> "Callable[[list[FeatureRecord]], FeatureRecord | None]":
        """Return a selector that picks the newest record from a specific workflow.

        Creates a selector function that filters records to those produced by
        executions of the given workflow, then returns the newest match by RCT.
        This is the recommended replacement for the retired
        ``DerivaML.select_by_workflow(records, workflow)`` method.

        Unlike ``select_by_execution``, which requires knowing a specific
        execution RID, this selector works at the workflow level — it accepts
        any record produced by any execution of the named workflow.

        **Eager resolution:** the workflow's execution list is resolved once
        at factory-construction time by calling
        ``container.list_workflow_executions(workflow)``. Unknown-workflow
        errors therefore surface immediately (at factory-call time), not
        lazily during iteration.

        **None return semantics:** when no record in a group matches the
        workflow, the selector returns ``None``. ``feature_values`` treats
        ``None`` as "feature absent for this target RID" and omits the target
        from the iterator. This is distinct from ``select_by_execution``, which
        raises on no-match.

        Args:
            workflow: Name (or RID) of the workflow to filter by. Must be a
                workflow known to ``container``; an unknown name raises
                ``DerivaMLException`` immediately.
            container: Required keyword-only argument. An object that
                implements ``list_workflow_executions(workflow) -> list[str]``.
                Typically a ``DerivaML``, ``Dataset``, or ``DatasetBag``
                instance. The container determines which executions are in
                scope (all catalog executions for ``DerivaML``; dataset-scoped
                executions for ``Dataset`` / ``DatasetBag``).

        Returns:
            A selector callable ``(list[FeatureRecord]) -> FeatureRecord | None``
            suitable for use as the ``selector=`` argument to
            ``feature_values``. Returns ``None``
            when no record in the group matches the workflow; returns the
            newest matching record (by RCT) otherwise.

        Raises:
            DerivaMLException: If ``workflow`` is not known to ``container``.
                Raised at factory-construction time (eager resolution).
            TypeError: If ``container`` is passed positionally (it is
                keyword-only).

        Example:
            Select Glaucoma labels produced by a specific training workflow::

                >>> selector = FeatureRecord.select_by_workflow(  # doctest: +SKIP
                ...     "Glaucoma_Training_v2", container=ml
                ... )
                >>> for rec in ml.feature_values(  # doctest: +SKIP
                ...     "Image", "Glaucoma", selector=selector
                ... ):
                ...     print(f"{rec.Image}: {rec.Glaucoma}")

            Works identically on a downloaded bag (offline)::

                >>> selector = FeatureRecord.select_by_workflow(  # doctest: +SKIP
                ...     "Glaucoma_Training_v2", container=bag
                ... )
                >>> labels = list(bag.feature_values("Image", "Glaucoma", selector=selector))  # doctest: +SKIP
        """
        # Eager resolution: fail fast on unknown workflow at construction time,
        # not lazily during iteration. Convert to a set for O(1) membership
        # testing inside the closure.
        execution_rids: set[str] = set(container.list_workflow_executions(workflow))

        def _selector(records: list["FeatureRecord"]) -> "FeatureRecord | None":
            matched = [r for r in records if r.Execution in execution_rids]
            if not matched:
                # Return None so feature_values omits this target RID silently.
                return None
            return FeatureRecord.select_newest(matched)

        return _selector

    @staticmethod
    def select_first(records: list["FeatureRecord"]) -> "FeatureRecord":
        """Select the feature record with the earliest creation time.

        Uses the RCT (Row Creation Time) field. Records with ``None`` RCT
        are treated as older than any timestamped record (since empty string
        sorts before any ISO 8601 timestamp).

        Useful when you want to preserve the original annotation and ignore
        later revisions.

        This method is designed to be passed directly as the ``selector``
        argument to ``feature_values``::

            for rec in ml.feature_values("Image", selector=FeatureRecord.select_first):
                ...

        Args:
            records: List of FeatureRecord instances for the same target
                object. Must be non-empty.

        Returns:
            The FeatureRecord with the earliest RCT value.
        """
        return min(records, key=lambda r: r.RCT or "")

    @staticmethod
    def select_latest(records: list["FeatureRecord"]) -> "FeatureRecord":
        """Select the most recently created feature record.

        Alias for ``select_newest``. Included for API symmetry with
        ``select_first``.

        Args:
            records: List of FeatureRecord instances for the same target
                object. Must be non-empty.

        Returns:
            The FeatureRecord with the latest RCT value.
        """
        return FeatureRecord.select_newest(records)

    @classmethod
    def select_majority_vote(cls, column: str | None = None):
        """Return a selector that picks the most common value for a column.

        Creates a selector function that counts the values of the specified
        column across all records, picks the most frequent one, and breaks
        ties by most recent RCT.

        For single-term features, the column can be auto-detected from the
        feature's metadata. For multi-term features, the column must be
        specified explicitly.

        This is useful for consensus labeling, where multiple annotators
        have labeled the same record and you want the majority opinion::

            selector = RecordClass.select_majority_vote()
            for rec in ml.feature_values(
                "Image",
                feature_name="Diagnosis",
                selector=selector,
            ):
                ...

        Args:
            column: Name of the column to count values for. If None, the
                column is auto-detected from feature metadata, which works
                only when the feature has exactly one term column.

        Returns:
            A selector function ``(list[FeatureRecord]) -> FeatureRecord``.

        Raises:
            DerivaMLException: If column is None and the feature has no
                term columns or multiple term columns.
        """

        def _selector(records: list["FeatureRecord"]) -> "FeatureRecord":
            col = column
            if col is None:
                # Auto-detect from feature metadata on the record class
                record_cls = type(records[0])
                if (
                    hasattr(record_cls, "feature")
                    and record_cls.feature
                    and record_cls.feature.term_columns
                ):
                    if len(record_cls.feature.term_columns) == 1:
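                        # term_columns is documented as a set, which cannot be
                        # indexed, so take its single element instead.
                        col = next(iter(record_cls.feature.term_columns)).name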
                    else:
                        from deriva_ml.core.exceptions import (
                            DerivaMLException,
                        )

                        raise DerivaMLException(
                            "select_majority_vote requires a column name for "
                            "features with multiple term columns. "
                            f"Available: {[c.name for c in record_cls.feature.term_columns]}"
                        )
                else:
                    from deriva_ml.core.exceptions import (
                        DerivaMLException,
                    )

                    raise DerivaMLException(
                        "select_majority_vote requires a column name — "
                        "could not auto-detect from feature metadata."
                    )

            from collections import Counter

            counts = Counter(getattr(r, col, None) for r in records)
            max_count = max(counts.values())
            majority_values = {v for v, c in counts.items() if c == max_count}
            candidates = [
                r for r in records if getattr(r, col, None) in majority_values
            ]
            return max(candidates, key=lambda r: r.RCT or "")

        return _selector

    @classmethod
    def feature_columns(cls) -> set[Column]:
        """Return all columns specific to this feature.

        Returns the full set of feature-specific columns — the union of
        ``asset_columns``, ``term_columns``, and ``value_columns``. System
        columns (``RID``, ``RCT``, ``RMT``, ``RCB``, ``RMB``) and structural
        association columns (``Feature_Name``, the target-table FK, and
        ``Execution``) are excluded.

        Returns:
            set[Column]: Feature-specific ERMrest ``Column`` objects. Equivalent
            to ``cls.feature.feature_columns``.

        Note:
            Only available on a class returned by ``Feature.feature_record_class()``.
            Calling this on the ``FeatureRecord`` base class (where ``feature``
            is ``None``) raises ``AttributeError``.
        """
        return cls.feature.feature_columns

    @classmethod
    def asset_columns(cls) -> set[Column]:
        """Return columns that reference asset tables.

        Asset columns are FK columns whose referent table is classified as an
        asset table (e.g., ``Image``, ``Scan``). In a generated
        ``FeatureRecord`` subclass these fields accept ``str | Path`` values.

        Returns:
            set[Column]: ERMrest ``Column`` objects that are FK references to
            asset tables. A subset of ``feature_columns()``.

        Note:
            Only available on a class returned by ``Feature.feature_record_class()``.
            Calling this on the ``FeatureRecord`` base class (where ``feature``
            is ``None``) raises ``AttributeError``.
        """
        return cls.feature.asset_columns

    @classmethod
    def term_columns(cls) -> set[Column]:
        """Return columns that reference controlled vocabulary terms.

        Term columns are FK columns whose referent table is classified as a
        vocabulary table. In a generated ``FeatureRecord`` subclass these
        fields accept ``str`` values (the term name, not the RID).

        Returns:
            set[Column]: ERMrest ``Column`` objects that are FK references to
            vocabulary tables. A subset of ``feature_columns()``.

        Note:
            Only available on a class returned by ``Feature.feature_record_class()``.
            Calling this on the ``FeatureRecord`` base class (where ``feature``
            is ``None``) raises ``AttributeError``.
        """
        return cls.feature.term_columns

    @classmethod
    def value_columns(cls) -> set[Column]:
        """Return columns that contain direct (non-FK) values.

        Value columns hold scalar data — integers, floats, booleans, or text
        — rather than FK references to other tables. In a generated
        ``FeatureRecord`` subclass these fields are typed according to the
        ERMrest column type (``int``, ``float``, ``bool``, or ``str``).

        Returns:
            set[Column]: ERMrest ``Column`` objects that contain direct data
            values. Computed as ``feature_columns() - asset_columns() -
            term_columns()``.

        Note:
            Only available on a class returned by ``Feature.feature_record_class()``.
            Calling this on the ``FeatureRecord`` base class (where ``feature``
            is ``None``) raises ``AttributeError``.
        """
        return cls.feature.value_columns

asset_columns classmethod

asset_columns() -> set[Column]

Return columns that reference asset tables.

Asset columns are FK columns whose referent table is classified as an asset table (e.g., Image, Scan). In a generated FeatureRecord subclass these fields accept str | Path values.

Returns:

Type Description
set[Column]

ERMrest Column objects that are FK references to asset tables. A subset of feature_columns().

Note

Only available on a class returned by Feature.feature_record_class(). Calling this on the FeatureRecord base class (where feature is None) raises AttributeError.

Source code in src/deriva_ml/feature.py
@classmethod
def asset_columns(cls) -> set[Column]:
    """Return columns that reference asset tables.

    Asset columns are FK columns whose referent table is classified as an
    asset table (e.g., ``Image``, ``Scan``). In a generated
    ``FeatureRecord`` subclass these fields accept ``str | Path`` values.

    Returns:
        set[Column]: ERMrest ``Column`` objects that are FK references to
        asset tables. A subset of ``feature_columns()``.

    Note:
        Only available on a class returned by ``Feature.feature_record_class()``.
        Calling this on the ``FeatureRecord`` base class (where ``feature``
        is ``None``) raises ``AttributeError``.
    """
    return cls.feature.asset_columns

feature_columns classmethod

feature_columns() -> set[Column]

Return all columns specific to this feature.

Returns the full set of feature-specific columns — the union of asset_columns, term_columns, and value_columns. System columns (RID, RCT, RMT, RCB, RMB) and structural association columns (Feature_Name, the target-table FK, and Execution) are excluded.

Returns:

Type Description
set[Column]

Feature-specific ERMrest Column objects. Equivalent to cls.feature.feature_columns.

Note

Only available on a class returned by Feature.feature_record_class(). Calling this on the FeatureRecord base class (where feature is None) raises AttributeError.

Source code in src/deriva_ml/feature.py
@classmethod
def feature_columns(cls) -> set[Column]:
    """Return all columns specific to this feature.

    Returns the full set of feature-specific columns — the union of
    ``asset_columns``, ``term_columns``, and ``value_columns``. System
    columns (``RID``, ``RCT``, ``RMT``, ``RCB``, ``RMB``) and structural
    association columns (``Feature_Name``, the target-table FK, and
    ``Execution``) are excluded.

    Returns:
        set[Column]: Feature-specific ERMrest ``Column`` objects. Equivalent
        to ``cls.feature.feature_columns``.

    Note:
        Only available on a class returned by ``Feature.feature_record_class()``.
        Calling this on the ``FeatureRecord`` base class (where ``feature``
        is ``None``) raises ``AttributeError``.
    """
    return cls.feature.feature_columns
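
The relationship between the column groups can be sketched with plain Python sets. This is only an illustration: the column names below are hypothetical, and strings stand in for the ERMrest Column objects that the real classmethods return.

```python
# Hypothetical column names standing in for ERMrest Column objects.
asset_columns = {"Mask_Image"}
term_columns = {"Diagnosis"}
value_columns = {"Confidence"}

# feature_columns is described as the union of the three groups.
feature_columns = asset_columns | term_columns | value_columns

# System and structural columns are described as excluded from every group.
# "Image" plays the role of the target-table FK in this toy schema.
excluded = {"RID", "RCT", "RMT", "RCB", "RMB", "Feature_Name", "Image", "Execution"}


def classify(column: str) -> str:
    """Return the group a column name belongs to in this toy schema."""
    if column in asset_columns:
        return "asset"
    if column in term_columns:
        return "term"
    if column in value_columns:
        return "value"
    return "excluded"
```

Under these assumptions, subtracting the asset and term groups from feature_columns leaves exactly the value group, which matches how value_columns is described.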

select_by_execution staticmethod

select_by_execution(execution_rid: str)

Return a selector that picks the newest record from a specific execution.

Creates a selector function that filters records to those produced by the given execution, then returns the newest match by RCT. This is useful when multiple executions have produced values for the same feature and you want results from a specific run.

Unlike select_by_workflow (a factory that resolves the workflow's execution set from a container), this selector filters on a known Execution RID with no container dependency and can be passed directly as the selector argument to feature_values:

for rec in ml.feature_values(
    "Image",
    feature_name="FooBar",
    selector=FeatureRecord.select_by_execution("3WY2"),
):
    ...

Parameters:

Name Type Description Default
execution_rid str

RID of the execution to filter by.

required

Returns:

Type Description

A selector function (list[FeatureRecord]) -> FeatureRecord suitable for use as the selector= argument to feature_values.

Raises:

Type Description
DerivaMLException

If no records in the group match the given execution RID.

Examples:

Select values from a specific execution:

>>> for rec in ml.feature_values(  # doctest: +SKIP
...     "Image",
...     feature_name="Classification",
...     selector=FeatureRecord.select_by_execution("3WY2"),
... ):
...     print(rec)
Source code in src/deriva_ml/feature.py
@staticmethod
def select_by_execution(execution_rid: str):
    """Return a selector that picks the newest record from a specific execution.

    Creates a selector function that filters records to those produced by
    the given execution, then returns the newest match by RCT. This is
    useful when multiple executions have produced values for the same
    feature and you want results from a specific run.

    Unlike ``select_by_workflow`` (a factory that resolves the workflow's
    execution set from a container), this selector filters on a known
    ``Execution`` RID with no container dependency and can be passed
    directly as the ``selector`` argument to ``feature_values``::

        for rec in ml.feature_values(
            "Image",
            feature_name="FooBar",
            selector=FeatureRecord.select_by_execution("3WY2"),
        ):
            ...

    Args:
        execution_rid: RID of the execution to filter by.

    Returns:
        A selector function ``(list[FeatureRecord]) -> FeatureRecord``
        suitable for use as the ``selector=`` argument to ``feature_values``.

    Raises:
        DerivaMLException: If no records in the group match the
            given execution RID.

    Examples:
        Select values from a specific execution::

            >>> for rec in ml.feature_values(  # doctest: +SKIP
            ...     "Image",
            ...     feature_name="Classification",
            ...     selector=FeatureRecord.select_by_execution("3WY2"),
            ... ):
            ...     print(rec)
    """

    def _selector(records: list["FeatureRecord"]) -> "FeatureRecord":
        filtered = [r for r in records if r.Execution == execution_rid]
        if not filtered:
            from deriva_ml.core.exceptions import DerivaMLException

            raise DerivaMLException(
                f"No feature records match execution '{execution_rid}'."
            )
        return FeatureRecord.select_newest(filtered)

    return _selector
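
The filter-then-newest behaviour and the raise-on-no-match rule can be tried without a catalog. The sketch below is a standalone imitation, not the library code: `Rec` is a hypothetical stand-in for a generated FeatureRecord subclass, `LookupError` stands in for DerivaMLException, and select_newest is assumed to pick the maximum RCT with a missing RCT sorting first.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rec:
    """Hypothetical stand-in for a generated FeatureRecord subclass."""
    Image: str
    Execution: str
    Label: str
    RCT: Optional[str] = None


def by_execution(execution_rid: str) -> Callable[[list], Rec]:
    """Imitate select_by_execution; LookupError replaces DerivaMLException."""

    def _selector(records: list) -> Rec:
        filtered = [r for r in records if r.Execution == execution_rid]
        if not filtered:
            raise LookupError(
                f"No feature records match execution '{execution_rid}'."
            )
        # Assumed select_newest behaviour: max RCT, missing RCT sorts first.
        return max(filtered, key=lambda r: r.RCT or "")

    return _selector


# Three records for the same target image, two from execution 3WY2.
group = [
    Rec("IMG-1", "3WY2", "benign", "2024-01-01T10:00:00+00:00"),
    Rec("IMG-1", "3WY2", "malignant", "2024-03-01T10:00:00+00:00"),
    Rec("IMG-1", "4ABC", "benign", "2024-06-01T10:00:00+00:00"),
]
picked = by_execution("3WY2")(group)
```

Here `picked` is the later of the two 3WY2 records, even though a newer record from another execution exists in the group. Calling the selector for an execution with no records in the group raises instead of returning None.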

select_by_workflow classmethod

select_by_workflow(
    workflow: str, *, container
) -> Callable[
    [list[FeatureRecord]],
    FeatureRecord | None,
]

Return a selector that picks the newest record from a specific workflow.

Creates a selector function that filters records to those produced by executions of the given workflow, then returns the newest match by RCT. This is the recommended replacement for the retired DerivaML.select_by_workflow(records, workflow) method.

Unlike select_by_execution, which requires knowing a specific execution RID, this selector works at the workflow level — it accepts any record produced by any execution of the named workflow.

Eager resolution: the workflow's execution list is resolved once at factory-construction time by calling container.list_workflow_executions(workflow). Unknown-workflow errors therefore surface immediately (at factory-call time), not lazily during iteration.

None return semantics: when no record in a group matches the workflow, the selector returns None. feature_values treats None as "feature absent for this target RID" and omits the target from the iterator. This is distinct from select_by_execution, which raises on no-match.

Parameters:

Name Type Description Default
workflow str

Name (or RID) of the workflow to filter by. Must be a workflow known to container; an unknown name raises DerivaMLException immediately.

required
container

Required keyword-only argument. An object that implements list_workflow_executions(workflow) -> list[str]. Typically a DerivaML, Dataset, or DatasetBag instance. The container determines which executions are in scope (all catalog executions for DerivaML; dataset-scoped executions for Dataset / DatasetBag).

required

Returns:

Type Description
Callable[[list[FeatureRecord]], FeatureRecord | None]

A selector callable (list[FeatureRecord]) -> FeatureRecord | None suitable for use as the selector= argument to feature_values. Returns None when no record in the group matches the workflow; returns the newest matching record (by RCT) otherwise.

Raises:

Type Description
DerivaMLException

If workflow is not known to container. Raised at factory-construction time (eager resolution).

TypeError

If container is passed positionally (it is keyword-only).

Example

Select Glaucoma labels produced by a specific training workflow:

>>> selector = FeatureRecord.select_by_workflow(  # doctest: +SKIP
...     "Glaucoma_Training_v2", container=ml
... )
>>> for rec in ml.feature_values(  # doctest: +SKIP
...     "Image", "Glaucoma", selector=selector
... ):
...     print(f"{rec.Image}: {rec.Glaucoma}")

Works identically on a downloaded bag (offline):

>>> selector = FeatureRecord.select_by_workflow(  # doctest: +SKIP
...     "Glaucoma_Training_v2", container=bag
... )
>>> labels = list(bag.feature_values("Image", "Glaucoma", selector=selector))  # doctest: +SKIP
Source code in src/deriva_ml/feature.py
@classmethod
def select_by_workflow(
    cls, workflow: str, *, container
) -> "Callable[[list[FeatureRecord]], FeatureRecord | None]":
    """Return a selector that picks the newest record from a specific workflow.

    Creates a selector function that filters records to those produced by
    executions of the given workflow, then returns the newest match by RCT.
    This is the recommended replacement for the retired
    ``DerivaML.select_by_workflow(records, workflow)`` method.

    Unlike ``select_by_execution``, which requires knowing a specific
    execution RID, this selector works at the workflow level — it accepts
    any record produced by any execution of the named workflow.

    **Eager resolution:** the workflow's execution list is resolved once
    at factory-construction time by calling
    ``container.list_workflow_executions(workflow)``. Unknown-workflow
    errors therefore surface immediately (at factory-call time), not
    lazily during iteration.

    **None return semantics:** when no record in a group matches the
    workflow, the selector returns ``None``. ``feature_values`` treats
    ``None`` as "feature absent for this target RID" and omits the target
    from the iterator. This is distinct from ``select_by_execution``, which
    raises on no-match.

    Args:
        workflow: Name (or RID) of the workflow to filter by. Must be a
            workflow known to ``container``; an unknown name raises
            ``DerivaMLException`` immediately.
        container: Required keyword-only argument. An object that
            implements ``list_workflow_executions(workflow) -> list[str]``.
            Typically a ``DerivaML``, ``Dataset``, or ``DatasetBag``
            instance. The container determines which executions are in
            scope (all catalog executions for ``DerivaML``; dataset-scoped
            executions for ``Dataset`` / ``DatasetBag``).

    Returns:
        A selector callable ``(list[FeatureRecord]) -> FeatureRecord | None``
        suitable for use as the ``selector=`` argument to
        ``feature_values``. Returns ``None``
        when no record in the group matches the workflow; returns the
        newest matching record (by RCT) otherwise.

    Raises:
        DerivaMLException: If ``workflow`` is not known to ``container``.
            Raised at factory-construction time (eager resolution).
        TypeError: If ``container`` is passed positionally (it is
            keyword-only).

    Example:
        Select Glaucoma labels produced by a specific training workflow::

            >>> selector = FeatureRecord.select_by_workflow(  # doctest: +SKIP
            ...     "Glaucoma_Training_v2", container=ml
            ... )
            >>> for rec in ml.feature_values(  # doctest: +SKIP
            ...     "Image", "Glaucoma", selector=selector
            ... ):
            ...     print(f"{rec.Image}: {rec.Glaucoma}")

        Works identically on a downloaded bag (offline)::

            >>> selector = FeatureRecord.select_by_workflow(  # doctest: +SKIP
            ...     "Glaucoma_Training_v2", container=bag
            ... )
            >>> labels = list(bag.feature_values("Image", "Glaucoma", selector=selector))  # doctest: +SKIP
    """
    # Eager resolution: fail fast on unknown workflow at construction time,
    # not lazily during iteration. Convert to a set for O(1) membership
    # testing inside the closure.
    execution_rids: set[str] = set(container.list_workflow_executions(workflow))

    def _selector(records: list["FeatureRecord"]) -> "FeatureRecord | None":
        matched = [r for r in records if r.Execution in execution_rids]
        if not matched:
            # Return None so feature_values omits this target RID silently.
            return None
        return FeatureRecord.select_newest(matched)

    return _selector
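
Eager resolution and the None return can both be observed with a stub container. The following is a minimal sketch under stated assumptions: `FakeContainer` and `Rec` are hypothetical, `KeyError` stands in for DerivaMLException, and the final loop only imitates the documented feature_values behaviour of omitting targets for which the selector returns None.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rec:
    """Hypothetical stand-in for a generated FeatureRecord subclass."""
    Image: str
    Execution: str
    RCT: Optional[str] = None


class FakeContainer:
    """Hypothetical container; only list_workflow_executions is required."""

    def __init__(self, workflows: dict):
        self._workflows = workflows

    def list_workflow_executions(self, workflow: str) -> list:
        if workflow not in self._workflows:
            # Stand-in for the DerivaMLException raised on unknown workflows.
            raise KeyError(f"Unknown workflow: {workflow}")
        return self._workflows[workflow]


def by_workflow(workflow: str, *, container) -> Callable[[list], Optional[Rec]]:
    # Resolved once, here, so an unknown workflow fails before any iteration.
    execution_rids = set(container.list_workflow_executions(workflow))

    def _selector(records: list) -> Optional[Rec]:
        matched = [r for r in records if r.Execution in execution_rids]
        if not matched:
            return None
        return max(matched, key=lambda r: r.RCT or "")

    return _selector


container = FakeContainer({"Glaucoma_Training_v2": ["EX1", "EX2"]})
selector = by_workflow("Glaucoma_Training_v2", container=container)

groups = {
    "IMG-1": [
        Rec("IMG-1", "EX1", "2024-01-01T00:00:00+00:00"),
        Rec("IMG-1", "EX2", "2024-02-01T00:00:00+00:00"),
    ],
    "IMG-2": [Rec("IMG-2", "EX9", "2024-03-01T00:00:00+00:00")],
}

# Imitate feature_values: skip targets whose selector result is None.
selected = {}
for target_rid, records in groups.items():
    chosen = selector(records)
    if chosen is not None:
        selected[target_rid] = chosen
```

In this sketch IMG-2 is dropped silently because its only record comes from an execution outside the workflow, while an unknown workflow name fails at the `by_workflow(...)` call itself.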

select_first staticmethod

select_first(
    records: list[FeatureRecord],
) -> FeatureRecord

Select the feature record with the earliest creation time.

Uses the RCT (Row Creation Time) field. Records with None RCT are treated as older than any timestamped record (since empty string sorts before any ISO 8601 timestamp).

Useful when you want to preserve the original annotation and ignore later revisions.

This method is designed to be passed directly as the selector argument to feature_values:

for rec in ml.feature_values("Image", selector=FeatureRecord.select_first):
    ...

Parameters:

Name Type Description Default
records list[FeatureRecord]

List of FeatureRecord instances for the same target object. Must be non-empty.

required

Returns:

Type Description
FeatureRecord

The FeatureRecord with the earliest RCT value.

Source code in src/deriva_ml/feature.py
@staticmethod
def select_first(records: list["FeatureRecord"]) -> "FeatureRecord":
    """Select the feature record with the earliest creation time.

    Uses the RCT (Row Creation Time) field. Records with ``None`` RCT
    are treated as older than any timestamped record (since empty string
    sorts before any ISO 8601 timestamp).

    Useful when you want to preserve the original annotation and ignore
    later revisions.

    This method is designed to be passed directly as the ``selector``
    argument to ``feature_values``::

        for rec in ml.feature_values("Image", selector=FeatureRecord.select_first):
            ...

    Args:
        records: List of FeatureRecord instances for the same target
            object. Must be non-empty.

    Returns:
        The FeatureRecord with the earliest RCT value.
    """
    return min(records, key=lambda r: r.RCT or "")
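
One consequence of the `r.RCT or ""` key is that a record with no RCT wins over every timestamped record. The sketch below shows this with a hypothetical `Rec` dataclass. The second expression, which ignores records without a timestamp, is an illustrative alternative and not part of the library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rec:
    """Hypothetical stand-in for a generated FeatureRecord subclass."""
    Label: str
    RCT: Optional[str] = None


records = [
    Rec("revised", "2024-05-01T09:00:00+00:00"),
    Rec("original", "2024-01-15T09:00:00+00:00"),
    Rec("imported", None),
]

# Same key as select_first: a missing RCT becomes "", which sorts first.
earliest = min(records, key=lambda r: r.RCT or "")

# Illustrative alternative: consider only records that carry a timestamp.
earliest_timestamped = min(
    (r for r in records if r.RCT), key=lambda r: r.RCT
)
```

If records without an RCT can appear in your data, it may be worth deciding explicitly which of these two behaviours you want before relying on select_first.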

select_latest staticmethod

select_latest(
    records: list[FeatureRecord],
) -> FeatureRecord

Select the most recently created feature record.

Alias for select_newest. Included for API symmetry with select_first.

Parameters:

Name Type Description Default
records list[FeatureRecord]

List of FeatureRecord instances for the same target object. Must be non-empty.

required

Returns:

Type Description
FeatureRecord

The FeatureRecord with the latest RCT value.

Source code in src/deriva_ml/feature.py
@staticmethod
def select_latest(records: list["FeatureRecord"]) -> "FeatureRecord":
    """Select the most recently created feature record.

    Alias for ``select_newest``. Included for API symmetry with
    ``select_first``.

    Args:
        records: List of FeatureRecord instances for the same target
            object. Must be non-empty.

    Returns:
        The FeatureRecord with the latest RCT value.
    """
    return FeatureRecord.select_newest(records)

select_majority_vote classmethod

select_majority_vote(
    column: str | None = None,
)

Return a selector that picks the most common value for a column.

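Creates a selector function that counts the values of the specified column across all records, finds the most frequent value, and returns the newest record (by RCT) that carries it. If several values tie for most frequent, the newest record among all tied values is returned.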

For single-term features, the column can be auto-detected from the feature's metadata. For multi-term features, the column must be specified explicitly.

This is useful for consensus labeling, where multiple annotators have labeled the same record and you want the majority opinion:

selector = RecordClass.select_majority_vote()
for rec in ml.feature_values(
    "Image",
    feature_name="Diagnosis",
    selector=selector,
):
    ...

Parameters:

Name Type Description Default
column str | None

Name of the column to count values for. If None, the column is auto-detected from feature metadata, which works only when the feature has exactly one term column.

None

Returns:

Type Description

A selector function (list[FeatureRecord]) -> FeatureRecord.

Raises:

Type Description
DerivaMLException

If column is None and the feature has no term columns or multiple term columns.

Source code in src/deriva_ml/feature.py
@classmethod
def select_majority_vote(cls, column: str | None = None):
    """Return a selector that picks the most common value for a column.

    Creates a selector function that counts the values of the specified
    column across all records, picks the most frequent one, and breaks
    ties by most recent RCT.

    For single-term features, the column can be auto-detected from the
    feature's metadata. For multi-term features, the column must be
    specified explicitly.

    This is useful for consensus labeling, where multiple annotators
    have labeled the same record and you want the majority opinion::

        selector = RecordClass.select_majority_vote()
        for rec in ml.feature_values(
            "Image",
            feature_name="Diagnosis",
            selector=selector,
        ):
            ...

    Args:
        column: Name of the column to count values for. If None, the
            column is auto-detected from feature metadata, which works
            only when the feature has exactly one term column.

    Returns:
        A selector function ``(list[FeatureRecord]) -> FeatureRecord``.

    Raises:
        DerivaMLException: If column is None and the feature has no
            term columns or multiple term columns.
    """

    def _selector(records: list["FeatureRecord"]) -> "FeatureRecord":
        col = column
        if col is None:
            # Auto-detect from feature metadata on the record class
            record_cls = type(records[0])
            if (
                hasattr(record_cls, "feature")
                and record_cls.feature
                and record_cls.feature.term_columns
            ):
                if len(record_cls.feature.term_columns) == 1:
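                    # term_columns is documented as a set, which cannot be
                    # indexed, so take its single element instead.
                    col = next(iter(record_cls.feature.term_columns)).name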
                else:
                    from deriva_ml.core.exceptions import (
                        DerivaMLException,
                    )

                    raise DerivaMLException(
                        "select_majority_vote requires a column name for "
                        "features with multiple term columns. "
                        f"Available: {[c.name for c in record_cls.feature.term_columns]}"
                    )
            else:
                from deriva_ml.core.exceptions import (
                    DerivaMLException,
                )

                raise DerivaMLException(
                    "select_majority_vote requires a column name — "
                    "could not auto-detect from feature metadata."
                )

        from collections import Counter

        counts = Counter(getattr(r, col, None) for r in records)
        max_count = max(counts.values())
        majority_values = {v for v, c in counts.items() if c == max_count}
        candidates = [
            r for r in records if getattr(r, col, None) in majority_values
        ]
        return max(candidates, key=lambda r: r.RCT or "")

    return _selector
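
The counting and tie-breaking steps in the listing above can be tried in isolation. The following is a minimal sketch, not the library's API: `StubRecord` and `majority_vote` are hypothetical stand-ins that copy only the selection logic for an explicitly named column. It illustrates that when two values tie for the highest count, the newest record (by RCT) among the tied values is returned.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class StubRecord:
    """Hypothetical stand-in for a generated FeatureRecord subclass."""

    Diagnosis: str
    RCT: Optional[str] = None


def majority_vote(records, column):
    """Copy of the selection logic above, for an explicit column name."""
    counts = Counter(getattr(r, column, None) for r in records)
    max_count = max(counts.values())
    # More than one value can share the highest count.
    majority_values = {v for v, c in counts.items() if c == max_count}
    candidates = [r for r in records if getattr(r, column, None) in majority_values]
    # Among records holding a majority value, prefer the newest RCT.
    return max(candidates, key=lambda r: r.RCT or "")


records = [
    StubRecord("benign", "2024-01-01T00:00:00+00:00"),
    StubRecord("malignant", "2024-01-02T00:00:00+00:00"),
    StubRecord("benign", "2024-01-03T00:00:00+00:00"),
]
winner = majority_vote(records, "Diagnosis")

# One vote each: both values are "majority" values, so RCT decides.
tied = records[:2]
tie_winner = majority_vote(tied, "Diagnosis")
```

Here `winner` is the later of the two "benign" records, and `tie_winner` is the "malignant" record because it carries the newer RCT.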

select_newest staticmethod

select_newest(
    records: list[FeatureRecord],
) -> FeatureRecord

Select the feature record with the most recent creation time.

Uses the RCT (Row Creation Time) field to determine recency. RCT is an ISO 8601 timestamp string, so lexicographic comparison identifies the most recent record as long as all values share the same format and UTC offset. Records whose RCT is None are treated as older than any timestamped record.

This method is designed to be passed directly as the selector argument to feature_values:

for rec in ml.feature_values("Image", selector=FeatureRecord.select_newest):
    ...

Parameters:

Name | Type | Description | Default
records | list[FeatureRecord] | List of FeatureRecord instances for the same target object. Must be non-empty. | required

Returns:

Type | Description
FeatureRecord | The FeatureRecord with the latest RCT value.

Source code in src/deriva_ml/feature.py
@staticmethod
def select_newest(records: list["FeatureRecord"]) -> "FeatureRecord":
    """Select the feature record with the most recent creation time.

    Uses the RCT (Row Creation Time) field to determine recency. RCT is
    an ISO 8601 timestamp string, so lexicographic comparison identifies
    the most recent record as long as all values share the same format
    and UTC offset. Records with ``None`` RCT are treated as older than
    any timestamped record.

    This method is designed to be passed directly as the ``selector``
    argument to ``feature_values``::

        for rec in ml.feature_values("Image", selector=FeatureRecord.select_newest):
            ...

    Args:
        records: List of FeatureRecord instances for the same target
            object. Must be non-empty.

    Returns:
        The FeatureRecord with the latest RCT value.
    """
    return max(records, key=lambda r: r.RCT or "")
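
The effect of the `r.RCT or ""` key can be checked without a catalog. This is a small sketch under stated assumptions: `StubRecord` and `newest` are hypothetical stand-ins that reuse the same key expression, and the timestamps are made up. The second half illustrates why string comparison matches time order only when the timestamps use a consistent UTC offset.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class StubRecord:
    """Hypothetical stand-in for a FeatureRecord; only RCT matters here."""

    Label: str
    RCT: Optional[str] = None


def newest(records):
    # Same key as select_newest: a missing RCT sorts as the empty string,
    # which compares lower than any non-empty timestamp.
    return max(records, key=lambda r: r.RCT or "")


records = [
    StubRecord("no-timestamp"),
    StubRecord("noon", "2024-03-01T12:00:00+00:00"),
    StubRecord("morning", "2024-03-01T09:30:00+00:00"),
]
picked = newest(records)

# With mixed UTC offsets, string order and time order can disagree.
a = "2024-03-01T12:00:00+02:00"  # 10:00 UTC
b = "2024-03-01T11:00:00+00:00"  # 11:00 UTC
string_order_says_a_is_newer = a > b
time_order_says_a_is_newer = datetime.fromisoformat(a) > datetime.fromisoformat(b)
```

`picked` is the "noon" record. For the mixed-offset pair, the string comparison ranks `a` as newer while the parsed datetimes rank `b` as newer, which is the case the same-offset condition rules out.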

term_columns classmethod

term_columns() -> set[Column]

Return columns that reference controlled vocabulary terms.

Term columns are FK columns whose referent table is classified as a vocabulary table. In a generated FeatureRecord subclass these fields accept str values (the term name, not the RID).

Returns:

Type | Description
set[Column] | ERMrest Column objects that are FK references to vocabulary tables. A subset of feature_columns().

Note

Only available on a class returned by Feature.feature_record_class(). Calling this on the FeatureRecord base class (where feature is None) raises AttributeError.

Source code in src/deriva_ml/feature.py
@classmethod
def term_columns(cls) -> set[Column]:
    """Return columns that reference controlled vocabulary terms.

    Term columns are FK columns whose referent table is classified as a
    vocabulary table. In a generated ``FeatureRecord`` subclass these
    fields accept ``str`` values (the term name, not the RID).

    Returns:
        set[Column]: ERMrest ``Column`` objects that are FK references to
        vocabulary tables. A subset of ``feature_columns()``.

    Note:
        Only available on a class returned by ``Feature.feature_record_class()``.
        Calling this on the ``FeatureRecord`` base class (where ``feature``
        is ``None``) raises ``AttributeError``.
    """
    return cls.feature.term_columns
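
The note about the base class follows directly from the one-line body: with `feature` set to `None`, the attribute lookup fails. A minimal sketch of that behavior, using hypothetical stand-in classes (`StubFeature`, `StubRecordBase`, `DiagnosisRecord`) rather than the real deriva-ml types, and plain strings in place of ERMrest `Column` objects:

```python
class StubFeature:
    """Hypothetical stand-in for Feature, holding only term_columns."""

    def __init__(self, term_columns):
        self.term_columns = term_columns


class StubRecordBase:
    """Hypothetical stand-in for the FeatureRecord base class."""

    feature = None  # unbound on the base class

    @classmethod
    def term_columns(cls):
        return cls.feature.term_columns


class DiagnosisRecord(StubRecordBase):
    """Plays the role of a class returned by feature_record_class()."""

    feature = StubFeature({"Diagnosis"})


try:
    StubRecordBase.term_columns()
    base_error = None
except AttributeError as exc:
    # None has no attribute 'term_columns'
    base_error = exc

bound_terms = DiagnosisRecord.term_columns()
```

Calling the method on the bound subclass returns the term column set, while the unbound base class raises `AttributeError`.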

value_columns classmethod

value_columns() -> set[Column]

Return columns that contain direct (non-FK) values.

Value columns hold scalar data — integers, floats, booleans, or text — rather than FK references to other tables. In a generated FeatureRecord subclass these fields are typed according to the ERMrest column type (int, float, bool, or str).

Returns:

Type | Description
set[Column] | ERMrest Column objects that contain direct data values. Computed as feature_columns() - asset_columns() - term_columns().

Note

Only available on a class returned by Feature.feature_record_class(). Calling this on the FeatureRecord base class (where feature is None) raises AttributeError.

Source code in src/deriva_ml/feature.py
@classmethod
def value_columns(cls) -> set[Column]:
    """Return columns that contain direct (non-FK) values.

    Value columns hold scalar data — integers, floats, booleans, or text
    — rather than FK references to other tables. In a generated
    ``FeatureRecord`` subclass these fields are typed according to the
    ERMrest column type (``int``, ``float``, ``bool``, or ``str``).

    Returns:
        set[Column]: ERMrest ``Column`` objects that contain direct data
        values. Computed as ``feature_columns() - asset_columns() -
        term_columns()``.

    Note:
        Only available on a class returned by ``Feature.feature_record_class()``.
        Calling this on the ``FeatureRecord`` base class (where ``feature``
        is ``None``) raises ``AttributeError``.
    """
    return cls.feature.value_columns
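
The set relationship stated in the docstring can be illustrated with plain Python sets. This sketch uses column names as strings in place of ERMrest `Column` objects. `Diagnosis` and `Confidence` come from the usage example at the top of this page, and `Heatmap` is a hypothetical asset column added for illustration.

```python
# Hypothetical column partition for an Image "Diagnosis" feature.
feature_columns = {"Diagnosis", "Confidence", "Heatmap"}
asset_columns = {"Heatmap"}      # FK references to asset tables
term_columns = {"Diagnosis"}     # FK references to vocabulary tables

# Value columns are whatever remains once FK-backed columns are removed.
value_columns = feature_columns - asset_columns - term_columns

# The three groups do not overlap and together cover feature_columns.
partition_is_complete = (
    asset_columns | term_columns | value_columns
) == feature_columns
```

Under these assumptions `value_columns` contains only `Confidence`, the scalar field.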