Feature Classes

Feature management for ML experiments. Features represent measurable properties or characteristics that can be attached to domain entities and tracked across executions.

Feature implementation for deriva-ml.

This module provides classes for defining and managing features in deriva-ml. Features represent measurable properties or characteristics that can be associated with records in a table. The module includes:

  • Feature: Main class for defining and managing features
  • FeatureRecord: Base class for feature records using pydantic models

Typical usage example:

feature = Feature(association_result, model)
FeatureClass = feature.feature_record_class()
record = FeatureClass(value="high", confidence=0.95)

Feature

Manages feature definitions and their relationships in the catalog.

A Feature represents a measurable property or characteristic that can be associated with records in a table. Features can include asset references, controlled vocabulary terms, and custom metadata fields.

Attributes:

  • feature_table (Table): Table containing the feature implementation.
  • target_table (Table): Table that the feature is associated with.
  • feature_name (str): Name of the feature (from Feature_Name column default).
  • feature_columns (set[Column]): All columns specific to this feature.
  • asset_columns (set[Column]): Columns referencing asset tables.
  • term_columns (set[Column]): Columns referencing vocabulary tables.
  • value_columns (set[Column]): Columns containing direct values (not FK references).

Example

feature = Feature(association_result, model)
print(f"Feature {feature.feature_name} on {feature.target_table.name}")
print("Asset columns:", [c.name for c in feature.asset_columns])
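
The way the column sets relate to one another can be illustrated with a minimal, self-contained sketch. It does not use deriva-ml: `Col` and `partition_columns` are hypothetical stand-ins, and the `references` field is an assumed shortcut for the foreign-key inspection that the real class performs against the catalog model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Col:
    """Hypothetical stand-in for a catalog column (not a deriva-ml class)."""

    name: str
    references: Optional[str] = None  # "asset", "vocabulary", or None for a plain value


def partition_columns(columns, target_table_name):
    """Split feature-table columns into feature, asset, term, and value sets."""
    # System columns and the association columns are not feature-specific.
    skip = {"RID", "RMB", "RCB", "RCT", "RMT", "Feature_Name", "Execution", target_table_name}
    feature_columns = {c for c in columns if c.name not in skip}
    asset_columns = {c for c in feature_columns if c.references == "asset"}
    term_columns = {c for c in feature_columns if c.references == "vocabulary"}
    # Whatever is neither an asset nor a term reference is a direct value.
    value_columns = feature_columns - (asset_columns | term_columns)
    return feature_columns, asset_columns, term_columns, value_columns


columns = [
    Col("RID"), Col("Image"), Col("Execution"), Col("Feature_Name"),
    Col("Diagnosis", "vocabulary"), Col("Mask", "asset"), Col("Confidence"),
]
feature_cols, asset_cols, term_cols, value_cols = partition_columns(columns, "Image")
print(sorted(c.name for c in value_cols))
```

The value set is defined by subtraction, so any feature column that is not recognized as an asset or vocabulary reference ends up there.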

Source code in src/deriva_ml/feature.py
class Feature:
    """Manages feature definitions and their relationships in the catalog.

    A Feature represents a measurable property or characteristic that can be associated with records in a table.
    Features can include asset references, controlled vocabulary terms, and custom metadata fields.

    Attributes:
        feature_table (Table): Table containing the feature implementation.
        target_table (Table): Table that the feature is associated with.
        feature_name (str): Name of the feature (from Feature_Name column default).
        feature_columns (set[Column]): All columns specific to this feature.
        asset_columns (set[Column]): Columns referencing asset tables.
        term_columns (set[Column]): Columns referencing vocabulary tables.
        value_columns (set[Column]): Columns containing direct values (not FK references).

    Example:
        >>> feature = Feature(association_result, model)
        >>> print(f"Feature {feature.feature_name} on {feature.target_table.name}")
        >>> print("Asset columns:", [c.name for c in feature.asset_columns])
    """

    def __init__(self, atable: FindAssociationResult, model: "DerivaModel") -> None:
        self.feature_table = atable.table
        self.target_table = atable.self_fkey.pk_table
        self.feature_name = atable.table.columns["Feature_Name"].default
        self._model = model

        skip_columns = {
            "RID",
            "RMB",
            "RCB",
            "RCT",
            "RMT",
            "Feature_Name",
            self.target_table.name,
            "Execution",
        }
        self.feature_columns = {c for c in self.feature_table.columns if c.name not in skip_columns}

        assoc_fkeys = {atable.self_fkey} | atable.other_fkeys

        # Determine the role of each column in the feature outside the FK columns.
        self.asset_columns = {
            fk.foreign_key_columns[0]
            for fk in self.feature_table.foreign_keys
            if fk not in assoc_fkeys and self._model.is_asset(fk.pk_table)
        }

        self.term_columns = {
            fk.foreign_key_columns[0]
            for fk in self.feature_table.foreign_keys
            if fk not in assoc_fkeys and self._model.is_vocabulary(fk.pk_table)
        }

        self.value_columns = self.feature_columns - (self.asset_columns | self.term_columns)

    def feature_record_class(self) -> type[FeatureRecord]:
        """Create a dynamically generated Pydantic model class for this feature.

        The returned class is a subclass of FeatureRecord with fields derived from
        the feature table's columns. Term columns accept vocabulary term names (str),
        asset columns accept file paths (str | Path), and value columns are typed
        according to their database column type (int, float, bool, or str otherwise).

        Returns:
            A FeatureRecord subclass with validated fields matching this feature's schema.
        """

        def map_type(c: Column) -> UnionType | Type[str] | Type[int] | Type[float] | Type[bool]:
            """Maps a Deriva column type to a Python/pydantic type.

            Converts ERMrest column types to appropriate Python types for use in pydantic models.
            Special handling is provided for asset columns which can accept either strings or Path objects.

            Args:
                c: ERMrest column to map to a Python type.

            Returns:
                UnionType | Type[str] | Type[int] | Type[float] | Type[bool]: Appropriate Python type for the column:
                    - str | Path for asset columns
                    - str for text columns
                    - int for integer columns
                    - float for floating point columns
                    - bool for boolean columns
                    - str for all other types

            Example:
                >>> col = Column(name="score", type="float4")
                >>> typ = map_type(col)  # Returns float
            """
            if c.name in {c.name for c in self.asset_columns}:
                return str | Path

            match c.type.typename:
                case "text":
                    return str
                case "int2" | "int4" | "int8":
                    return int
                case "float4" | "float8":
                    return float
                case "boolean":
                    return bool
                case _:
                    return str

        featureclass_name = f"{self.target_table.name}Feature{self.feature_name}"

        # Create feature class. To do this, we must determine the python type for each column and also if the
        # column is optional or not based on its nullability.
        feature_columns = {
            c.name: (
                Optional[map_type(c)] if c.nullok else map_type(c),
                c.default or None,
            )
            for c in self.feature_columns
        } | {
            "Feature_Name": (
                str,
                self.feature_name,
            ),  # Set default value for Feature_Name
            self.target_table.name: (str, ...),
        }
        docstring = (
            f"Class to capture fields in a feature {self.feature_name} on table {self.target_table}. "
            "Feature columns include:\n"
        )
        docstring += "\n".join([f"    {c.name}" for c in self.feature_columns])

        model = create_model(
            featureclass_name,
            __base__=FeatureRecord,
            __doc__=docstring,
            **feature_columns,
        )
        model.feature = self  # Set value of class variable within the feature class definition.

        return model

    def __repr__(self) -> str:
        return (
            f"Feature(target_table={self.target_table.name}, feature_name={self.feature_name}, "
            f"feature_table={self.feature_table.name})"
        )

feature_record_class

feature_record_class() -> type[FeatureRecord]

Create a dynamically generated Pydantic model class for this feature.

The returned class is a subclass of FeatureRecord with fields derived from the feature table's columns. Term columns accept vocabulary term names (str), asset columns accept file paths (str | Path), and value columns are typed according to their database column type (int, float, bool, or str otherwise).

Returns:

  • type[FeatureRecord]: A FeatureRecord subclass with validated fields matching this feature's schema.
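
The type mapping described here can be sketched without a catalog. The function below is a hypothetical helper, not part of deriva-ml: it takes a plain type-name string and two flags instead of an ERMrest column, and follows the same branches as the `map_type` helper in the source listing, including wrapping nullable columns in `Optional`.

```python
from pathlib import Path
from typing import Optional, Union


def python_type_for(typename: str, is_asset: bool = False, nullok: bool = False):
    """Map a column type name to a Python type (illustrative stand-in)."""
    if is_asset:
        base = Union[str, Path]  # asset columns accept a path as str or Path
    elif typename in {"int2", "int4", "int8"}:
        base = int
    elif typename in {"float4", "float8"}:
        base = float
    elif typename == "boolean":
        base = bool
    else:
        base = str  # text and every unrecognized type fall back to str
    # Nullable columns become Optional so that None passes validation.
    return Optional[base] if nullok else base


print(python_type_for("float4"))
print(python_type_for("int8", nullok=True))
print(python_type_for("timestamptz"))
```

One consequence of the fallback branch is that timestamp, date, and JSON columns would all be treated as plain strings in this sketch.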

Source code in src/deriva_ml/feature.py
def feature_record_class(self) -> type[FeatureRecord]:
    """Create a dynamically generated Pydantic model class for this feature.

    The returned class is a subclass of FeatureRecord with fields derived from
    the feature table's columns. Term columns accept vocabulary term names (str),
    asset columns accept file paths (str | Path), and value columns are typed
    according to their database column type (int, float, bool, or str otherwise).

    Returns:
        A FeatureRecord subclass with validated fields matching this feature's schema.
    """

    def map_type(c: Column) -> UnionType | Type[str] | Type[int] | Type[float] | Type[bool]:
        """Maps a Deriva column type to a Python/pydantic type.

        Converts ERMrest column types to appropriate Python types for use in pydantic models.
        Special handling is provided for asset columns which can accept either strings or Path objects.

        Args:
            c: ERMrest column to map to a Python type.

        Returns:
            UnionType | Type[str] | Type[int] | Type[float] | Type[bool]: Appropriate Python type for the column:
                - str | Path for asset columns
                - str for text columns
                - int for integer columns
                - float for floating point columns
                - bool for boolean columns
                - str for all other types

        Example:
            >>> col = Column(name="score", type="float4")
            >>> typ = map_type(col)  # Returns float
        """
        if c.name in {c.name for c in self.asset_columns}:
            return str | Path

        match c.type.typename:
            case "text":
                return str
            case "int2" | "int4" | "int8":
                return int
            case "float4" | "float8":
                return float
            case "boolean":
                return bool
            case _:
                return str

    featureclass_name = f"{self.target_table.name}Feature{self.feature_name}"

    # Create feature class. To do this, we must determine the python type for each column and also if the
    # column is optional or not based on its nullability.
    feature_columns = {
        c.name: (
            Optional[map_type(c)] if c.nullok else map_type(c),
            c.default or None,
        )
        for c in self.feature_columns
    } | {
        "Feature_Name": (
            str,
            self.feature_name,
        ),  # Set default value for Feature_Name
        self.target_table.name: (str, ...),
    }
    docstring = (
        f"Class to capture fields in a feature {self.feature_name} on table {self.target_table}. "
        "Feature columns include:\n"
    )
    docstring += "\n".join([f"    {c.name}" for c in self.feature_columns])

    model = create_model(
        featureclass_name,
        __base__=FeatureRecord,
        __doc__=docstring,
        **feature_columns,
    )
    model.feature = self  # Set value of class variable within the feature class definition.

    return model

FeatureRecord

Bases: BaseModel

Base class for dynamically generated feature record models.

This class serves as the base for pydantic models that represent feature records. Each feature record contains the values and metadata associated with a feature instance. Subclasses are created dynamically by Feature.feature_record_class() with fields corresponding to the feature's vocabulary terms, asset references, and metadata columns.

Feature records are returned by list_feature_values() and fetch_table_features(). They can also be constructed manually and passed to Execution.add_features() to insert new values into the catalog.

Handling multiple values per target:

When the same target object (e.g., an Image) has multiple feature values, for example labels from different annotators or model runs, use a selector function to choose one. Pass it to fetch_table_features or list_feature_values. A selector receives a list of FeatureRecord instances for the same target and returns the selected one:

# Built-in: pick the most recently created record
features = ml.fetch_table_features(
    "Image", selector=FeatureRecord.select_newest
)

# Custom: pick the record with highest confidence
def select_best(records):
    return max(records, key=lambda r: getattr(r, "Confidence", 0))

features = ml.fetch_table_features("Image", selector=select_best)
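
To make the selector contract concrete, here is a minimal sketch that groups records by target and applies a selector to each group. It is an assumption about the general shape of the mechanism, not the library's implementation: `Record` and `apply_selector` are hypothetical names, and how `fetch_table_features` groups records internally is not shown in this page.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Record:
    """Hypothetical stand-in for a FeatureRecord subclass on an Image table."""

    Image: str  # RID of the target object
    Label: str
    RCT: Optional[str] = None


def apply_selector(records, selector: Callable[[list], Record]) -> dict:
    """Group records by target RID and keep one record per group."""
    groups = defaultdict(list)
    for r in records:
        groups[r.Image].append(r)
    return {rid: selector(group) for rid, group in groups.items()}


def select_newest(records):
    # Same rule as FeatureRecord.select_newest: latest RCT wins, None sorts lowest.
    return max(records, key=lambda r: r.RCT or "")


records = [
    Record("1-A", "cat", "2024-06-15T10:30:00.000000+00:00"),
    Record("1-A", "dog", "2024-07-01T08:00:00.000000+00:00"),
    Record("1-B", "cat", "2024-06-20T12:00:00.000000+00:00"),
]
chosen = apply_selector(records, select_newest)
print({rid: r.Label for rid, r in chosen.items()})
```

Because a selector only sees the records for one target at a time, it can stay a small pure function with no knowledge of the catalog.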

Attributes:

  • Execution (Optional[str]): RID of the execution that created this feature record. Links to the Execution table for provenance tracking; use this to trace which workflow run produced this value.
  • Feature_Name (str): Name of the feature this record belongs to.
  • RCT (Optional[str]): Row Creation Time, an ISO 8601 timestamp string (e.g., "2024-06-15T10:30:00.000000+00:00"). Populated automatically when reading from the catalog or a dataset bag. Used by select_newest to determine recency.
  • feature (ClassVar[Optional[Feature]]): Reference to the Feature definition object. Set automatically by feature_record_class() when the dynamic subclass is created. None on the base class. Provides access to the feature's column metadata, target table, and vocabulary/asset column sets.
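
Since RCT is kept as a string, recency is decided by string comparison. The short check below, using only the standard library and made-up timestamps, suggests why that works for values in the format shown above. It assumes every timestamp uses the same layout and the same UTC offset; mixed offsets would need real datetime parsing.

```python
from datetime import datetime

# Illustrative timestamps in the same format as the RCT example above.
timestamps = [
    "2024-06-15T10:30:00.000000+00:00",
    "2023-12-31T23:59:59.999999+00:00",
    "2024-06-15T10:30:00.000001+00:00",
]

# Lexicographic maximum versus chronological maximum.
newest_by_string = max(timestamps)
newest_by_datetime = max(timestamps, key=datetime.fromisoformat)
print(newest_by_string == newest_by_datetime)

# A missing RCT is mapped to "" and therefore sorts below every timestamp.
newest_with_missing = max([None, timestamps[0]], key=lambda t: t or "")
print(newest_with_missing)
```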

Source code in src/deriva_ml/feature.py
class FeatureRecord(BaseModel):
    """Base class for dynamically generated feature record models.

    This class serves as the base for pydantic models that represent feature
    records. Each feature record contains the values and metadata associated
    with a feature instance. Subclasses are created dynamically by
    ``Feature.feature_record_class()`` with fields corresponding to the
    feature's vocabulary terms, asset references, and metadata columns.

    Feature records are returned by ``list_feature_values()`` and
    ``fetch_table_features()``. They can also be constructed manually and
    passed to ``Execution.add_features()`` to insert new values into the
    catalog.

    **Handling multiple values per target:**

    When the same target object (e.g., an Image) has multiple feature values
    — for example, labels from different annotators or model runs — use
    a ``selector`` function to choose one. Pass it to ``fetch_table_features``
    or ``list_feature_values``. A selector receives a list of FeatureRecord
    instances for the same target and returns the selected one::

        # Built-in: pick the most recently created record
        features = ml.fetch_table_features(
            "Image", selector=FeatureRecord.select_newest
        )

        # Custom: pick the record with highest confidence
        def select_best(records):
            return max(records, key=lambda r: getattr(r, "Confidence", 0))

        features = ml.fetch_table_features("Image", selector=select_best)

    Attributes:
        Execution (Optional[str]): RID of the execution that created this
            feature record. Links to the ``Execution`` table for provenance
            tracking — use this to trace which workflow run produced this value.
        Feature_Name (str): Name of the feature this record belongs to.
        RCT (Optional[str]): Row Creation Time — an ISO 8601 timestamp string
            (e.g., ``"2024-06-15T10:30:00.000000+00:00"``). Populated
            automatically when reading from the catalog or a dataset bag.
            Used by ``select_newest`` to determine recency.
        feature (ClassVar[Optional[Feature]]): Reference to the Feature
            definition object. Set automatically by ``feature_record_class()``
            when the dynamic subclass is created. ``None`` on the base class.
            Provides access to the feature's column metadata, target table,
            and vocabulary/asset column sets.
    """

    # model_dump of this feature should be compatible with feature table columns.
    Execution: Optional[str] = None
    Feature_Name: str
    RCT: Optional[str] = None
    feature: ClassVar[Optional["Feature"]] = None

    class Config:
        arbitrary_types_allowed = True
        extra = "forbid"

    @staticmethod
    def select_newest(records: list["FeatureRecord"]) -> "FeatureRecord":
        """Select the feature record with the most recent creation time.

        Uses the RCT (Row Creation Time) field to determine recency. RCT is
        an ISO 8601 timestamp string, so lexicographic comparison correctly
        identifies the most recent record. Records with ``None`` RCT are
        treated as older than any timestamped record.

        This method is designed to be passed directly as the ``selector``
        argument to ``fetch_table_features`` or ``list_feature_values``::

            features = ml.fetch_table_features(
                "Image", selector=FeatureRecord.select_newest
            )

        Args:
            records: List of FeatureRecord instances for the same target
                object. Must be non-empty.

        Returns:
            The FeatureRecord with the latest RCT value.
        """
        return max(records, key=lambda r: r.RCT or "")

    @staticmethod
    def select_by_execution(execution_rid: str):
        """Return a selector that picks the newest record from a specific execution.

        Creates a selector function that filters records to those produced by
        the given execution, then returns the newest match by RCT. This is
        useful when multiple executions have produced values for the same
        feature and you want results from a specific run.

        Unlike ``select_by_workflow`` (which requires catalog access and lives
        on the DerivaML class), this selector works purely on the
        ``Execution`` field of each record and can be passed directly as the
        ``selector`` argument to ``fetch_table_features`` or
        ``list_feature_values``::

            features = ml.fetch_table_features(
                "Image",
                feature_name="FooBar",
                selector=FeatureRecord.select_by_execution("3WY2"),
            )

        Args:
            execution_rid: RID of the execution to filter by.

        Returns:
            A selector function ``(list[FeatureRecord]) -> FeatureRecord``
            suitable for use with ``fetch_table_features`` or
            ``list_feature_values``.

        Raises:
            DerivaMLException: If no records in the group match the
                given execution RID.

        Examples:
            Select values from a specific execution::

                >>> features = ml.fetch_table_features(
                ...     "Image",
                ...     feature_name="Classification",
                ...     selector=FeatureRecord.select_by_execution("3WY2"),
                ... )

            Use with list_feature_values::

                >>> values = ml.list_feature_values(
                ...     "Image", "Classification",
                ...     selector=FeatureRecord.select_by_execution("3WY2"),
                ... )
        """

        def _selector(records: list["FeatureRecord"]) -> "FeatureRecord":
            filtered = [r for r in records if r.Execution == execution_rid]
            if not filtered:
                from deriva_ml.core.exceptions import DerivaMLException

                raise DerivaMLException(
                    f"No feature records match execution '{execution_rid}'."
                )
            return FeatureRecord.select_newest(filtered)

        return _selector

    @staticmethod
    def select_first(records: list["FeatureRecord"]) -> "FeatureRecord":
        """Select the feature record with the earliest creation time.

        Uses the RCT (Row Creation Time) field. Records with ``None`` RCT
        are treated as older than any timestamped record (since empty string
        sorts before any ISO 8601 timestamp).

        Useful when you want to preserve the original annotation and ignore
        later revisions.

        This method is designed to be passed directly as the ``selector``
        argument to ``fetch_table_features`` or ``list_feature_values``::

            features = ml.fetch_table_features(
                "Image", selector=FeatureRecord.select_first
            )

        Args:
            records: List of FeatureRecord instances for the same target
                object. Must be non-empty.

        Returns:
            The FeatureRecord with the earliest RCT value.
        """
        return min(records, key=lambda r: r.RCT or "")

    @staticmethod
    def select_latest(records: list["FeatureRecord"]) -> "FeatureRecord":
        """Select the most recently created feature record.

        Alias for ``select_newest``. Included for API symmetry with
        ``select_first``.

        Args:
            records: List of FeatureRecord instances for the same target
                object. Must be non-empty.

        Returns:
            The FeatureRecord with the latest RCT value.
        """
        return FeatureRecord.select_newest(records)

    @classmethod
    def select_majority_vote(cls, column: str | None = None):
        """Return a selector that picks the most common value for a column.

        Creates a selector function that counts the values of the specified
        column across all records, picks the most frequent one, and breaks
        ties by most recent RCT.

        For single-term features, the column can be auto-detected from the
        feature's metadata. For multi-term features, the column must be
        specified explicitly.

        This is useful for consensus labeling, where multiple annotators
        have labeled the same record and you want the majority opinion::

            selector = RecordClass.select_majority_vote()
            features = ml.fetch_table_features(
                "Image",
                feature_name="Diagnosis",
                selector=selector,
            )

        Args:
            column: Name of the column to count values for. If None,
                auto-detects the first term column from feature metadata.

        Returns:
            A selector function ``(list[FeatureRecord]) -> FeatureRecord``.

        Raises:
            DerivaMLException: If column is None and the feature has no
                term columns or multiple term columns.
        """

        def _selector(records: list["FeatureRecord"]) -> "FeatureRecord":
            col = column
            if col is None:
                # Auto-detect from feature metadata on the record class
                record_cls = type(records[0])
                if (
                    hasattr(record_cls, "feature")
                    and record_cls.feature
                    and record_cls.feature.term_columns
                ):
                    if len(record_cls.feature.term_columns) == 1:
                        # term_columns is a set, so take its only element via an iterator.
                        col = next(iter(record_cls.feature.term_columns)).name
                    else:
                        from deriva_ml.core.exceptions import (
                            DerivaMLException,
                        )

                        raise DerivaMLException(
                            "select_majority_vote requires a column name for "
                            "features with multiple term columns. "
                            f"Available: {[c.name for c in record_cls.feature.term_columns]}"
                        )
                else:
                    from deriva_ml.core.exceptions import (
                        DerivaMLException,
                    )

                    raise DerivaMLException(
                        "select_majority_vote requires a column name — "
                        "could not auto-detect from feature metadata."
                    )

            from collections import Counter

            counts = Counter(getattr(r, col, None) for r in records)
            max_count = max(counts.values())
            majority_values = {v for v, c in counts.items() if c == max_count}
            candidates = [
                r for r in records if getattr(r, col, None) in majority_values
            ]
            return max(candidates, key=lambda r: r.RCT or "")

        return _selector

    @classmethod
    def feature_columns(cls) -> set[Column]:
        """Returns all columns specific to this feature.

        Returns:
            set[Column]: Set of feature-specific columns, excluding system and relationship columns.
        """
        return cls.feature.feature_columns

    @classmethod
    def asset_columns(cls) -> set[Column]:
        """Returns columns that reference asset tables.

        Returns:
            set[Column]: Set of columns that contain references to asset tables.
        """
        return cls.feature.asset_columns

    @classmethod
    def term_columns(cls) -> set[Column]:
        """Returns columns that reference vocabulary terms.

        Returns:
            set[Column]: Set of columns that contain references to controlled vocabulary terms.
        """
        return cls.feature.term_columns

    @classmethod
    def value_columns(cls) -> set[Column]:
        """Returns columns that contain direct values.

        Returns:
            set[Column]: Set of columns containing direct values (not references to assets or terms).
        """
        return cls.feature.value_columns
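
The counting and tie-breaking rule used by select_majority_vote in the listing above can be tried in isolation. This is a simplified sketch with a hypothetical `Record` dataclass and an explicit column name; it leaves out the auto-detection of the term column from feature metadata.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    """Hypothetical stand-in for a feature record with one term column."""

    Diagnosis: str
    RCT: Optional[str] = None


def majority_vote(records, column):
    """Return the newest record among those holding a most-frequent value."""
    counts = Counter(getattr(r, column, None) for r in records)
    top = max(counts.values())
    # Several values can share the top count; all of them stay in play.
    winners = {value for value, n in counts.items() if n == top}
    candidates = [r for r in records if getattr(r, column, None) in winners]
    # Ties are broken by recency, with a missing RCT sorting lowest.
    return max(candidates, key=lambda r: r.RCT or "")


votes = [
    Record("benign", "2024-01-01T00:00:00+00:00"),
    Record("malignant", "2024-01-02T00:00:00+00:00"),
    Record("benign", "2024-01-03T00:00:00+00:00"),
]
picked = majority_vote(votes, "Diagnosis")
tied = majority_vote(votes[:2], "Diagnosis")
print(picked.Diagnosis, tied.Diagnosis)
```

Note that in a tie the result is simply the newest record among the tied values, so the outcome depends on timestamps rather than on any ordering of the labels.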

asset_columns classmethod

asset_columns() -> set[Column]

Returns columns that reference asset tables.

Returns:

  • set[Column]: Set of columns that contain references to asset tables.

Source code in src/deriva_ml/feature.py
@classmethod
def asset_columns(cls) -> set[Column]:
    """Returns columns that reference asset tables.

    Returns:
        set[Column]: Set of columns that contain references to asset tables.
    """
    return cls.feature.asset_columns

feature_columns classmethod

feature_columns() -> set[Column]

Returns all columns specific to this feature.

Returns:

  • set[Column]: Set of feature-specific columns, excluding system and relationship columns.

Source code in src/deriva_ml/feature.py
@classmethod
def feature_columns(cls) -> set[Column]:
    """Returns all columns specific to this feature.

    Returns:
        set[Column]: Set of feature-specific columns, excluding system and relationship columns.
    """
    return cls.feature.feature_columns

select_by_execution staticmethod

select_by_execution(execution_rid: str)

Return a selector that picks the newest record from a specific execution.

Creates a selector function that filters records to those produced by the given execution, then returns the newest match by RCT. This is useful when multiple executions have produced values for the same feature and you want results from a specific run.

Unlike select_by_workflow (which requires catalog access and lives on the DerivaML class), this selector works purely on the Execution field of each record and can be passed directly as the selector argument to fetch_table_features or list_feature_values:

features = ml.fetch_table_features(
    "Image",
    feature_name="FooBar",
    selector=FeatureRecord.select_by_execution("3WY2"),
)

Parameters:

  • execution_rid (str, required): RID of the execution to filter by.

Returns:

  • A selector function (list[FeatureRecord]) -> FeatureRecord suitable for use with fetch_table_features or list_feature_values.

Raises:

  • DerivaMLException: If no records in the group match the given execution RID.

Examples:

Select values from a specific execution:

>>> features = ml.fetch_table_features(
...     "Image",
...     feature_name="Classification",
...     selector=FeatureRecord.select_by_execution("3WY2"),
... )

Use with list_feature_values:

>>> values = ml.list_feature_values(
...     "Image", "Classification",
...     selector=FeatureRecord.select_by_execution("3WY2"),
... )
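
The selector-factory pattern used here can be exercised without a catalog. In the sketch below, `Record` is a hypothetical stand-in for a feature record, the function is a free-standing copy of the idea rather than the library method, and `LookupError` replaces DerivaMLException so the example has no dependencies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    """Hypothetical stand-in for a feature record."""

    Execution: Optional[str]
    Label: str
    RCT: Optional[str] = None


def select_by_execution(execution_rid: str):
    """Build a selector bound to one execution RID."""

    def _selector(records):
        matches = [r for r in records if r.Execution == execution_rid]
        if not matches:
            # The real method raises DerivaMLException at this point.
            raise LookupError(f"No feature records match execution '{execution_rid}'.")
        # Among the matches, the newest record by RCT wins.
        return max(matches, key=lambda r: r.RCT or "")

    return _selector


records = [
    Record("3WY2", "cat", "2024-06-15T10:30:00+00:00"),
    Record("3WY2", "dog", "2024-06-16T10:30:00+00:00"),
    Record("4XZ3", "bird", "2024-06-17T10:30:00+00:00"),
]
selected = select_by_execution("3WY2")(records)
print(selected.Label)
```

The error is raised per target group, so a single target with no record from the requested execution is enough to trigger it.
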
Source code in src/deriva_ml/feature.py
@staticmethod
def select_by_execution(execution_rid: str):
    """Return a selector that picks the newest record from a specific execution.

    Creates a selector function that filters records to those produced by
    the given execution, then returns the newest match by RCT. This is
    useful when multiple executions have produced values for the same
    feature and you want results from a specific run.

    Unlike ``select_by_workflow`` (which requires catalog access and lives
    on the DerivaML class), this selector works purely on the
    ``Execution`` field of each record and can be passed directly as the
    ``selector`` argument to ``fetch_table_features`` or
    ``list_feature_values``::

        features = ml.fetch_table_features(
            "Image",
            feature_name="FooBar",
            selector=FeatureRecord.select_by_execution("3WY2"),
        )

    Args:
        execution_rid: RID of the execution to filter by.

    Returns:
        A selector function ``(list[FeatureRecord]) -> FeatureRecord``
        suitable for use with ``fetch_table_features`` or
        ``list_feature_values``.

    Raises:
        DerivaMLException: If no records in the group match the
            given execution RID.

    Examples:
        Select values from a specific execution::

            >>> features = ml.fetch_table_features(
            ...     "Image",
            ...     feature_name="Classification",
            ...     selector=FeatureRecord.select_by_execution("3WY2"),
            ... )

        Use with list_feature_values::

            >>> values = ml.list_feature_values(
            ...     "Image", "Classification",
            ...     selector=FeatureRecord.select_by_execution("3WY2"),
            ... )
    """

    def _selector(records: list["FeatureRecord"]) -> "FeatureRecord":
        filtered = [r for r in records if r.Execution == execution_rid]
        if not filtered:
            from deriva_ml.core.exceptions import DerivaMLException

            raise DerivaMLException(
                f"No feature records match execution '{execution_rid}'."
            )
        return FeatureRecord.select_newest(filtered)

    return _selector
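
The factory pattern above can be tried without a catalog. The following is a minimal sketch, not the library's implementation: `StubRecord` is a hypothetical stand-in for `FeatureRecord` that carries only the fields the selector reads, and `ValueError` replaces `DerivaMLException` so the example has no dependencies.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StubRecord:
    """Hypothetical stand-in for FeatureRecord (not part of deriva-ml)."""

    Execution: Optional[str]
    RCT: Optional[str]
    label: str


def select_by_execution(execution_rid: str) -> Callable[[list], StubRecord]:
    """Sketch of the factory: build a selector bound to one execution RID."""

    def _selector(records: list) -> StubRecord:
        # Keep only the records produced by the requested execution.
        filtered = [r for r in records if r.Execution == execution_rid]
        if not filtered:
            # The real code raises DerivaMLException here.
            raise ValueError(f"No feature records match execution '{execution_rid}'.")
        # Newest match wins; a missing RCT sorts before any timestamp.
        return max(filtered, key=lambda r: r.RCT or "")

    return _selector


records = [
    StubRecord("3WY2", "2024-01-01T00:00:00+00:00", "cat"),
    StubRecord("3WY2", "2024-03-01T00:00:00+00:00", "dog"),
    StubRecord("4XZ9", "2024-06-01T00:00:00+00:00", "bird"),
]

chosen = select_by_execution("3WY2")(records)
print(chosen.label)  # dog
```

The record from execution `4XZ9` is newer overall, but it is filtered out before the recency comparison runs.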

select_first staticmethod

select_first(
    records: list[FeatureRecord],
) -> FeatureRecord

Select the feature record with the earliest creation time.

Uses the RCT (Row Creation Time) field. Records with None RCT are treated as older than any timestamped record (since empty string sorts before any ISO 8601 timestamp), so a record without an RCT is returned ahead of every timestamped one.

Useful when you want to preserve the original annotation and ignore later revisions.

This method is designed to be passed directly as the selector argument to fetch_table_features or list_feature_values::

features = ml.fetch_table_features(
    "Image", selector=FeatureRecord.select_first
)

Parameters:

Name Type Description Default
records list[FeatureRecord]

List of FeatureRecord instances for the same target object. Must be non-empty.

required

Returns:

Type Description
FeatureRecord

The FeatureRecord with the earliest RCT value.

Source code in src/deriva_ml/feature.py
@staticmethod
def select_first(records: list["FeatureRecord"]) -> "FeatureRecord":
    """Select the feature record with the earliest creation time.

    Uses the RCT (Row Creation Time) field. Records with ``None`` RCT
    are treated as older than any timestamped record (since empty string
    sorts before any ISO 8601 timestamp), so a record without an RCT is
    returned ahead of every timestamped one.

    Useful when you want to preserve the original annotation and ignore
    later revisions.

    This method is designed to be passed directly as the ``selector``
    argument to ``fetch_table_features`` or ``list_feature_values``::

        features = ml.fetch_table_features(
            "Image", selector=FeatureRecord.select_first
        )

    Args:
        records: List of FeatureRecord instances for the same target
            object. Must be non-empty.

    Returns:
        The FeatureRecord with the earliest RCT value.
    """
    return min(records, key=lambda r: r.RCT or "")
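
The effect of the `r.RCT or ""` key is easiest to see on a small group that includes a record without a timestamp. This is a sketch under assumptions: `SimpleNamespace` objects stand in for `FeatureRecord` instances, and the `name` field and timestamps are made up for illustration.

```python
from types import SimpleNamespace

# Hypothetical records; SimpleNamespace stands in for FeatureRecord.
records = [
    SimpleNamespace(name="revised", RCT="2024-05-02T09:30:00+00:00"),
    SimpleNamespace(name="original", RCT="2024-05-01T09:30:00+00:00"),
    SimpleNamespace(name="untimed", RCT=None),
]

# Same key as select_first: None becomes "", which sorts before any timestamp.
first = min(records, key=lambda r: r.RCT or "")
print(first.name)  # untimed

# One way to get the earliest *timestamped* record is to filter first.
timed = [r for r in records if r.RCT is not None]
first_timed = min(timed, key=lambda r: r.RCT)
print(first_timed.name)  # original
```

If records without an RCT should never win, a custom selector that filters them out, as in the second half of the sketch, may be a better fit than `select_first`.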

select_latest staticmethod

select_latest(
    records: list[FeatureRecord],
) -> FeatureRecord

Select the most recently created feature record.

Alias for select_newest. Included for API symmetry with select_first.

Parameters:

Name Type Description Default
records list[FeatureRecord]

List of FeatureRecord instances for the same target object. Must be non-empty.

required

Returns:

Type Description
FeatureRecord

The FeatureRecord with the latest RCT value.

Source code in src/deriva_ml/feature.py
@staticmethod
def select_latest(records: list["FeatureRecord"]) -> "FeatureRecord":
    """Select the most recently created feature record.

    Alias for ``select_newest``. Included for API symmetry with
    ``select_first``.

    Args:
        records: List of FeatureRecord instances for the same target
            object. Must be non-empty.

    Returns:
        The FeatureRecord with the latest RCT value.
    """
    return FeatureRecord.select_newest(records)

select_majority_vote classmethod

select_majority_vote(
    column: str | None = None,
)

Return a selector that picks the most common value for a column.

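Creates a selector function that counts the values of the specified column across all records, finds the most frequent value, and returns the newest record (by RCT) carrying it. If several values tie for most frequent, the newest record among all tied values is returned.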

For single-term features, the column can be auto-detected from the feature's metadata. For multi-term features, the column must be specified explicitly.

This is useful for consensus labeling, where multiple annotators have labeled the same record and you want the majority opinion::

selector = RecordClass.select_majority_vote()
features = ml.fetch_table_features(
    "Image",
    feature_name="Diagnosis",
    selector=selector,
)

Parameters:

Name Type Description Default
column str | None

Name of the column to count values for. If None, uses the feature's only term column, taken from feature metadata.

None

Returns:

Type Description

A selector function (list[FeatureRecord]) -> FeatureRecord.

Raises:

Type Description
DerivaMLException

If column is None and the feature has no term columns or multiple term columns.

Source code in src/deriva_ml/feature.py
@classmethod
def select_majority_vote(cls, column: str | None = None):
    """Return a selector that picks the most common value for a column.

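    Creates a selector function that counts the values of the specified
    column across all records, finds the most frequent value, and returns
    the newest record (by RCT) carrying it. If several values tie for most
    frequent, the newest record among all tied values is returned.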

    For single-term features, the column can be auto-detected from the
    feature's metadata. For multi-term features, the column must be
    specified explicitly.

    This is useful for consensus labeling, where multiple annotators
    have labeled the same record and you want the majority opinion::

        selector = RecordClass.select_majority_vote()
        features = ml.fetch_table_features(
            "Image",
            feature_name="Diagnosis",
            selector=selector,
        )

    Args:
        column: Name of the column to count values for. If None, uses
            the feature's only term column, taken from feature metadata.

    Returns:
        A selector function ``(list[FeatureRecord]) -> FeatureRecord``.

    Raises:
        DerivaMLException: If column is None and the feature has no
            term columns or multiple term columns.
    """

    def _selector(records: list["FeatureRecord"]) -> "FeatureRecord":
        col = column
        if col is None:
            # Auto-detect from feature metadata on the record class
            record_cls = type(records[0])
            if (
                hasattr(record_cls, "feature")
                and record_cls.feature
                and record_cls.feature.term_columns
            ):
                if len(record_cls.feature.term_columns) == 1:
                    # term_columns is a set, so it cannot be indexed
                    col = next(iter(record_cls.feature.term_columns)).name
                else:
                    from deriva_ml.core.exceptions import (
                        DerivaMLException,
                    )

                    raise DerivaMLException(
                        "select_majority_vote requires a column name for "
                        "features with multiple term columns. "
                        f"Available: {[c.name for c in record_cls.feature.term_columns]}"
                    )
            else:
                from deriva_ml.core.exceptions import (
                    DerivaMLException,
                )

                raise DerivaMLException(
                    "select_majority_vote requires a column name — "
                    "could not auto-detect from feature metadata."
                )

        from collections import Counter

        counts = Counter(getattr(r, col, None) for r in records)
        max_count = max(counts.values())
        majority_values = {v for v, c in counts.items() if c == max_count}
        candidates = [
            r for r in records if getattr(r, col, None) in majority_values
        ]
        return max(candidates, key=lambda r: r.RCT or "")

    return _selector
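
The counting and tie-breaking steps can be exercised on their own. The sketch below copies the core of `_selector` into a hypothetical helper, `majority_vote`, with the column name passed explicitly; the `Diagnosis` values and timestamps are invented, and `SimpleNamespace` stands in for `FeatureRecord`.

```python
from collections import Counter
from types import SimpleNamespace

# Hypothetical annotations of one image by several annotators.
records = [
    SimpleNamespace(Diagnosis="benign", RCT="2024-01-01T00:00:00+00:00"),
    SimpleNamespace(Diagnosis="malignant", RCT="2024-01-02T00:00:00+00:00"),
    SimpleNamespace(Diagnosis="benign", RCT="2024-01-03T00:00:00+00:00"),
    SimpleNamespace(Diagnosis="malignant", RCT="2024-01-04T00:00:00+00:00"),
    SimpleNamespace(Diagnosis="unclear", RCT="2024-01-05T00:00:00+00:00"),
]


def majority_vote(records, col):
    """Hypothetical helper mirroring the counting logic shown above."""
    counts = Counter(getattr(r, col, None) for r in records)
    max_count = max(counts.values())
    # Every value that reaches the top count is a candidate (handles ties).
    majority_values = {v for v, c in counts.items() if c == max_count}
    candidates = [r for r in records if getattr(r, col, None) in majority_values]
    # Among the candidates, the newest record wins.
    return max(candidates, key=lambda r: r.RCT or "")


# Clear majority: "benign" appears twice among the first three records.
clear = majority_vote(records[:3], "Diagnosis")
print(clear.Diagnosis, clear.RCT)  # benign 2024-01-03T00:00:00+00:00

# Tie: "benign" and "malignant" both appear twice, so the newest of them wins.
tied = majority_vote(records, "Diagnosis")
print(tied.Diagnosis, tied.RCT)  # malignant 2024-01-04T00:00:00+00:00
```

The newest record overall (`unclear`) is never returned because its value does not reach the top count.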

select_newest staticmethod

select_newest(
    records: list[FeatureRecord],
) -> FeatureRecord

Select the feature record with the most recent creation time.

Uses the RCT (Row Creation Time) field to determine recency. RCT is an ISO 8601 timestamp string, so lexicographic comparison identifies the most recent record as long as all values share the same format and UTC offset. Records with None RCT are treated as older than any timestamped record.

This method is designed to be passed directly as the selector argument to fetch_table_features or list_feature_values::

features = ml.fetch_table_features(
    "Image", selector=FeatureRecord.select_newest
)

Parameters:

Name Type Description Default
records list[FeatureRecord]

List of FeatureRecord instances for the same target object. Must be non-empty.

required

Returns:

Type Description
FeatureRecord

The FeatureRecord with the latest RCT value.

Source code in src/deriva_ml/feature.py
@staticmethod
def select_newest(records: list["FeatureRecord"]) -> "FeatureRecord":
    """Select the feature record with the most recent creation time.

    Uses the RCT (Row Creation Time) field to determine recency. RCT is
    an ISO 8601 timestamp string, so lexicographic comparison identifies
    the most recent record as long as all values share the same format
    and UTC offset. Records with ``None`` RCT are treated as older than
    any timestamped record.

    This method is designed to be passed directly as the ``selector``
    argument to ``fetch_table_features`` or ``list_feature_values``::

        features = ml.fetch_table_features(
            "Image", selector=FeatureRecord.select_newest
        )

    Args:
        records: List of FeatureRecord instances for the same target
            object. Must be non-empty.

    Returns:
        The FeatureRecord with the latest RCT value.
    """
    return max(records, key=lambda r: r.RCT or "")
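
Comparing ISO 8601 strings directly works because the fields run from most to least significant. The sketch below checks this against parsed `datetime` values and shows the one case where the two can disagree. The timestamps are invented, and whether catalog RCT values ever differ in UTC offset is not stated here, so the mixed-offset case is only an illustration.

```python
from datetime import datetime

# Hypothetical RCT values that all use the same format and UTC offset.
stamps = [
    "2024-03-09T23:59:59+00:00",
    "2024-03-10T00:00:00+00:00",
    "2023-12-31T12:00:00+00:00",
]

# String order and chronological order agree for uniform timestamps.
newest_by_string = max(stamps)
newest_by_datetime = max(stamps, key=datetime.fromisoformat)
print(newest_by_string == newest_by_datetime)  # True

# With mixed offsets the string comparison can pick the wrong record:
# 01:00 at +02:00 is 23:00 UTC on the previous day, earlier than 00:30 UTC.
mixed = ["2024-03-10T01:00:00+02:00", "2024-03-10T00:30:00+00:00"]
print(max(mixed))                              # 2024-03-10T01:00:00+02:00
print(max(mixed, key=datetime.fromisoformat))  # 2024-03-10T00:30:00+00:00
```

If timestamps from different sources might use different offsets, a custom selector that parses RCT before comparing is the safer choice.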

term_columns classmethod

term_columns() -> set[Column]

Returns columns that reference vocabulary terms.

Returns:

Type Description
set[Column]

Set of columns that contain references to controlled vocabulary terms.

Source code in src/deriva_ml/feature.py
@classmethod
def term_columns(cls) -> set[Column]:
    """Returns columns that reference vocabulary terms.

    Returns:
        set[Column]: Set of columns that contain references to controlled vocabulary terms.
    """
    return cls.feature.term_columns

value_columns classmethod

value_columns() -> set[Column]

Returns columns that contain direct values.

Returns:

Type Description
set[Column]

Set of columns containing direct values (not references to assets or terms).

Source code in src/deriva_ml/feature.py
@classmethod
def value_columns(cls) -> set[Column]:
    """Returns columns that contain direct values.

    Returns:
        set[Column]: Set of columns containing direct values (not references to assets or terms).
    """
    return cls.feature.value_columns