Feature Classes
Feature management for ML experiments. Features represent measurable properties or characteristics that can be attached to domain entities and tracked across executions.
Feature implementation for deriva-ml.
This module provides classes for defining and managing features in deriva-ml. Features represent measurable properties or characteristics that can be associated with records in a table. The module includes:
- Feature: Main class for defining and managing features
- FeatureRecord: Base class for feature records using pydantic models
Typical usage example
    feature = Feature(association_result, model)
    FeatureClass = feature.feature_record_class()
    record = FeatureClass(value="high", confidence=0.95)
Feature
Manages feature definitions and their relationships in the catalog.
A Feature represents a measurable property or characteristic that can be associated with records in a table. Features can include asset references, controlled vocabulary terms, and custom metadata fields.
Attributes:
| Name | Type | Description |
|---|---|---|
| feature_table | Table | Table containing the feature implementation. |
| target_table | Table | Table that the feature is associated with. |
| feature_name | str | Name of the feature (from Feature_Name column default). |
| feature_columns | set[Column] | All columns specific to this feature. |
| asset_columns | set[Column] | Columns referencing asset tables. |
| term_columns | set[Column] | Columns referencing vocabulary tables. |
| value_columns | set[Column] | Columns containing direct values (not FK references). |
Example
    feature = Feature(association_result, model)
    print(f"Feature {feature.feature_name} on {feature.target_table.name}")
    print("Asset columns:", [c.name for c in feature.asset_columns])
Source code in src/deriva_ml/feature.py
feature_record_class
feature_record_class() -> type[FeatureRecord]
Create a dynamically generated Pydantic model class for this feature.
The returned class is a subclass of FeatureRecord with fields derived from the feature table's columns. Term columns accept vocabulary term names (str), asset columns accept file paths (str | Path), and value columns are typed according to their database column type (int, float, str).
Returns:
| Type | Description |
|---|---|
| type[FeatureRecord] | A FeatureRecord subclass with validated fields matching this feature's schema. |
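The type mapping described above can be illustrated with a small sketch. This is not the deriva-ml implementation: it swaps in the standard library's `dataclasses.make_dataclass` for Pydantic so it runs without extra dependencies, which also means it performs no validation. The column list, the database type names, and the helper names `python_type` and `build_record_class` are all hypothetical.

```python
from dataclasses import field, fields, make_dataclass
from pathlib import Path
from typing import Optional, Union

# Hypothetical column descriptions: (name, role, database type).
COLUMNS = [
    ("Quality", "term", "text"),
    ("Mask", "asset", "text"),
    ("Confidence", "value", "float8"),
]

# Assumed mapping from database column types to Python types.
VALUE_TYPES = {"int4": int, "int8": int, "float4": float, "float8": float, "text": str}


def python_type(role: str, db_type: str) -> object:
    """Pick a Python type for a column based on its role in the feature."""
    if role == "term":
        return str  # vocabulary term name
    if role == "asset":
        return Union[str, Path]  # file path
    return VALUE_TYPES.get(db_type, str)  # direct value; fall back to str


def build_record_class(feature_name: str, columns) -> type:
    """Create a record class with one optional field per feature column."""
    specs = [
        (name, Optional[python_type(role, db_type)], field(default=None))
        for name, role, db_type in columns
    ]
    return make_dataclass(f"{feature_name}Record", specs)


QualityRecord = build_record_class("Quality", COLUMNS)
record = QualityRecord(Quality="high", Confidence=0.95)
print([f.name for f in fields(QualityRecord)])
```

Generating the class at runtime lets the field set follow whatever columns the feature table defines, rather than requiring a hand-written model per feature.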
FeatureRecord
Bases: BaseModel
Base class for dynamically generated feature record models.
This class serves as the base for pydantic models that represent feature
records. Each feature record contains the values and metadata associated
with a feature instance. Subclasses are created dynamically by
Feature.feature_record_class() with fields corresponding to the
feature's vocabulary terms, asset references, and metadata columns.
Feature records are returned by list_feature_values() and
fetch_table_features(). They can also be constructed manually and
passed to Execution.add_features() to insert new values into the
catalog.
Handling multiple values per target:
When the same target object (e.g., an Image) has multiple feature values
— for example, labels from different annotators or model runs — use
a selector function to choose one. Pass it to fetch_table_features
or list_feature_values. A selector receives a list of FeatureRecord
instances for the same target and returns the selected one::
    # Built-in: pick the most recently created record
    features = ml.fetch_table_features(
        "Image", selector=FeatureRecord.select_newest
    )

    # Custom: pick the record with highest confidence
    def select_best(records):
        return max(records, key=lambda r: getattr(r, "Confidence", 0))

    features = ml.fetch_table_features("Image", selector=select_best)
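To make the selector contract concrete, here is a minimal sketch of how a caller might group records by target and apply a selector to each group. It is an illustration only, not how fetch_table_features is implemented. The `Record` dataclass stands in for a FeatureRecord subclass, and the `Image`, `Label`, and `Confidence` fields and the `apply_selector` helper are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Record:
    """Stand-in for a FeatureRecord subclass; field names are illustrative."""

    Image: str  # RID of the target object (hypothetical column)
    Label: str
    Confidence: float = 0.0
    RCT: Optional[str] = None


def apply_selector(records, selector: Callable[[list], Record]) -> dict:
    """Group records by target, then keep the one record the selector picks."""
    groups = defaultdict(list)
    for r in records:
        groups[r.Image].append(r)
    return {target: selector(group) for target, group in groups.items()}


def select_best(records):
    """Custom selector: the record with the highest confidence."""
    return max(records, key=lambda r: getattr(r, "Confidence", 0))


records = [
    Record("IMG-1", "cat", 0.60, "2024-01-01T00:00:00+00:00"),
    Record("IMG-1", "dog", 0.90, "2024-01-02T00:00:00+00:00"),
    Record("IMG-2", "cat", 0.75, "2024-01-03T00:00:00+00:00"),
]
chosen = apply_selector(records, select_best)
print({target: r.Label for target, r in chosen.items()})
```

Because a selector only sees the list of records for one target, any function with that shape can be swapped in without changing the grouping logic.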
Attributes:
| Name | Type | Description |
|---|---|---|
| Execution | Optional[str] | RID of the execution that created this feature record. |
| Feature_Name | str | Name of the feature this record belongs to. |
| RCT | Optional[str] | Row Creation Time: an ISO 8601 timestamp string. |
| feature | ClassVar[Optional[Feature]] | Reference to the Feature definition object. |
asset_columns
classmethod
asset_columns() -> set[Column]
Returns columns that reference asset tables.
Returns:
| Type | Description |
|---|---|
| set[Column] | Set of columns that contain references to asset tables. |
feature_columns
classmethod
feature_columns() -> set[Column]
Returns all columns specific to this feature.
Returns:
| Type | Description |
|---|---|
| set[Column] | Set of feature-specific columns, excluding system and relationship columns. |
select_by_execution
staticmethod
select_by_execution(execution_rid: str)
Return a selector that picks the newest record from a specific execution.
Creates a selector function that filters records to those produced by the given execution, then returns the newest match by RCT. This is useful when multiple executions have produced values for the same feature and you want results from a specific run.
Unlike select_by_workflow (which requires catalog access and lives
on the DerivaML class), this selector works purely on the
Execution field of each record and can be passed directly as the
selector argument to fetch_table_features or
list_feature_values::
    features = ml.fetch_table_features(
        "Image",
        feature_name="FooBar",
        selector=FeatureRecord.select_by_execution("3WY2"),
    )
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| execution_rid | str | RID of the execution to filter by. | required |
Returns:
| Type | Description |
|---|---|
| | A selector function suitable for use as the selector argument. |
Raises:
| Type | Description |
|---|---|
| DerivaMLException | If no records in the group match the given execution RID. |
Examples:
Select values from a specific execution::
    >>> features = ml.fetch_table_features(
    ...     "Image",
    ...     feature_name="Classification",
    ...     selector=FeatureRecord.select_by_execution("3WY2"),
    ... )

Use with list_feature_values::

    >>> values = ml.list_feature_values(
    ...     "Image", "Classification",
    ...     selector=FeatureRecord.select_by_execution("3WY2"),
    ... )
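The filter-then-pick-newest behavior described for select_by_execution can be sketched as a closure. This is an assumption-laden illustration rather than the library's code: `Record` is a stand-in dataclass, `NoMatchError` stands in for DerivaMLException, and the RIDs are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    """Stand-in for a FeatureRecord; only the fields the selector needs."""

    Label: str
    Execution: Optional[str] = None
    RCT: Optional[str] = None


class NoMatchError(Exception):
    """Stand-in for DerivaMLException in this sketch."""


def select_by_execution(execution_rid: str):
    """Return a selector bound to one execution RID."""

    def selector(records):
        # Keep only records produced by the requested execution.
        matches = [r for r in records if r.Execution == execution_rid]
        if not matches:
            raise NoMatchError(f"No records from execution {execution_rid}")
        # Newest match by RCT; a missing RCT compares as the empty string.
        return max(matches, key=lambda r: r.RCT or "")

    return selector


group = [
    Record("a", Execution="3WY2", RCT="2024-01-01T00:00:00+00:00"),
    Record("b", Execution="3WY2", RCT="2024-02-01T00:00:00+00:00"),
    Record("c", Execution="4XZ3", RCT="2024-03-01T00:00:00+00:00"),
]
picked = select_by_execution("3WY2")(group)
print(picked.Label)
```

Returning a closure is what lets the method take a parameter and still produce something with the plain selector shape of one list in, one record out.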
select_first
staticmethod
select_first(
records: list[FeatureRecord],
) -> FeatureRecord
Select the feature record with the earliest creation time.
Uses the RCT (Row Creation Time) field. Records whose RCT is None
are treated as older than any timestamped record, because a missing RCT
is compared as an empty string, which sorts before any ISO 8601 timestamp.
Useful when you want to preserve the original annotation and ignore later revisions.
This method is designed to be passed directly as the selector
argument to fetch_table_features or list_feature_values::
    features = ml.fetch_table_features(
        "Image", selector=FeatureRecord.select_first
    )
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| records | list[FeatureRecord] | List of FeatureRecord instances for the same target object. Must be non-empty. | required |
Returns:
| Type | Description |
|---|---|
| FeatureRecord | The FeatureRecord with the earliest RCT value. |
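The treatment of a missing RCT can be seen in a short sketch. It assumes, based on the description above, that None is compared as an empty string. The `Record` dataclass and this standalone `select_first` function are stand-ins, not the library code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    """Stand-in for a FeatureRecord with just a label and a creation time."""

    Label: str
    RCT: Optional[str] = None


def select_first(records):
    """Earliest record by RCT; a missing RCT compares as the empty string."""
    return min(records, key=lambda r: r.RCT or "")


group = [
    Record("revised", "2024-03-01T12:00:00+00:00"),
    Record("original", "2024-01-15T08:30:00+00:00"),
    Record("legacy", None),
]
print(select_first(group).Label)      # the record without a timestamp
print(select_first(group[:2]).Label)  # earliest of the timestamped records
```

One consequence worth keeping in mind is that a record with no RCT always wins under this rule, so it is selected as the "first" record even if it was actually written later.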
select_latest
staticmethod
select_latest(
records: list[FeatureRecord],
) -> FeatureRecord
Select the most recently created feature record.
Alias for select_newest. Included for API symmetry with
select_first.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| records | list[FeatureRecord] | List of FeatureRecord instances for the same target object. Must be non-empty. | required |
Returns:
| Type | Description |
|---|---|
| FeatureRecord | The FeatureRecord with the latest RCT value. |
select_majority_vote
classmethod
select_majority_vote(
column: str | None = None,
)
Return a selector that picks the most common value for a column.
Creates a selector function that counts the values of the specified column across all records, picks the most frequent one, and breaks ties by most recent RCT.
For single-term features, the column can be auto-detected from the feature's metadata. For multi-term features, the column must be specified explicitly.
This is useful for consensus labeling, where multiple annotators have labeled the same record and you want the majority opinion::
    selector = RecordClass.select_majority_vote()
    features = ml.fetch_table_features(
        "Image",
        feature_name="Diagnosis",
        selector=selector,
    )
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| column | str \| None | Name of the column to count values for. If None, the term column is auto-detected from feature metadata, which requires the feature to have exactly one term column. | None |
Returns:
| Type | Description |
|---|---|
| | A selector function |
Raises:
| Type | Description |
|---|---|
| DerivaMLException | If column is None and the feature has no term columns or multiple term columns. |
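The counting and tie-breaking described above can be sketched with `collections.Counter`. This shows one plausible reading of the rule, not the library's implementation: among the records whose value is tied for the highest count, the most recent by RCT is returned. The `Record` dataclass and the `Diagnosis` column are hypothetical, and column auto-detection is left out, so `column` must always be given.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    """Stand-in for a FeatureRecord with one term column."""

    Diagnosis: str
    RCT: Optional[str] = None


def select_majority_vote(column: str):
    """Return a selector that picks a record carrying the most common value."""

    def selector(records):
        counts = Counter(getattr(r, column) for r in records)
        top = max(counts.values())
        # Every value that reaches the highest count is a potential winner.
        winners = {value for value, n in counts.items() if n == top}
        candidates = [r for r in records if getattr(r, column) in winners]
        # Break ties by most recent RCT; a missing RCT compares as "".
        return max(candidates, key=lambda r: r.RCT or "")

    return selector


votes = [
    Record("benign", "2024-01-01T00:00:00+00:00"),
    Record("malignant", "2024-01-02T00:00:00+00:00"),
    Record("benign", "2024-01-03T00:00:00+00:00"),
]
majority = select_majority_vote("Diagnosis")(votes)
print(majority.Diagnosis, majority.RCT)
```

With an even split, for example one "benign" and one "malignant" record, this sketch falls back entirely to recency and returns whichever record was created last.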
select_newest
staticmethod
select_newest(
records: list[FeatureRecord],
) -> FeatureRecord
Select the feature record with the most recent creation time.
Uses the RCT (Row Creation Time) field to determine recency. RCT is
an ISO 8601 timestamp string, so lexicographic comparison correctly
identifies the most recent record. Records with None RCT are
treated as older than any timestamped record.
This method is designed to be passed directly as the selector
argument to fetch_table_features or list_feature_values::
    features = ml.fetch_table_features(
        "Image", selector=FeatureRecord.select_newest
    )
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| records | list[FeatureRecord] | List of FeatureRecord instances for the same target object. Must be non-empty. | required |
Returns:
| Type | Description |
|---|---|
| FeatureRecord | The FeatureRecord with the latest RCT value. |
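The claim that string comparison orders ISO 8601 timestamps correctly can be checked with a few lines of standard-library Python. The sketch assumes all timestamps share the same format and UTC offset, which is when lexicographic and chronological order agree. The sample values are made up.

```python
from datetime import datetime

# Hypothetical RCT values, all in the same format and UTC offset.
stamps = [
    "2024-01-15T09:30:00+00:00",
    "2023-12-31T23:59:59+00:00",
    "2024-01-15T10:00:00+00:00",
]

# Plain string comparison.
newest_by_string = max(stamps)

# Comparison after parsing into timezone-aware datetimes.
newest_by_datetime = max(stamps, key=datetime.fromisoformat)

print(newest_by_string == newest_by_datetime)

# A missing RCT compared as "" sorts before every timestamp.
with_missing = stamps + [None]
oldest = min(with_missing, key=lambda s: s or "")
print(oldest)
```

If timestamps could carry different offsets or precisions, parsing them before comparing would be the safer choice.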
term_columns
classmethod
term_columns() -> set[Column]
Returns columns that reference vocabulary terms.
Returns:
| Type | Description |
|---|---|
| set[Column] | Set of columns that contain references to controlled vocabulary terms. |
value_columns
classmethod
value_columns() -> set[Column]
Returns columns that contain direct values.
Returns:
| Type | Description |
|---|---|
| set[Column] | Set of columns containing direct values (not references to assets or terms). |