Chapter 3: Defining and using features
Features are structured, provenance-linked annotations on existing catalog records. By the end of this chapter you will know how to define features and vocabularies, record values inside an execution, read them back using the three-method S2 surface, reduce multi-annotator data with selectors, and work with asset-based features — both online and from a downloaded bag. You will also understand when to choose each reading method and how selectors differ in their error-handling contracts.
What is a feature?
A feature associates metadata values with records in your domain tables. The key distinction from a plain table column is provenance: every feature value records the execution that produced it, which workflow ran that execution, and when.
Four properties distinguish features from columns:
- Provenance. Every value is linked to an execution, so you always know which model or annotator produced it.
- Controlled vocabularies. Categorical values come from a vocabulary table rather than free text, enforcing consistency across runs.
- Multiple values. The same record can carry several values for the same feature — one per annotator, one per model version, or one per time window.
- Versioning. Feature values travel with the dataset version they were captured under, so an older dataset snapshot returns the labels that existed at that time.
A feature is backed by an association table created automatically by create_feature. The table has columns for the target record's RID, the feature name, the execution RID, and one column per vocabulary term or asset reference you declared.
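To make the backing table concrete, one term-based feature value can be pictured as a single row in that association table. This is an illustrative sketch in plain Python; the exact column names `create_feature` generates may differ:

```python
# Illustrative sketch only: the real association table lives in the catalog
# and is created by create_feature; column names here are assumptions.
feature_row = {
    "Image": "1-ABC1",           # RID of the target record
    "Feature_Name": "Diagnosis",
    "Execution": "3-WY2A",       # provenance: which run produced this value
    "Diagnosis_Type": "Normal",  # one column per declared vocabulary term
}

# The same target can appear in many rows, one per execution or annotator.
another_row = {**feature_row, "Execution": "3-WY2B", "Diagnosis_Type": "Mild"}
```

This row shape is what makes the provenance query ("which execution produced this value?") a simple column lookup.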
Common use cases
| Use case | Example | Type |
|---|---|---|
| Ground truth labels | Normal / Abnormal classification | Term-based |
| Model predictions | Classifier inference results | Term-based |
| Quality scores | Image quality rating 1–5 | Term-based |
| Derived measurements | Computed signal-to-noise ratio | Value-based (metadata column) |
| Related assets | Segmentation masks, embeddings | Asset-based |
How to decide between a feature and a column
Add a column to your domain table when the value is a single, permanent, schema-level property that does not change across runs and does not need provenance (for example, a patient's date of birth, or the pixel dimensions of an image).
Use a feature when any of these is true:
- The value is produced by a workflow (model, annotator, analysis script).
- Multiple sources or versions may produce different values for the same record.
- You want to query "which execution produced this value?" or compare values across runs.
- The value depends on a controlled vocabulary.
- The value is a derived asset (mask, embedding) that needs to be tracked as a file in the object store.
A good rule of thumb: if you will ever write SELECT * FROM ... WHERE execution = ?, use a feature.
How to create a vocabulary
Term-based features require a vocabulary table — a catalog-managed list of terms. Create the vocabulary before the feature that uses it.
```python
ml.create_vocabulary("Diagnosis_Type", "Clinical diagnosis categories")

ml.add_term("Diagnosis_Type", "Normal", "No pathology detected")
ml.add_term("Diagnosis_Type", "Mild", "Mild pathology")
ml.add_term("Diagnosis_Type", "Severe", "Severe pathology")
```
`create_vocabulary` creates the table; `add_term` adds rows to it. Both operations are idempotent by default (`exists_ok=True` on `add_term`), so re-running setup code is safe.
Notes
- `add_term` signature: `add_term(table, term_name, description, synonyms=None, exists_ok=True)`.
- `create_vocabulary` signature: `create_vocabulary(vocab_name, comment="", schema=None, update_navbar=True)`.
- The same vocabulary can be used by multiple features and across target tables.
- Feature names are themselves vocabulary terms (in the `Feature_Name` table). `create_feature` adds the name automatically.
How to create a feature
`create_feature` returns the `FeatureRecord` subclass you will use to construct values. Calling it is a schema-modifying operation — it creates the backing association table — so it belongs in setup code, not inside the execution loop.
Term-based feature
```python
DiagnosisRecord = ml.create_feature(
    target_table="Image",
    feature_name="Diagnosis",
    terms=["Diagnosis_Type"],
    comment="Clinical diagnosis label for this image",
)
```
Feature with metadata columns
```python
from deriva_ml import ColumnDefinition, BuiltinTypes

DiagnosisRecord = ml.create_feature(
    target_table="Image",
    feature_name="Diagnosis",
    terms=["Diagnosis_Type"],
    metadata=[ColumnDefinition(name="Confidence", type=BuiltinTypes.float4)],
    optional=["Confidence"],
    comment="Diagnosis with optional confidence score",
)
```
Asset-based feature
```python
ml.create_asset("Segmentation_Mask", comment="Binary segmentation mask images")

SegmentationRecord = ml.create_feature(
    target_table="Image",
    feature_name="Segmentation",
    assets=["Segmentation_Mask"],
    comment="Segmentation mask for this image",
)
```
The returned SegmentationRecord class accepts either a pathlib.Path (file to upload) or a catalog RID for the asset column. During upload the library replaces file paths with the RIDs of the uploaded assets automatically.
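The path-to-RID substitution can be pictured in plain Python. This is a sketch of the contract, not the library's internal code; the `uploaded` mapping and helper name are hypothetical:

```python
from pathlib import Path

def resolve_asset_columns(record: dict, uploaded: dict) -> dict:
    """Replace any Path values with the RID assigned to that file on upload."""
    return {k: uploaded.get(v, v) if isinstance(v, Path) else v
            for k, v in record.items()}

# Hypothetical upload result: local file -> catalog-assigned asset RID
uploaded = {Path("masks/mask_1-ABC1.png"): "4-XY9Q"}
record = {"Image": "1-ABC1", "Segmentation_Mask": Path("masks/mask_1-ABC1.png")}
resolved = resolve_asset_columns(record, uploaded)
# resolved["Segmentation_Mask"] is now the RID string "4-XY9Q"
```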
Mixed feature
Pass both terms and assets to create a feature whose values include a vocabulary term and an associated asset in a single record.
Notes
- `create_feature` signature: `create_feature(target_table, feature_name, terms=None, assets=None, metadata=None, optional=None, comment="", update_navbar=True)`.
- All columns in `metadata` are required by default. Add their names to `optional` to allow `None`.
- When creating many features in a batch, pass `update_navbar=False` to each call and run `ml.apply_catalog_annotations()` once at the end.
- You can retrieve the record class later without re-creating the feature: `ml.feature_record_class("Image", "Diagnosis")`.
How to record feature values
Feature values must be created inside an execution context. This is a hard requirement — there is no bypass. The execution provides the provenance link that makes feature values meaningful.
The write path is: get the record class → create instances → call `exe.add_features(records)` inside the `with` block.
```python
from deriva_ml.execution import ExecutionConfiguration
from deriva_ml.dataset import DatasetSpec

config = ExecutionConfiguration(
    workflow=ml.create_workflow("Labeling", "Annotation"),
    datasets=[DatasetSpec(rid=dataset_rid)],
    description="Human annotation pass 1",
)

# Retrieve the record class (or use the return value from create_feature)
DiagnosisRecord = ml.feature_record_class("Image", "Diagnosis")

with ml.create_execution(config) as exe:
    bag = exe.download_dataset_bag(DatasetSpec(rid=dataset_rid))
    records = []
    for image in bag.list_dataset_members()["Image"]:
        records.append(DiagnosisRecord(
            Image=image["RID"],       # target record RID
            Diagnosis_Type="Normal",  # vocabulary term name
        ))
    exe.add_features(records)       # staged, not yet in the catalog
    exe.upload_execution_outputs()  # flush staged features to the catalog
```
`exe.add_features` stages records to a local SQLite table with status `Pending`. They are not visible in the catalog until `upload_execution_outputs()` runs. The flush happens after asset upload, so feature rows that reference assets are guaranteed to find those asset rows already present.
Asset-based feature values
For asset features, provide a pathlib.Path where you would normally provide a term name. The upload step resolves the path to a catalog RID.
```python
SegmentationRecord = ml.feature_record_class("Image", "Segmentation")

with ml.create_execution(config) as exe:
    bag = exe.download_dataset_bag(DatasetSpec(rid=dataset_rid))
    records = []
    for image in bag.list_dataset_members()["Image"]:
        mask_path = exe.asset_file_path(
            "Segmentation_Mask",
            f"mask_{image['RID']}.png",
        )
        generate_mask(image, output_path=mask_path)  # your model writes here
        records.append(SegmentationRecord(
            Image=image["RID"],
            Segmentation_Mask=mask_path,  # Path object, replaced with RID on upload
        ))
    exe.add_features(records)
    exe.upload_execution_outputs()
```
Notes
- `exe.add_features(features)` signature: takes `list[FeatureRecord]`, returns the count staged.
- All records in a single `add_features` call must share one feature definition. Pass records for different features in separate calls.
- `Execution` is auto-filled if unset. You do not need to set it on each record.
- Staged records survive a crash: a resumed execution re-flushes `Pending` rows. See Chapter 4 on `resume_execution`.
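Because one `add_features` call accepts only one feature definition, a mixed batch needs to be split first. A plain-Python grouping helper might look like this (the record class names are hypothetical stand-ins):

```python
from collections import defaultdict

def group_by_feature(records):
    """Bucket records by their concrete record class, one bucket per
    feature, so each bucket can go to its own add_features call."""
    groups = defaultdict(list)
    for rec in records:
        groups[type(rec)].append(rec)
    return groups

class DiagnosisRec: ...      # stand-ins for two FeatureRecord subclasses
class SegmentationRec: ...

mixed = [DiagnosisRec(), SegmentationRec(), DiagnosisRec()]
groups = group_by_feature(mixed)
# groups[DiagnosisRec] holds 2 records, groups[SegmentationRec] holds 1
```

Each bucket can then be passed to its own `exe.add_features(batch)` call.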
How to read feature values
The S2 surface provides three methods, consistent across DerivaML, Dataset, and DatasetBag.
| Method | Returns | Purpose |
|---|---|---|
| `find_features(table)` | `list[Feature]` | Discover what features exist on a table |
| `feature_values(table, feature_name, selector=...)` | `Iterable[FeatureRecord]` | Read typed values, optionally selector-reduced |
| `lookup_feature(table, feature_name)` | `Feature` | Inspect a feature's structure, get its record class |
find_features — discovery
```python
# Features defined on a specific table
for f in ml.find_features("Image"):
    print(f.feature_name)

# All features in the catalog
for f in ml.find_features():
    print(f"{f.target_table.name}.{f.feature_name}")
```
feature_values — read typed records
```python
from deriva_ml.feature import FeatureRecord

# All values, unfiltered (multiple records per image are possible)
for rec in ml.feature_values("Image", "Diagnosis"):
    print(f"Image {rec.Image}: {rec.Diagnosis_Type} (execution {rec.Execution})")
```
Each record is a typed Pydantic model. Attributes match the feature's columns: the target RID (`rec.Image`), term columns (`rec.Diagnosis_Type`), asset columns, any metadata columns, plus provenance (`rec.Execution`, `rec.RCT`).
To convert to a pandas DataFrame:
```python
import pandas as pd

df = pd.DataFrame(r.model_dump() for r in ml.feature_values("Image", "Diagnosis"))
```
!!! note
    `feature_values` fetches all rows for the feature from the catalog before yielding the first record. It is iterator-shaped for composability, not for streaming very large tables. For tables with millions of rows, apply a selector to reduce the output before collecting.
lookup_feature — structure inspection
```python
feature = ml.lookup_feature("Image", "Diagnosis")

print(f"Term columns: {[c.name for c in feature.term_columns]}")
print(f"Asset columns: {[c.name for c in feature.asset_columns]}")
print(f"Value columns: {[c.name for c in feature.value_columns]}")

# Get the record class without calling create_feature
DiagnosisRecord = feature.feature_record_class()
```
Notes
find_features(table=None)returns alist[Feature]. Omittableto discover features across all domain tables.feature_valuesreturns an iterator but materializes all rows in memory before yielding the first record — it is not a true streaming cursor. Apply aselectorto reduce memory usage when the feature table is large.lookup_featuredoes not fetch values; it returns the schema-levelFeatureobject and itsfeature_record_class(). Use it when you need column metadata or want a typed record class without runningcreate_featureagain.- All three methods work identically on
DerivaML(live catalog),Dataset(catalog-pinned to a snapshot), andDatasetBag(offline local bag). The only difference is scope:DatasetandDatasetBagare limited to data included in that version's bag.
How to use selectors to reduce multi-annotator data
When the same record has multiple feature values — from different annotators, model runs, or time windows — a selector collapses the group to a single value per target RID.
A selector is any callable: `(list[FeatureRecord]) -> FeatureRecord | None`. Pass it as `selector=` to `feature_values`. A selector that returns `None` causes that target RID to be omitted from the output.
select_newest — most recent by creation time
The most common selector. Picks the record with the latest RCT (Row Creation Time).
```python
newest = list(ml.feature_values(
    "Image", "Diagnosis",
    selector=FeatureRecord.select_newest,
))
```
select_first — earliest creation time
Preserves the original annotation and ignores later revisions.
```python
originals = list(ml.feature_values(
    "Image", "Diagnosis",
    selector=FeatureRecord.select_first,
))
```
(`select_latest` has the same behavior as `select_newest`.)
select_by_execution — specific execution
Filters to records produced by one known execution RID. Raises if no record in a group matches.
```python
from_run = list(ml.feature_values(
    "Image", "Diagnosis",
    selector=FeatureRecord.select_by_execution("3-WY2A"),
))
```
!!! note
    `select_by_execution` raises `DerivaMLException` when no record in a group matches the given execution RID; `select_by_workflow` returns `None` and silently omits that target RID from the iterator. Use `select_by_workflow` when "no match" is a valid state (some targets may not have been processed by that workflow); use `select_by_execution` when a missing match is a bug.
select_by_workflow — workflow-scoped factory
`FeatureRecord.select_by_workflow` is a classmethod factory — it takes the workflow name and a container, resolves the execution list eagerly at construction time, and returns a selector closure. Pass the result as `selector=`.
```python
# Keep only values from any execution of the "Manual_Annotation" workflow
selector = FeatureRecord.select_by_workflow("Manual_Annotation", container=ml)
from_humans = list(ml.feature_values("Image", "Diagnosis", selector=selector))

# Scoped to a specific dataset (only executions that processed that dataset)
selector = FeatureRecord.select_by_workflow("Model_Inference", container=dataset)
model_preds = list(dataset.feature_values("Image", "Diagnosis", selector=selector))
```
`container` is required and keyword-only. It must be a `DerivaML`, `Dataset`, or `DatasetBag` instance — any object that implements `list_workflow_executions(workflow) -> list[str]`. The container determines the scope: `DerivaML` considers all catalog executions; `Dataset` and `DatasetBag` consider only executions within that dataset's scope.
Unknown-workflow errors surface at factory construction time, not during iteration.
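The eager-resolution behavior can be illustrated with a minimal factory of the same shape (plain Python, not the library's implementation): the workflow lookup runs once at construction, so a bad name fails immediately rather than mid-iteration.

```python
def make_workflow_selector(workflow, executions_by_workflow):
    """Factory with the same shape as select_by_workflow: resolve the
    execution list eagerly, then return a closure that filters by it."""
    if workflow not in executions_by_workflow:
        # Unknown workflow fails here, at construction, not during iteration.
        raise ValueError(f"unknown workflow: {workflow}")
    wanted = set(executions_by_workflow[workflow])

    def selector(group):
        matches = [r for r in group if r["Execution"] in wanted]
        return matches[0] if matches else None  # None omits the target RID

    return selector

sel = make_workflow_selector("Manual_Annotation",
                             {"Manual_Annotation": ["3-WY2A"]})
picked = sel([{"Execution": "3-ZZZZ"}, {"Execution": "3-WY2A"}])
# picked is the record whose Execution is "3-WY2A"
```

The tie-breaking choice (first match wins) is an assumption made for this sketch only.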
select_majority_vote — consensus labeling
For multi-annotator scenarios, pick the most common value. The factory auto-detects the term column for single-term features.
```python
selector = DiagnosisRecord.select_majority_vote()  # auto-detects Diagnosis_Type
consensus = list(ml.feature_values("Image", "Diagnosis", selector=selector))

# For multi-term features, specify the column explicitly
selector = SomeRecord.select_majority_vote(column="MyTerm")
```
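Under the hood, majority voting is a per-group frequency count. Here is a plain-Python sketch of the idea using `collections.Counter`; the tie-breaking behavior shown (first-seen value wins) is an assumption for this sketch, not necessarily what the library does:

```python
from collections import Counter

def majority_vote(group, column="Diagnosis_Type"):
    """Pick a record whose value for `column` is most common in the group."""
    counts = Counter(r[column] for r in group)
    winner, _ = counts.most_common(1)[0]
    return next(r for r in group if r[column] == winner)

group = [{"Diagnosis_Type": "Normal"},
         {"Diagnosis_Type": "Normal"},
         {"Diagnosis_Type": "Mild"}]
consensus = majority_vote(group)  # a record labeled "Normal"
```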
How to define asset features
Asset features link computed files — segmentation masks, embeddings, depth maps — to target records. They differ from term features in two ways:
- The feature column holds an asset RID, not a vocabulary term.
- During the execution write path, you provide a `Path` object instead of a string. The library uploads the file to Hatrac and replaces the path with the catalog-side asset RID before inserting the feature row.
Setup
```python
# Create the asset table (once, at schema setup time)
ml.create_asset("Embedding", comment="512-dim image embedding vectors")

# Create the feature
EmbeddingRecord = ml.create_feature(
    target_table="Image",
    feature_name="ResNet50_Embedding",
    assets=["Embedding"],
    comment="ResNet-50 embedding for contrastive learning",
)
```
Writing asset features
```python
import numpy as np

with ml.create_execution(config) as exe:
    records = []
    for image in images:
        emb_path = exe.asset_file_path("Embedding", f"emb_{image['RID']}.npy")
        np.save(emb_path, compute_embedding(image))
        records.append(EmbeddingRecord(
            Image=image["RID"],
            Embedding=emb_path,  # Path → resolved to RID on upload
        ))
    exe.add_features(records)
    exe.upload_execution_outputs()  # assets uploaded first, then feature rows
```
Reading asset features
After upload, the Embedding column holds the asset's catalog RID. Use it to retrieve the file:
```python
for rec in ml.feature_values("Image", "ResNet50_Embedding"):
    print(f"Image {rec.Image} → asset RID {rec.Embedding}")
```
Notes
- The upload ordering contract — assets before features — is enforced by the library. You do not need to manage it.
- Mixed features (terms + assets) work the same way: put a `Path` in the asset column and a term name in the term column; both are resolved at upload time.
How to read features from a downloaded bag
DatasetBag implements the same three-method surface as DerivaML. You can call find_features, feature_values, and lookup_feature on a bag without a catalog connection.
```python
bag = dataset.download_dataset_bag(version="2.0.0", materialize=True)

# Identical API to the live catalog
for rec in bag.feature_values("Image", "Diagnosis",
                              selector=FeatureRecord.select_newest):
    print(f"{rec.Image}: {rec.Diagnosis_Type}")
```
select_by_workflow also works offline — pass container=bag:
```python
selector = FeatureRecord.select_by_workflow("Manual_Annotation", container=bag)
human_labels = list(bag.feature_values("Image", "Diagnosis", selector=selector))
```
The bag reads from a per-feature denormalization cache in its local SQLite database. The first call for a feature populates the cache; subsequent calls are cheap. The cache is immutable — bags are read-only.
`lookup_feature` on a bag returns a `Feature` whose `feature_record_class()` works fully offline. This means you can construct `FeatureRecord` instances from a bag's data, then hand them to `exe.add_features` when you are back online:
```python
# Offline: build records from bag data
feature = bag.lookup_feature("Image", "QualityScore")
RecordClass = feature.feature_record_class()
pending = [RecordClass(Image=r["RID"], QualityScore_Type="Good") for r in rows]

# Online: stage and upload
with ml.create_execution(online_config) as exe:
    exe.add_features(pending)
    exe.upload_execution_outputs()
```
!!! note
    `restructure_assets` for organizing bag files into ML framework directory layouts is covered in Chapter 5 (Working offline). That chapter also covers the `value_selector` parameter for resolving multi-value features during asset restructuring.
When to reach for feature_values vs Denormalizer
Both tools produce tabular output from catalog data, but they answer different questions.
Use feature_values(table, feature_name, selector=...) when:
- You want the values for one specific feature.
- You want typed `FeatureRecord` objects with provenance attributes.
- You want selector reduction (one value per target RID after filtering by workflow, annotator, or recency).
- You want to iterate over records and process them in Python before converting to a DataFrame.
```python
# feature_values: typed records for one feature, selector-reduced
selector = FeatureRecord.select_by_workflow("Manual_Annotation", container=ml)
labels = {
    rec.Image: rec.Diagnosis_Type
    for rec in ml.feature_values("Image", "Diagnosis", selector=selector)
}
```
Use Denormalizer when:
- You want a wide DataFrame combining columns from multiple tables in a single call.
- You want structural joins (e.g., `Subject → Observation → Image`) that are not feature-based.
- You want feature tables to participate as leaves in a multi-table join alongside domain tables.
- You want pandas output directly.
```python
# Denormalizer: wide DataFrame joining Subject, Image, and a feature table
from deriva_ml.local_db.denormalize import Denormalizer

d = Denormalizer(dataset)
df = d.as_dataframe(["Subject", "Image", "Execution_Image_Diagnosis"])
# df has Subject columns, Image columns, and Diagnosis columns — one row per feature value
```
The same question — "give me the diagnosis for each image, with subject context" — answered both ways:
```python
# feature_values approach: start from the feature, join subject in Python
records = {r.Image: r.Diagnosis_Type for r in ml.feature_values(
    "Image", "Diagnosis", selector=FeatureRecord.select_newest
)}
# Then join with a subject lookup if needed

# Denormalizer approach: one call, pandas output with all context columns
df = dataset.get_denormalized_as_dataframe(
    ["Subject", "Observation", "Image", "Execution_Image_Diagnosis"]
)
```
Use feature_values when you care about provenance, selectors, or typed access. Use Denormalizer when you care about wide joins and pandas integration.
Migrating from pre-S2 APIs
If your code uses any of the following, it will raise DerivaMLException with a migration pointer:
| Retired API | Replacement |
|---|---|
| `ml.add_features(records)` | `exe.add_features(records)` inside an execution context |
| `ml.fetch_table_features(...)` | `ml.feature_values(table, name)` or `Denormalizer` |
| `bag.fetch_table_features(...)` | `bag.feature_values(table, name)` |
| `ml.list_feature_values(...)` | `ml.feature_values(...)` (renamed, same signature) |
| `bag.list_feature_values(...)` | `bag.feature_values(...)` (renamed, same signature) |
| `ml.select_by_workflow(records, workflow)` | `FeatureRecord.select_by_workflow(workflow, container=ml)` |
The retired APIs raise immediately with a specific message naming the replacement. There are no silent fallbacks.
Common pitfalls
!!! warning "feature_values materializes all rows before yielding"
    Despite being iterator-shaped, `feature_values` fetches every row for the feature from the catalog in one call before yielding the first record. For features with very large value tables, apply a selector to reduce the output, or filter in your caller. This behavior is documented in the method's docstring and is by design; streaming iteration is a future enhancement.
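If even the selector-reduced output is too large to keep as a list of record objects, you can fold the iterator into a compact mapping as you go. Below is a sketch with plain dicts standing in for `FeatureRecord`s; the column names follow the earlier examples:

```python
def fold_labels(records):
    """Reduce an iterator of feature records to a compact {RID: label} map,
    keeping only the newest value per target (by RCT timestamp)."""
    newest = {}  # RID -> (RCT, label)
    for r in records:
        prev = newest.get(r["Image"])
        if prev is None or r["RCT"] > prev[0]:
            newest[r["Image"]] = (r["RCT"], r["Diagnosis_Type"])
    return {rid: label for rid, (_, label) in newest.items()}

records = [
    {"Image": "1-A", "RCT": "2024-01-01", "Diagnosis_Type": "Normal"},
    {"Image": "1-A", "RCT": "2024-06-01", "Diagnosis_Type": "Mild"},
]
labels = fold_labels(records)  # only the newer "Mild" label survives
```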
!!! warning "exe.add_features stages to SQLite, not to the catalog"
    Records passed to `exe.add_features` are not visible in the catalog until `upload_execution_outputs()` completes. If you query `feature_values` on the live catalog before calling `upload_execution_outputs()`, the records from the current execution will not appear. This is intentional: partial execution output should not be visible to other readers.
!!! warning "Workflow deduplication affects select_by_workflow"
    Workflows are deduplicated by checksum. If you run the same script multiple times, `create_workflow` returns the same workflow RID each time. `FeatureRecord.select_by_workflow("My_Workflow", container=ml)` will therefore match feature values from all runs of that script. If you need to distinguish runs, use `select_by_execution` with a specific execution RID, or create a new workflow for each run by changing the script or passing an explicit checksum.
See also
- API reference: `DerivaML.create_feature`, `DerivaML.feature_values`, `DerivaML.find_features`, `DerivaML.lookup_feature` in the Library Documentation.
- Chapter 2 (Datasets): how dataset versioning interacts with feature values via catalog snapshots.
- Chapter 4 (Executions): `create_execution`, `upload_execution_outputs`, `resume_execution`, and the full execution lifecycle.
- Chapter 5 (Working offline): `restructure_assets` with `value_selector` for feature-based file organization in offline ML workflows.