# Features
Features are a core concept in DerivaML for ML data engineering. They enable you to associate metadata with domain objects (like Images, Subjects, or any table in your schema) to support machine learning workflows.
## What is a Feature?
A feature associates metadata values with records in your domain tables. Unlike regular table columns, features:
- **Track Provenance**: Every feature value records which Execution produced it
- **Use Controlled Vocabularies**: Categorical features use vocabulary terms for consistency
- **Support Multiple Values**: An object can have multiple values for the same feature
- **Are Versioned**: Feature values are included in dataset versions for reproducibility
## Common Use Cases
| Use Case | Example | Feature Type |
|---|---|---|
| Ground truth labels | "Normal" vs "Abnormal" classification | Term-based |
| Model predictions | Inference results from a classifier | Term-based |
| Quality scores | Image quality ratings (1-5) | Term-based |
| Derived measurements | Computed metrics from analysis | Value-based |
| Related assets | Segmentation masks, embeddings | Asset-based |
## Feature Types
Features can reference different types of values:
### Term-Based Features
The most common type. Values come from controlled vocabulary tables, ensuring consistency and enabling queries across the vocabulary hierarchy.
```python
# Create a vocabulary for diagnosis labels
ml.create_vocabulary("Diagnosis_Type", "Clinical diagnosis categories")
ml.add_term("Diagnosis_Type", "Normal", "No abnormality detected")
ml.add_term("Diagnosis_Type", "Abnormal", "Abnormality present")

# Create a feature that uses this vocabulary
ml.create_feature(
    target_table="Image",
    feature_name="Diagnosis",
    terms=["Diagnosis_Type"],
    comment="Clinical diagnosis for this image",
)
```
### Asset-Based Features
Link derived assets (files) to domain objects. Useful for segmentation masks, embeddings, or any computed files.
```python
# Create an asset table for segmentation masks
ml.create_asset("Segmentation_Mask", comment="Binary segmentation mask images")

# Create a feature linking masks to images
ml.create_feature(
    target_table="Image",
    feature_name="Segmentation",
    assets=["Segmentation_Mask"],
    comment="Segmentation mask for this image",
)
```
When you create an asset-based feature, the generated FeatureRecord class accepts either a file path or an asset RID for the asset column. During execution upload, file paths are automatically replaced with the RIDs of the uploaded assets.
### Mixed Features
Features can reference both terms and assets for complex annotations.
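No mixed-feature call appears in the examples above, so the following is a minimal sketch: it assumes `create_feature()` accepts `terms` and `assets` together exactly as it accepts each one separately, and the feature name and comment are purely illustrative.

```python
def create_annotated_mask_feature(ml):
    # Hypothetical mixed feature: each value carries both a vocabulary
    # term (the label) and an asset (the mask that supports it).
    # Assumes terms= and assets= can be combined in one call.
    ml.create_feature(
        target_table="Image",
        feature_name="Annotated_Mask",
        terms=["Diagnosis_Type"],
        assets=["Segmentation_Mask"],
        comment="Diagnosis label plus the mask that supports it",
    )
```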
## Creating Feature Values
Feature values are created during Executions to maintain provenance. Every value knows which workflow produced it.
The workflow for adding feature values is:
1. Get the FeatureRecord class for your feature (via `create_feature()` or `feature_record_class()`)
2. Create instances of the FeatureRecord with your data
3. Add the records within an execution using `execution.add_features()`
```python
from deriva_ml import DerivaML
from deriva_ml.execution import ExecutionConfiguration
from deriva_ml.dataset import DatasetSpec

ml = DerivaML(hostname, catalog_id)

# Get the FeatureRecord class for the Diagnosis feature
DiagnosisFeature = ml.feature_record_class("Image", "Diagnosis")

# Set up execution
config = ExecutionConfiguration(
    workflow=ml.create_workflow("Labeling", "Annotation"),
    datasets=[DatasetSpec(rid=dataset_rid)],
)

with ml.create_execution(config) as exe:
    # Get images to label
    bag = exe.download_dataset_bag(DatasetSpec(rid=dataset_rid))

    # Create feature records (provenance tracked automatically)
    feature_records = []
    for image in bag.list_dataset_members()["Image"]:
        record = DiagnosisFeature(
            Image=image["RID"],       # Target record RID
            Diagnosis_Type="Normal",  # Vocabulary term name
        )
        feature_records.append(record)

    # Add all feature records to the execution
    exe.add_features(feature_records)

# Upload after execution context exits
exe.upload_execution_outputs()
```
### Asset-Based Feature Values
For asset-based features, you provide file paths instead of vocabulary terms. The execution handles uploading the asset files and linking them to the feature records.
```python
# Get the FeatureRecord class for an asset-based feature
SegmentationFeature = ml.feature_record_class("Image", "Segmentation")

config = ExecutionConfiguration(
    workflow=ml.create_workflow("Segmentation", "Model_Inference"),
    datasets=[DatasetSpec(rid=dataset_rid)],
)

with ml.create_execution(config) as exe:
    bag = exe.download_dataset_bag(DatasetSpec(rid=dataset_rid))

    feature_records = []
    for image in bag.list_dataset_members()["Image"]:
        # Create the asset file using asset_file_path
        mask_path = exe.asset_file_path(
            "Segmentation_Mask",
            f"mask_{image['RID']}.png",
        )

        # Write the asset file (e.g., a segmentation mask)
        generate_segmentation_mask(image, output_path=mask_path)

        # Reference the file path in the feature record
        record = SegmentationFeature(
            Image=image["RID"],
            Segmentation_Mask=mask_path,  # File path, not an RID
        )
        feature_records.append(record)

    exe.add_features(feature_records)

# Upload assets and feature values
exe.upload_execution_outputs()
```
During upload, the execution automatically:
- Uploads each asset file to the catalog's object store
- Replaces the file paths in the feature records with the RIDs of the uploaded assets
- Inserts the feature records into the catalog
After upload, querying the feature values returns asset RIDs rather than file paths.
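If you want to sanity-check that the substitution happened, a small helper can distinguish a catalog RID from a leftover file path. The hyphen-separated token pattern below is an assumption about the typical ERMrest RID format, not part of the DerivaML API:

```python
import re

def looks_like_rid(value) -> bool:
    """Heuristic: RIDs are short hyphen-separated alphanumeric tokens,
    while an unreplaced value would still look like a file path."""
    return bool(re.fullmatch(r"[0-9A-Za-z]+(-[0-9A-Za-z]+)+", str(value)))

looks_like_rid("2-ABC1")                 # → True
looks_like_rid("masks/mask_2-ABC1.png")  # → False
```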
## Querying Feature Values
DerivaML provides several methods for retrieving feature values, from simple single-feature queries to bulk retrieval with deduplication.
### Fetch All Features for a Table
`fetch_table_features()` retrieves all feature values for a table in a single call, grouped by feature name. This is the most efficient way to get a complete picture of the annotations on a table.
```python
# Get all features for Image — returns a dict keyed by feature name
features = ml.fetch_table_features("Image")
for name, records in features.items():
    print(f"{name}: {len(records)} values")

# Get just one feature
features = ml.fetch_table_features("Image", feature_name="Diagnosis")
diagnosis_records = features["Diagnosis"]
```
Each record is a typed Pydantic model with attributes matching the feature's columns:
```python
for record in diagnosis_records:
    print(f"Image: {record.Image}")
    print(f"Diagnosis: {record.Diagnosis_Type}")
    print(f"Created by: {record.Execution}")
    print(f"Created at: {record.RCT}")
```
Convert results to a pandas DataFrame for analysis:
```python
import pandas as pd

df = pd.DataFrame([r.model_dump() for r in diagnosis_records])
```
### List Values for a Single Feature
`list_feature_values()` is a convenience wrapper when you only need one feature. It returns a flat list instead of a dictionary.
```python
# Get all diagnosis values across all images
for v in ml.list_feature_values("Image", "Diagnosis"):
    print(f"Image {v.Image}: {v.Diagnosis_Type} (by Execution {v.Execution})")
```
### Resolving Multiple Values with Selectors
When the same object has multiple values for a feature — for example, labels from different annotators or predictions from successive model runs — you can pass a selector function to pick one value per object.
A selector is any callable with signature `(list[FeatureRecord]) -> FeatureRecord`. It receives all records for a single target object and returns the one to keep.
#### Built-in: `select_newest`
`FeatureRecord.select_newest` picks the record with the most recent `RCT` (Row Creation Time). This is the most common selector.
```python
from deriva_ml.feature import FeatureRecord

# Deduplicate to one value per image, keeping the newest
features = ml.fetch_table_features(
    "Image",
    feature_name="Diagnosis",
    selector=FeatureRecord.select_newest,
)

# Also works with list_feature_values
newest_values = list(ml.list_feature_values(
    "Image", "Diagnosis",
    selector=FeatureRecord.select_newest,
))
```
#### Custom Selectors
Write your own selector for domain-specific logic:
```python
# Pick the record with the highest confidence score
def select_highest_confidence(records):
    return max(records, key=lambda r: getattr(r, "Confidence", 0))

features = ml.fetch_table_features(
    "Image",
    feature_name="Diagnosis",
    selector=select_highest_confidence,
)
```
### Selecting by Workflow
`select_by_workflow()` filters feature records to those produced by a specific workflow or workflow type, then returns the newest match. This is useful when you have labels from multiple sources (e.g., manual annotation vs. model inference) and want to use values from a particular one.

Unlike `select_newest`, this method requires catalog access to look up executions, so it cannot be passed as a `selector` argument. Instead, call it directly on a group of records.
```python
from collections import defaultdict

# Get all classification values
all_values = list(ml.list_feature_values("Image", "Classification"))

# Group by image
by_image = defaultdict(list)
for v in all_values:
    by_image[v.Image].append(v)

# Select the newest value from any "Model_Inference" workflow type
selected = {}
for image_rid, records in by_image.items():
    selected[image_rid] = ml.select_by_workflow(records, "Model_Inference")
```
The `workflow` argument accepts either a Workflow RID or a `Workflow_Type` name. DerivaML auto-detects which one you provided:
```python
# By workflow type name — selects from any workflow of this type
record = ml.select_by_workflow(records, "Training")

# By specific workflow RID — selects from executions of this exact workflow
record = ml.select_by_workflow(records, "2-ABC1")
```
If no records match the specified workflow, a DerivaMLException is raised.
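When a missing match should not abort processing, a small wrapper can turn the exception into a fallback value. This is a sketch; it catches a broad `Exception` so it stays independent of the exact exception import path:

```python
def select_by_workflow_or(ml, records, workflow, fallback=None):
    """Return ml.select_by_workflow(records, workflow), or `fallback`
    when no record was produced by that workflow (DerivaMLException)."""
    try:
        return ml.select_by_workflow(records, workflow)
    except Exception:
        return fallback
```

This is handy when only some objects have, say, manual annotations and the rest should fall back to model predictions.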
### Find Feature Definitions
These methods inspect the catalog schema to discover what features exist. They return `Feature` objects describing the feature structure, not feature values.
```python
# What features are defined for images?
features = ml.find_features("Image")
for f in features:
    print(f"  {f.feature_name}: {f.feature_table.name}")

# List all features in the catalog
all_features = ml.find_features()
```
### Get Feature Structure
`lookup_feature()` returns a `Feature` schema descriptor for a single feature. Use it to inspect what columns a feature has and what types of values it accepts.
```python
# Examine a specific feature's structure
feature = ml.lookup_feature("Image", "Diagnosis")
print(f"Target: {feature.target_table.name}")
print(f"Feature table: {feature.feature_table.name}")
print(f"Term columns: {[c.name for c in feature.term_columns]}")
print(f"Asset columns: {[c.name for c in feature.asset_columns]}")
print(f"Value columns: {[c.name for c in feature.value_columns]}")
```
The `Feature` object also provides `feature_record_class()`, which returns the dynamically generated Pydantic model for constructing new feature records:
```python
DiagnosisRecord = feature.feature_record_class()
# Equivalent to: DiagnosisRecord = ml.feature_record_class("Image", "Diagnosis")
```
## Feature Tables
When you create a feature, DerivaML creates an association table with:
| Column | Purpose |
|---|---|
| `{TargetTable}` | RID of the domain object (e.g., Image RID) |
| `Feature_Name` | The feature name (from the `Feature_Name` vocabulary) |
| `Execution` | RID of the execution that produced this value |
| `{VocabTable}` | RID of the vocabulary term (for term-based features) |
| `{AssetTable}` | RID of the asset (for asset-based features) |
This structure enables:

- Querying all values for a feature
- Finding which execution produced a value
- Joining with vocabulary tables for term labels
- Multiple values per object (many-to-many relationship)
## Best Practices
### Feature Naming
- Use descriptive names: `Diagnosis`, `Quality_Score`, `Predicted_Class`
- Feature names are controlled vocabulary terms in the `Feature_Name` table
- The same feature name can be used across different tables
### Vocabulary Design
- Create vocabularies before features that use them
- Include synonyms for flexible matching
- Add descriptions to help users understand term meanings
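For example, synonyms and descriptions can be supplied when terms are added. The `synonyms` keyword below reflects a common `add_term` signature but is an assumption; check it against your installed DerivaML version:

```python
def seed_quality_vocabulary(ml):
    # Assumed signature: add_term(vocabulary, term, description, synonyms=[...])
    ml.create_vocabulary("Image_Quality", "Image quality ratings")
    ml.add_term(
        "Image_Quality", "Acceptable", "Usable for training",
        synonyms=["OK", "Pass"],
    )
    ml.add_term(
        "Image_Quality", "Unacceptable", "Excluded from training",
        synonyms=["Reject", "Fail"],
    )
```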
### Provenance
- Always create feature values within an Execution context
- Use meaningful workflow types: "Manual_Annotation", "Model_Inference", etc.
- Include dataset versions for reproducibility
## Working with Multiple Values
A single object can have multiple values for the same feature. This is common when:
- Multiple annotators label the same image
- A model produces predictions at different times
- Different versions of analysis are run
### Querying Multiple Values
```python
# Get all values for a specific image
values = list(ml.list_feature_values("Image", "Diagnosis"))
image_values = [v for v in values if v.Image == image_rid]

for v in image_values:
    print(f"Value: {v.Diagnosis_Type} from Execution {v.Execution} at {v.RCT}")
```
### Deduplicating Values
Use a selector to keep one value per object (see Resolving Multiple Values with Selectors above):
```python
from collections import defaultdict

from deriva_ml.feature import FeatureRecord

# Keep only the newest value per image
newest = list(ml.list_feature_values(
    "Image", "Diagnosis",
    selector=FeatureRecord.select_newest,
))

# Or keep values from a specific workflow
all_values = list(ml.list_feature_values("Image", "Diagnosis"))
by_image = defaultdict(list)
for v in all_values:
    by_image[v.Image].append(v)

from_annotation = {
    img: ml.select_by_workflow(recs, "Manual_Annotation")
    for img, recs in by_image.items()
}
```
### Resolving Multiple Values in `restructure_assets`

When restructuring assets by a feature that has multiple values, you can provide a `value_selector` function to choose which value to use:
```python
from pathlib import Path

from deriva_ml.dataset.dataset_bag import FeatureValueRecord

def select_latest(records: list[FeatureValueRecord]) -> FeatureValueRecord:
    """Select the value with the most recent creation time."""
    return max(records, key=lambda r: r.raw_record.get("RCT", "") or "")

bag.restructure_assets(
    asset_table="Image",
    output_dir=Path("./ml_data"),
    group_by=["Diagnosis"],
    value_selector=select_latest,
)
```
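The selector logic itself can be checked standalone. The sketch below restates `select_latest` and exercises it with `SimpleNamespace` stand-ins that mimic only the `raw_record` attribute a real FeatureValueRecord would carry:

```python
from types import SimpleNamespace

def select_latest(records):
    """Newest RCT wins; a missing RCT sorts before any timestamp."""
    return max(records, key=lambda r: r.raw_record.get("RCT", "") or "")

records = [
    SimpleNamespace(raw_record={"RCT": "2024-01-01T00:00:00", "Diagnosis_Type": "Normal"}),
    SimpleNamespace(raw_record={"RCT": "2024-06-01T00:00:00", "Diagnosis_Type": "Abnormal"}),
    SimpleNamespace(raw_record={}),  # no RCT, treated as oldest
]
select_latest(records).raw_record["Diagnosis_Type"]  # → "Abnormal"
```

ISO-8601 timestamps compare correctly as strings, which is why a plain `max` over `RCT` is enough here.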
The `FeatureValueRecord` provides:

| Attribute | Description |
|---|---|
| `target_rid` | RID of the object this value applies to |
| `feature_name` | Name of the feature |
| `value` | The feature value (e.g., vocabulary term name) |
| `execution_rid` | RID of the execution that created this value |
| `raw_record` | Complete feature table row as a dictionary |
### Consensus or Aggregation
For more complex scenarios, you might aggregate multiple values:
```python
from collections import Counter

from deriva_ml.dataset.dataset_bag import FeatureValueRecord

def select_majority_vote(records: list[FeatureValueRecord]) -> FeatureValueRecord:
    """Select the most common value (majority vote)."""
    counts = Counter(r.value for r in records)
    most_common_value = counts.most_common(1)[0][0]
    return next(r for r in records if r.value == most_common_value)
```
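The majority-vote logic can also be checked standalone, again with `SimpleNamespace` stand-ins that carry only the `value` attribute the selector reads:

```python
from collections import Counter
from types import SimpleNamespace

def select_majority_vote(records):
    """Most common value wins; Counter resolves ties by first insertion."""
    counts = Counter(r.value for r in records)
    most_common_value = counts.most_common(1)[0][0]
    return next(r for r in records if r.value == most_common_value)

votes = [SimpleNamespace(value=v) for v in ["Normal", "Abnormal", "Normal"]]
select_majority_vote(votes).value  # → "Normal"
```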
## Deleting Features
```python
# WARNING: This permanently removes the feature and all its values
ml.delete_feature("Image", "Diagnosis")
```
Deletion removes:

- The feature table
- All feature values
- All provenance information for this feature
## Features in Datasets
Feature values are included when you:
- **Export a dataset**: Feature tables are exported as CSVs in the BDBag
- **Download a dataset bag**: Feature values are loaded into the local SQLite database
- **Version a dataset**: Feature values at that version are preserved via catalog snapshots
This ensures ML workflows have access to the labels and annotations associated with dataset elements.
The same query methods work on dataset bags for offline access:
```python
from deriva_ml.feature import FeatureRecord

bag = ml.download_dataset_bag(DatasetSpec(rid=dataset_rid, version="1.2.0"))

# fetch_table_features and list_feature_values work on bags
features = bag.fetch_table_features("Image")
values = list(bag.list_feature_values(
    "Image", "Diagnosis",
    selector=FeatureRecord.select_newest,
))
```
Note that `select_by_workflow` is not available on dataset bags since it requires live catalog access to look up workflow and execution records.