DerivaML Features

DerivaML is a class library built on the Deriva scientific asset management system. It is designed to simplify common operations involved in building and testing ML libraries based on toolkits such as TensorFlow. This notebook reviews the basic features of the DerivaML library.

In DerivaML, "features" are the way we attach values to objects in the catalog. A feature could be a computed value that serves as input to an ML model, or it could be a label that results from running a model. A feature value can be a controlled vocabulary term, an asset, or a plain value.

Each feature value in the catalog is distinguished by the name of the feature, the identity of the object to which it is attached, and the RID of the execution that generated it.
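
As a rough illustration of that identity rule, the sketch below models a feature value as a plain Python record keyed by feature name, target RID, and execution RID. The `FeatureValue` class and the RID strings are hypothetical stand-ins, not part of the DerivaML API; they only show that two executions can attach different values of the same feature to the same object without colliding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureValue:
    """Hypothetical stand-in for one feature value; not a DerivaML class."""
    feature_name: str
    target_rid: str      # RID of the object the value is attached to
    execution_rid: str   # RID of the execution that produced the value
    value: str

    @property
    def key(self) -> tuple[str, str, str]:
        # The three parts that together distinguish one feature value from another.
        return (self.feature_name, self.target_rid, self.execution_rid)


# Made-up RIDs: the same image is labeled by two different executions.
first = FeatureValue("Quality", "IMG-1", "EXE-1", "Good")
second = FeatureValue("Quality", "IMG-1", "EXE-2", "Bad")

# Both values survive because their keys differ in the execution RID.
values_by_key = {fv.key: fv.value for fv in (first, second)}
print(len(values_by_key))
```

Keeping the execution RID in the key is what lets a later run relabel an object while the earlier label remains available for comparison.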

Set up Deriva for test case

%load_ext autoreload
%autoreload 2

import builtins
from deriva.core.utils.globus_auth_utils import GlobusNativeLogin
from deriva_ml import ColumnDefinition, BuiltinTypes, MLVocab, DerivaSystemColumns
from deriva_ml.demo_catalog import create_demo_catalog, DemoML
from deriva_ml import ExecutionConfiguration
from IPython.display import display, Markdown, HTML
import itertools
import pandas as pd
import random

Set the details for the catalog we want and authenticate to the server if needed.

hostname = 'localhost'
domain_schema = 'demo-schema'

gnl = GlobusNativeLogin(host=hostname)
if gnl.is_logged_in([hostname]):
    print("You are already logged in.")
else:
    gnl.login([hostname], no_local_server=True, no_browser=True, refresh_tokens=True, update_bdbag_keychain=True)
    print("Login Successful")

Create a test catalog and get an instance of the DerivaML class.

test_catalog = create_demo_catalog(hostname, domain_schema)
ml_instance = DemoML(hostname, test_catalog.catalog_id)
display(f"Created demo catalog at {hostname}:{test_catalog.catalog_id}")

Define Features

A feature is a set of values attached to a table in the DerivaML catalog. Instances of a feature are distinguished from one another by the ID of the execution that produced the feature value. The execution could be a program run, or it could be a manual process in which a person defines a set of values.

To create a new feature, we need to know the name of the feature, the table to which it is attached, and the set of values that make up the feature. The values could be terms from a controlled vocabulary, one or more file-based assets, or other values such as integers or strings. However, use of strings outside of controlled vocabularies is discouraged.
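
One way to see why controlled vocabularies are preferred over free-form strings: a vocabulary rejects misspelled or unknown terms at the point where a value is created. The sketch below imitates this with a plain Python `Enum`. `SubjectHealthTerm` and `parse_term` are hypothetical names used only for illustration; DerivaML keeps its vocabularies in the catalog rather than in an enum.

```python
from enum import Enum


class SubjectHealthTerm(Enum):
    """Hypothetical local mirror of a controlled vocabulary."""
    SICK = "Sick"
    WELL = "Well"


def parse_term(raw: str) -> SubjectHealthTerm:
    """Return the vocabulary term for `raw`, or raise ValueError if it is unknown."""
    try:
        return SubjectHealthTerm(raw)
    except ValueError:
        allowed = [term.value for term in SubjectHealthTerm]
        raise ValueError(f"{raw!r} is not a known term; expected one of {allowed}") from None


print(parse_term("Well").value)

try:
    parse_term("well-ish")  # a free-form string that is not in the vocabulary
except ValueError as err:
    print(err)
```

A free-form string column would have accepted "well-ish" silently, leaving the inconsistency to be discovered later during analysis.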

For our example, we are going to define three features. Two of them will use values from a controlled vocabulary, which we need to create. The third feature will consist of a file whose contents we will generate. To start, we will need to create the controlled vocabularies, and create an asset table for the feature values.

# Prerequisites for our features: controlled vocabulary (CV) terms and an asset table.

# Create the vocabularies and add the terms that our features will use.
ml_instance.create_vocabulary("SubjectHealth", "Controlled vocabulary for subject health")
ml_instance.add_term("SubjectHealth", "Sick", description="The subject self reports that they are sick")
ml_instance.add_term("SubjectHealth", "Well", description="The subject self reports that they feel well")

ml_instance.create_vocabulary("ImageQuality", "Controlled vocabulary for image quality")
ml_instance.add_term("ImageQuality", "Good", description="The image is good")
ml_instance.add_term("ImageQuality", "Bad", description="The image is bad")

box_asset = ml_instance.create_asset("BoundingBox", comment="A file that contains a cropped version of an image")

We are now ready to create our new features. Each feature is associated with a table and has a name and a set of values that define it. After we create the features, we can list the features associated with each table.

ml_instance.create_feature(
    "Subject", "Health",
    terms=["SubjectHealth"],
    metadata=[ColumnDefinition(name='Scale', type=BuiltinTypes.int2, nullok=True)],
    optional=['Scale'],
)

ml_instance.create_feature('Image', 'BoundingBox', assets=[box_asset])
ml_instance.create_feature('Image', 'Quality', terms=["ImageQuality"])

display(
    [f'{f.target_table.name}:{f.feature_name}' for f in ml_instance.find_features("Subject")],
    [f'{f.target_table.name}:{f.feature_name}' for f in ml_instance.find_features("Image")]
)

Now we can add some feature values to our images and subjects. To streamline the creation of new feature values, we create a class for each feature that is specific to the arguments required to create it.
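
To give a feel for what such a generated class does, here is a minimal sketch that builds a record class from a list of column names using the standard library's `dataclasses.make_dataclass`. The helper `make_record_class` and its arguments are hypothetical and do not reflect how DerivaML implements `feature_record_class`; the point is only that a class generated per feature can require some columns and leave others optional.

```python
from dataclasses import asdict, field, make_dataclass
from typing import Optional


def make_record_class(name: str, required: list[str], optional: list[str]):
    """Build a record class with required and optional columns (hypothetical helper)."""
    fields = [(col, str) for col in required]
    # Optional columns default to None, so callers may omit them.
    fields += [(col, Optional[int], field(default=None)) for col in optional]
    return make_dataclass(name, fields)


HealthRecord = make_record_class(
    "HealthRecord", required=["Subject", "SubjectHealth"], optional=["Scale"]
)

record = HealthRecord(Subject="SUBJ-1", SubjectHealth="Well")  # made-up RID
print(asdict(record))

try:
    HealthRecord(Subject="SUBJ-1")  # missing the required SubjectHealth column
except TypeError as err:
    print("rejected:", err)
```

Generating the class from the feature definition means a missing required column is caught when the record is constructed rather than when it is written to the catalog.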

ImageQualityFeature = ml_instance.feature_record_class("Image", "Quality")
ImageBoundingboxFeature = ml_instance.feature_record_class("Image", "BoundingBox")
SubjectWellnessFeature = ml_instance.feature_record_class("Subject", "Health")

# Show which columns each generated feature class exposes.
display(
    Markdown('### SubjectWellnessFeature'),
    Markdown(f'* feature columns: ```{[c.name for c in SubjectWellnessFeature.feature_columns()]}```'),
    Markdown(f'* required columns: ```{[c.name for c in SubjectWellnessFeature.feature_columns() if not c.nullok]}```'),
    Markdown(f'* term columns: ```{[c.name for c in SubjectWellnessFeature.term_columns()]}```'),
    Markdown(f'* value columns: ```{[c.name for c in SubjectWellnessFeature.value_columns()]}```'),
    Markdown(f'* asset columns: ```{[c.name for c in SubjectWellnessFeature.asset_columns()]}```'),

    Markdown('### ImageQualityFeature'),
    Markdown(f'* feature columns: ```{[c.name for c in ImageQualityFeature.feature_columns()]}```'),
    Markdown(f'* required columns: ```{[c.name for c in ImageQualityFeature.feature_columns() if not c.nullok]}```'),
    Markdown(f'* term columns: ```{[c.name for c in ImageQualityFeature.term_columns()]}```'),
    Markdown(f'* value columns: ```{[c.name for c in ImageQualityFeature.value_columns()]}```'),
    Markdown(f'* asset columns: ```{[c.name for c in ImageQualityFeature.asset_columns()]}```'),

    Markdown('### ImageBoundingboxFeature'),
    Markdown(f'* feature columns: ```{[c.name for c in ImageBoundingboxFeature.feature_columns()]}```'),
    Markdown(f'* required columns: ```{[c.name for c in ImageBoundingboxFeature.feature_columns() if not c.nullok]}```'),
    Markdown(f'* term columns: ```{[c.name for c in ImageBoundingboxFeature.term_columns()]}```'),
    Markdown(f'* value columns: ```{[c.name for c in ImageBoundingboxFeature.value_columns()]}```'),
    Markdown(f'* asset columns: ```{[c.name for c in ImageBoundingboxFeature.asset_columns()]}```'),
)

Add feature values

Using the feature classes, we can now create some instances of each feature and add them to the catalog. We must have an execution RID in order to define a feature value. In our example, we will assume that the execution that calculates the feature values will use a model file to configure it, so we will need to create and upload the file before we can start the execution.

ml_instance.add_term(MLVocab.workflow_type, "Feature Notebook Workflow", description="A Workflow that uses Deriva ML API")
ml_instance.add_term(MLVocab.asset_type, "API_Model", description="Model for our Notebook workflow")

# Get the workflow for this notebook
notebook_workflow = ml_instance.create_workflow(
    name="API Workflow", 
    workflow_type="Feature Notebook Workflow"
)

feature_execution = ml_instance.create_execution(
    ExecutionConfiguration(
        workflow=notebook_workflow,
        description="Our Sample Workflow instance")
)

# Get the RIDs of all the objects that we want to attach features to.
subject_rids = [i['RID'] for i in ml_instance.domain_path.tables['Subject'].entities().fetch()]
image_rids = [i['RID'] for i in ml_instance.domain_path.tables['Image'].entities().fetch()]

Now that we have the list of objects that we want to add features to, we can define the sets of feature values we want to record and then add them to the catalog.

# Create a placeholder bounding-box file for each image.
# For fun, let's wrap the feature additions below in an execution so we get status updates.
image_bounding_box_feature_list = []
for cnt, image_rid in enumerate(image_rids):
    bounding_box_file = feature_execution.asset_file_path("BoundingBox", f"box{cnt}.txt")
    with open(bounding_box_file, "w") as fp:
        fp.write(f"Hi there {cnt}")
    image_bounding_box_feature_list.append(
        ImageBoundingboxFeature(Image=image_rid, BoundingBox=bounding_box_file)
    )

image_quality_feature_list = [
    ImageQualityFeature(
        Image=image_rid,
        ImageQuality=["Good", "Bad"][random.randint(0, 1)],
    )
    for image_rid in image_rids
]

subject_feature_list = [
    SubjectWellnessFeature(
        Subject=subject_rid,
        SubjectHealth=["Well", "Sick"][random.randint(0, 1)],
        Scale=random.randint(1, 10),
    )
    for subject_rid in subject_rids
]

with feature_execution.execute() as execution:
    feature_execution.add_features(image_bounding_box_feature_list)
    feature_execution.add_features(image_quality_feature_list)
    feature_execution.add_features(subject_feature_list)

# Upload all of the new assets that we have created during the execution.
feature_execution.upload_execution_outputs()

display(
    Markdown('### Wellness'),
    pd.DataFrame(ml_instance.list_feature_values("Subject", "Health")).drop(columns=DerivaSystemColumns + ['Feature_Name']),
    Markdown('### Image Quality'),
    pd.DataFrame(ml_instance.list_feature_values("Image", "Quality")).drop(columns=DerivaSystemColumns + ['Feature_Name']),
    Markdown('### BoundingBox'),
    pd.DataFrame(ml_instance.list_feature_values("Image", "BoundingBox")).drop(columns=DerivaSystemColumns + ['Feature_Name']),
)

display(HTML(f'<a href="{ml_instance.chaise_url("Subject")}">Browse Subject Table</a>'))

test_catalog.delete_ermrest_catalog(really=True)