DerivaML Vocabulary¶
DerivaML is a class library built on the Deriva Scientific Asset management system. It is designed to simplify many of the basic operations involved in building and testing ML libraries based on common toolkits such as TensorFlow. This notebook reviews the basic features of controlled vocabularies in the DerivaML library.
A core aspect of DerivaML is the extensive use of controlled vocabulary terms. A vocabulary term may be defined outside of the study, for example in an ontology like Uberon or Schema.org, or it may be defined and used locally by the ML team. Using a controlled vocabulary makes data easier to find and helps ensure that members of the ML team mean the same thing when they use the same term.
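To make the benefit concrete, here is a minimal sketch in plain Python of checking free-text labels against a fixed set of terms. The vocabulary `QUALITY_TERMS` and the helper `check_labels` are hypothetical names invented for this illustration; they are not part of DerivaML.

```python
# Hypothetical local vocabulary, invented for illustration only.
QUALITY_TERMS = {"Good", "Fair", "Poor"}


def check_labels(labels, vocabulary):
    """Split labels into those found in the vocabulary and those that are not."""
    accepted = [label for label in labels if label in vocabulary]
    rejected = [label for label in labels if label not in vocabulary]
    return accepted, rejected


# "good" (wrong case) and "Blurry" (undefined) are caught instead of silently stored.
accepted, rejected = check_labels(["Good", "good", "Poor", "Blurry"], QUALITY_TERMS)
print(accepted)  # ['Good', 'Poor']
print(rejected)  # ['good', 'Blurry']
```

Rejecting labels that are not in the agreed set is what keeps later searches reliable: every record tagged with a concept uses exactly the same term.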
Preliminaries.¶
To start, we will do some preliminaries, loading needed modules and making sure we are logged into the DerivaML server.
from IPython.display import display, Markdown, HTML
import pandas as pd
from deriva.core.utils.globus_auth_utils import GlobusNativeLogin
from deriva_ml.demo_catalog import create_demo_catalog, DemoML
from deriva_ml import MLVocab
hostname = 'dev.eye-ai.org' # Replace with the hostname of your own Deriva server.
gnl = GlobusNativeLogin(host=hostname)
if gnl.is_logged_in([hostname]):
    print("You are already logged in.")
else:
    gnl.login([hostname], no_local_server=True, no_browser=True, refresh_tokens=True, update_bdbag_keychain=True)
    print("Login Successful")
You are already logged in.
Create a test catalog.¶
Create a test catalog and get an instance of the DerivaML class. This will take around 30 seconds, so be patient.
test_catalog = create_demo_catalog(hostname)
ml_instance = DemoML(hostname, test_catalog.catalog_id)
2025-06-06 14:12:47,103 - deriva_ml.WARNING - File /Users/carl/Repos/Projects/deriva-ml/docs/Notebooks/DerivaML Vocabulary.ipynb has been modified since last commit. Consider commiting before executing
Execution RID: https://dev.eye-ai.org/id/2060/3SC@33D-VDH5-6N1W
Explore existing vocabularies.¶
Get a list of all the currently defined controlled vocabularies.
ml_instance.find_vocabularies()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[4], line 1
----> 1 ml_instance.find_vocabularies()

AttributeError: 'DemoML' object has no attribute 'find_vocabularies'
Let's look at the contents of one of the predefined vocabularies in the DerivaML library. We can make this look nicer with a pandas DataFrame.
Many of the datatypes in DerivaML are represented by Pydantic models. These have a number of methods that make them easy to work with. The one we are going to use here is model_dump(), which converts a model instance into a dictionary.
display(
    Markdown(f"#### Contents of controlled vocabulary {MLVocab.execution_metadata_type}"),
    pd.DataFrame([v.model_dump() for v in ml_instance.list_vocabulary_terms(MLVocab.execution_metadata_type)])
)
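The `model_dump()` pattern used above can be tried without a catalog. The sketch below assumes Pydantic v2 is installed and uses a hypothetical `ExampleTerm` model as a stand-in; it is not DerivaML's own term class, and its fields are chosen only for illustration.

```python
from pydantic import BaseModel, Field


class ExampleTerm(BaseModel):
    """Hypothetical stand-in for a vocabulary term record (not a DerivaML class)."""

    name: str
    description: str = ""
    synonyms: list[str] = Field(default_factory=list)


terms = [
    ExampleTerm(name="Term0", description="My term 0", synonyms=["t0", "T0"]),
    ExampleTerm(name="Term1", description="My term 1"),
]

# model_dump() returns a plain dict keyed by field name, one per model instance.
rows = [term.model_dump() for term in terms]
print(rows[0])
```

A list of dictionaries like `rows` is a shape that `pd.DataFrame` accepts directly, which is why the cell above can tabulate the terms in a single expression.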
Creating a new controlled vocabulary.¶
Now let's create a new controlled vocabulary to house terms that are specific to the problem we are working on.
ml_instance.create_vocabulary("My term set", comment="Terms to use for generating tests")
ml_instance.find_vocabularies()
Adding terms¶
Given our new controlled vocabulary, we can add terms to it. A term has a name, which should uniquely identify it within the vocabulary, a description of what the term means, and a list of synonyms. Each term is assigned a resource identifier (RID) by the Deriva platform. Terms have additional features that facilitate integration with preexisting vocabularies, but these are beyond the scope of this notebook. See the class documentation for details.
for i in range(5):
    ml_instance.add_term("My term set", f"Term{i}", description=f"My term {i}", synonyms=[f"t{i}", f"T{i}"])
display(
    Markdown('#### Contents of controlled vocabulary "My term set"'),
    pd.DataFrame([v.model_dump() for v in ml_instance.list_vocabulary_terms("My term set")])
)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[2], line 3
      1 display(
      2     Markdown('#### Contents of controlled vocabulary "My term set'),
----> 3     pd.DataFrame([v.model_dump() for v in ml_instance.list_vocabulary_terms("My term set")])
      4 )

NameError: name 'ml_instance' is not defined
Looking up terms¶
We can also look up individual terms, either by name or by synonym.
display(
    ml_instance.lookup_term("My term set", "Term0"),
    ml_instance.lookup_term("My term set", "Term2"),
    ml_instance.lookup_term("My term set", "T3"),
)
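One way to picture lookup by name or synonym is a single index that maps every name and every synonym to its term. The sketch below is a toy in-memory model written for this explanation: `ToyTerm` and `ToyVocabulary` are hypothetical, and nothing here describes how DerivaML or the catalog actually implements `lookup_term`.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTerm:
    """Hypothetical term record: a name, a description, and synonyms."""

    name: str
    description: str
    synonyms: list[str] = field(default_factory=list)


class ToyVocabulary:
    """Toy vocabulary that resolves a term by its name or by any synonym."""

    def __init__(self):
        self._by_key = {}  # maps each name and each synonym to its ToyTerm

    def add_term(self, name, description, synonyms=()):
        term = ToyTerm(name, description, list(synonyms))
        keys = (name, *term.synonyms)
        # Check every key first so a clash leaves the index unchanged.
        for key in keys:
            if key in self._by_key:
                raise ValueError(f"{key!r} is already used in this vocabulary")
        for key in keys:
            self._by_key[key] = term
        return term

    def lookup_term(self, key):
        try:
            return self._by_key[key]
        except KeyError:
            raise KeyError(f"no term named or aliased {key!r}") from None


vocab = ToyVocabulary()
for i in range(5):
    vocab.add_term(f"Term{i}", f"My term {i}", synonyms=[f"t{i}", f"T{i}"])

# A synonym resolves to the same term object as the name.
print(vocab.lookup_term("T3").name)  # Term3
```

Indexing synonyms alongside names means callers never need to know which spelling is canonical; the lookup always returns the one term they all refer to.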
Browsing terms in the user interface¶
All the terms we define through the API are also visible in the Chaise user interface.
display(HTML(f'<a href="{ml_instance.chaise_url("My term set")}">Browse vocabulary: My term set</a>'))
# Clean up: delete the demo catalog created for this notebook.
test_catalog.delete_ermrest_catalog(really=True)