Creating a New Notebook

This guide walks you through adding a new analysis notebook to your DerivaML project.

Overview

Adding a new notebook involves three steps:

Create the configuration - Define parameters for your notebook
Create the notebook - Use run_notebook() for initialization
Test and run - Verify it works end-to-end

Step 1: Create the Configuration

Create a configuration file in src/configs/ for your notebook.

Simple Notebook (Standard Parameters Only)

If your notebook only needs the standard fields (assets, datasets, deriva_ml):

# src/configs/my_analysis.py
"""Configuration for my analysis notebook.

This module registers the configuration for the my_analysis notebook.
"""
from deriva_ml.execution import notebook_config

notebook_config(
    "my_analysis",
    defaults={
        "assets": "my_assets",      # Which asset group to use
        "datasets": "my_datasets",   # Which dataset group to use
    },
    description="My analysis notebook",
)

Notebook with Custom Parameters

If your notebook needs additional configuration options:

# src/configs/my_analysis.py
"""Configuration for my analysis notebook with custom parameters."""
from dataclasses import dataclass

from deriva_ml.execution import BaseConfig, notebook_config


@dataclass
class MyAnalysisConfig(BaseConfig):
    """Configuration for my analysis notebook.

    Attributes:
        threshold: Confidence threshold for predictions.
        show_plots: Whether to display plots inline.
        output_format: Format for output files ('csv' or 'json').
    """
    threshold: float = 0.5
    show_plots: bool = True
    output_format: str = "csv"


notebook_config(
    "my_analysis",
    config_class=MyAnalysisConfig,
    defaults={"assets": "my_assets"},
    description="My analysis with custom parameters",
)

Configuration Tips

Use descriptive field names that make the configuration self-documenting
Provide sensible defaults so notebooks run without extra configuration
Add docstrings to explain what each parameter controls
Match defaults to asset/dataset groups you've defined in assets.py and datasets.py

Step 2: Create the Notebook

Create your notebook in the notebooks/ directory. Use the run_notebook() API for initialization:

# Cell 1: Initialization
from deriva_ml.execution import run_notebook

# Initialize with one call - this handles:
# - Loading configuration
# - Connecting to DerivaML
# - Creating workflow and execution
# - Downloading input datasets
ml, execution, config = run_notebook("my_analysis")

# Access configuration values
print(f"Threshold: {config.threshold}")
print(f"Show plots: {config.show_plots}")
print(f"Output format: {config.output_format}")

# Access standard fields
print(f"Assets: {config.assets}")
print(f"Datasets: {config.datasets}")

# Cell 2: Your analysis code
import pandas as pd

# Access downloaded datasets
for dataset in execution.datasets:
    print(f"Processing dataset: {dataset.rid}")
    # Dataset files are already downloaded to execution.execution_working_dir

# Use configuration values
if config.threshold > 0:
    # Apply threshold filtering...
    pass

if config.show_plots:
    # Display plots...
    pass

# Cell 3: Save outputs
# Register output files for upload
output_path = execution.asset_file_path("Execution_Metadata", f"results.{config.output_format}")

# Write your results
# results_df.to_csv(output_path)

print(f"Results saved to: {output_path}")

# Cell 4: Upload (final cell)
# Upload all registered outputs to the catalog
execution.upload_execution_outputs()
print("Outputs uploaded successfully!")

Step 3: Test Your Notebook

Interactive Testing

Open the notebook in JupyterLab: bash uv run jupyter lab
Select your repository's kernel (e.g., your-repo-name)
Run cells interactively to debug

Command Line Testing

# Show available configuration options
uv run deriva-ml-run-notebook notebooks/my_analysis.ipynb --info

# Dry run (if supported by your notebook)
uv run deriva-ml-run-notebook notebooks/my_analysis.ipynb \
  dry_run=true

# Full run with default configuration
uv run deriva-ml-run-notebook notebooks/my_analysis.ipynb

# Run with overrides
uv run deriva-ml-run-notebook notebooks/my_analysis.ipynb \
  threshold=0.8 show_plots=false

Common Patterns

Accessing Dataset Files

# Datasets are downloaded during run_notebook()
# Access them via execution
for dataset_spec in execution.datasets:
    dataset_dir = execution.execution_working_dir / "datasets" / dataset_spec.rid
    # Process files in dataset_dir

Accessing Input Assets

# Input assets are specified in config.assets
for asset_rid in config.assets:
    # Download or reference the asset
    pass

Registering Multiple Output Types

# Register different output types
plots_path = execution.asset_file_path("Image", "analysis_plot.png")
data_path = execution.asset_file_path("Execution_Metadata", "results.csv")
model_path = execution.asset_file_path("Model", "trained_model.pt")

# Save your files to these paths...

Conditional Execution

# Use config values to control behavior
if config.show_plots:
    import matplotlib.pyplot as plt
    # Generate plots...
    plt.savefig(plots_path)

if config.output_format == "json":
    results.to_json(data_path)
else:
    results.to_csv(data_path)

Complete Example: ROC Analysis

See src/configs/roc_analysis.py and notebooks/roc_analysis.ipynb for a complete working example that:

Uses custom configuration parameters (show_per_class, confidence_threshold)
Processes probability files from model outputs
Generates ROC curves
Uploads analysis results to the catalog

Checklist

Before running your notebook in production:

[ ] Configuration file created in src/configs/
[ ] Notebook uses run_notebook() for initialization
[ ] All outputs registered with execution.asset_file_path()
[ ] Final cell calls execution.upload_execution_outputs()
[ ] Notebook runs start-to-finish without intervention
[ ] Commit all changes to Git
[ ] Test with --info flag to verify configuration
[ ] Strip output cells before committing (nbstripout should handle this)