Dataset Auxiliary Classes
Auxiliary classes for dataset operations, covering version management, dataset specifications, history tracking, and configuration.
This module defines VersionPart, DatasetVersion, DatasetHistory, DatasetMinid, DatasetSpec, and DatasetSpecConfig -- the value objects used throughout DerivaML to represent dataset versions, provenance records, and hydra-zen configuration entries.
DatasetHistory
Bases: BaseModel
Represents one record in a dataset's version history.
Attributes:
| Name | Type | Description |
|---|---|---|
| `dataset_version` | `DatasetVersion` | A DatasetVersion object which captures the PEP 440 version of the dataset. |
| `dataset_rid` | `RID` | The RID of the dataset. |
| `version_rid` | `RID` | The RID of the version record for the dataset in the Dataset_Version table. |
| `execution_rid` | `RID \| None` | RID of the execution that created this version, or None when the version was created outside an execution. |
| `description` | `str \| None` | Human-readable description recorded with this version (empty string when none was supplied). |
| `minid` | `str` | The URL that represents the handle of the dataset bag. This will be None if a MINID has not been created yet. |
| `spec_hash` | `str \| None` | Hash of the dataset spec used to build this version, or None when not recorded. Used to detect cache hits. |
| `snapshot` | `str` | Catalog snapshot ID of when the version record was created. |
Source code in src/deriva_ml/dataset/aux_classes.py
DatasetMinid
Bases: BaseModel
Represents information about a MINID that refers to a dataset.
Attributes:
| Name | Type | Description |
|---|---|---|
| `dataset_version` | `DatasetVersion` | A DatasetVersion object which captures the PEP 440 version of the dataset. |
| `metadata` | `dict` | A dictionary containing metadata from the MINID landing page. |
| `minid` | `str` | The URL that represents the handle of the MINID associated with the dataset. |
| `bag_url` | `str` | The URL to the dataset bag. |
| `identifier` | `str` | The identifier of the MINID in CURI form. |
| `landing_page` | `str` | The URL to the landing page of the MINID. |
| `version_rid` | `str` | RID of the dataset version. |
| `checksum` | `str` | The checksum of the MINID in SHA256 form. |
DatasetSpec
Bases: BaseModel
Represents a dataset_table entry in the dataset_table list of an execution configuration.
Attributes:
| Name | Type | Description |
|---|---|---|
| `rid` | `RID` | A dataset_table RID. |
| `materialize` | `bool` | If False, do not materialize the dataset: download only table data, not assets. Defaults to True. |
| `version` | `DatasetVersion` | The version of the dataset. Should follow PEP 440 (see ADR-0004). |
| `description` | `str` | Optional human-readable note describing this spec's role in the configuration. Defaults to the empty string. |
| `exclude_tables` | `set[str] \| None` | Optional set of table names to exclude from FK path traversal during bag export. Tables in this set will not be visited, pruning branches of the FK graph. Useful for avoiding query timeouts on large tables. |
| `timeout` | `tuple[int, int] \| None` | Optional (connect_timeout, read_timeout) pair, in seconds, for network requests during bag download. Defaults to (10, 610) if not specified. Increase read_timeout for large datasets with deep FK joins. |
| `fetch_concurrency` | `int` | Number of concurrent fetch threads for asset download during materialization. Defaults to 8. |
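The pruning effect described for `exclude_tables` can be illustrated with a small, self-contained sketch. This is not DerivaML's traversal code: the graph, the table names, and the `reachable_tables` helper are all hypothetical, and the sketch only assumes that an excluded table is never visited, so nothing reachable solely through it is visited either.

```python
from collections import deque

# Hypothetical FK graph: table name -> tables reachable through a foreign key.
# The table names are illustrative only and do not come from a real schema.
FK_GRAPH = {
    "Dataset": ["Subject", "Image"],
    "Subject": ["Observation"],
    "Image": ["Annotation"],
    "Observation": [],
    "Annotation": [],
}


def reachable_tables(graph, start, exclude_tables=None):
    """Breadth-first walk that never visits a table listed in exclude_tables."""
    excluded = set(exclude_tables or ())
    seen = [start]
    queue = deque([start])
    while queue:
        table = queue.popleft()
        for neighbour in graph.get(table, []):
            if neighbour in excluded or neighbour in seen:
                continue  # skipping an excluded table prunes its whole branch
            seen.append(neighbour)
            queue.append(neighbour)
    return seen


print(reachable_tables(FK_GRAPH, "Dataset"))
# ['Dataset', 'Subject', 'Image', 'Observation', 'Annotation']
print(reachable_tables(FK_GRAPH, "Dataset", exclude_tables={"Image"}))
# ['Dataset', 'Subject', 'Observation']
```

Excluding `Image` also drops `Annotation`, because in this toy graph `Annotation` is reachable only through `Image`.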
from_shorthand
classmethod
from_shorthand(s: str) -> DatasetSpec
Parse `'RID@version'` into a `DatasetSpec`.
Used by the keyword-argument form of `DerivaML.create_execution` so callers can write `datasets=["1-XYZ@1.0.0"]` instead of instantiating a full `DatasetSpec` by hand. Accepts both `'RID'` (bare RID; version defaults to 0.0.0) and `'RID@version'` (semantic version string).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `s` | `str` | The shorthand string. Must contain at most one `@`. | required |
Returns:
| Type | Description |
|---|---|
| `DatasetSpec` | The parsed `DatasetSpec`. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If the string is empty or contains more than one `@`. |
Example
Parse a shorthand with an explicit version:
>>> spec = DatasetSpec.from_shorthand("1-XYZ0@2.0.0")
>>> spec.rid
'1-XYZ0'
>>> str(spec.version)
'2.0.0'
Parse a bare RID (version defaults to 0.0.0):
>>> spec = DatasetSpec.from_shorthand("1-XYZ0")
>>> spec.rid
'1-XYZ0'
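The parsing rules above can be sketched without DerivaML installed. The helper below is a hypothetical stand-in, not the library's implementation. It assumes the separator limit refers to `@`, returns plain strings instead of a `DatasetSpec`, and leaves unspecified edge cases (such as an empty version after `@`) unhandled.

```python
def split_shorthand(s: str) -> tuple[str, str]:
    """Split 'RID' or 'RID@version' into (rid, version_string).

    Hypothetical stand-in for the parsing step; not DerivaML code.
    """
    if not s:
        raise ValueError("shorthand must not be empty")
    if s.count("@") > 1:
        raise ValueError(f"expected at most one '@' in {s!r}")
    rid, sep, version = s.partition("@")
    # A bare RID carries no version, so fall back to the documented default.
    return rid, (version if sep else "0.0.0")


print(split_shorthand("1-XYZ0@2.0.0"))  # ('1-XYZ0', '2.0.0')
print(split_shorthand("1-XYZ0"))        # ('1-XYZ0', '0.0.0')
```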
DatasetSpecConfig
Hydra-zen configuration dataclass for `DatasetSpec`.
Use this in hydra-zen `store()` calls and configuration modules to specify dataset inputs. When instantiated by hydra-zen, it produces a `DatasetSpec` instance.
Attributes:
| Name | Type | Description |
|---|---|---|
| `rid` | `str` | Dataset RID. |
| `version` | `str` | Semantic version string. |
| `materialize` | `bool` | If False, download only table metadata, not asset files. |
| `description` | `str` | Human-readable description of the dataset's role in this config. |
| `exclude_tables` | `list[str] \| None` | Optional table names to exclude from FK path traversal during bag export. |
| `timeout` | `list[int] \| None` | Optional timeout for network requests during bag download (see the `timeout` attribute of `DatasetSpec`). |
| `fetch_concurrency` | `int` | Number of concurrent fetch threads for asset download. |
DatasetVersion
Bases: Version
A PEP 440 version associated with a dataset.
Released versions are written as `"MAJOR.MINOR.PATCH"` (e.g., `"0.4.0"`). Dev versions use the setuptools-scm-compatible post-release form `"<last_release>.post1.devN"` (e.g., `"0.4.0.post1.dev3"`). They sort after the last release and before the next, and can be detected via the `is_devrelease` attribute. See ADR-0004 for the rationale behind PEP 440 over semver pre-release suffixes.
The wire format for released versions is unchanged from the previous semver-backed implementation: a string like `"0.4.0"` parses and serialises identically.
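The ordering claim can be checked with a standard-library-only sketch. The `sort_key` helper is hypothetical and handles only the two forms described above. It is not a full PEP 440 parser, and `DatasetVersion` itself presumably inherits comparison from its `Version` base class rather than using anything like this.

```python
import re

# Matches only the two forms described above: "X.Y.Z" and "X.Y.Z.postP.devN".
_FORM = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:\.post(\d+)\.dev(\d+))?$")


def sort_key(version: str):
    """Ordering key for the two documented forms (not a full PEP 440 parser)."""
    m = _FORM.match(version)
    if m is None:
        raise ValueError(f"unsupported version form: {version!r}")
    major, minor, patch, post, dev = m.groups()
    release = (int(major), int(minor), int(patch))
    if post is None:
        # Plain release: sorts before any post/dev build of the same release.
        return (release, -1, -1)
    return (release, int(post), int(dev))


versions = ["0.5.0", "0.4.0.post1.dev3", "0.4.0", "0.4.0.post1.dev1"]
print(sorted(versions, key=sort_key))
# ['0.4.0', '0.4.0.post1.dev1', '0.4.0.post1.dev3', '0.5.0']
```

Because the release tuple is compared first, every dev build of `0.4.0` lands between `0.4.0` and `0.5.0`.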
Example
Construct from positional integers (release-segment form):
>>> v = DatasetVersion(0, 4, 0)
>>> str(v)
'0.4.0'
>>> v.is_devrelease
False
Construct from a string (any PEP 440 form):
>>> dev = DatasetVersion.parse("0.4.0.post1.dev3")
>>> dev.is_devrelease
True
>>> DatasetVersion(0, 4, 0) < dev < DatasetVersion(0, 5, 0)
True
Advance the release-segment for a release:
>>> from deriva_ml.dataset.aux_classes import VersionPart
>>> DatasetVersion(0, 4, 0).next_release(VersionPart.minor)
<DatasetVersion('0.5.0')>
patch
property
patch: int
The patch component of the release segment.
`packaging.Version` exposes this as `micro`. `patch` is kept on `DatasetVersion` because it matches the `VersionPart` vocabulary and the column meaning.
__init__
__init__(
major: SupportsInt,
minor: SupportsInt = 0,
patch: SupportsInt = 0,
) -> None
Construct a released `DatasetVersion` from a release-segment tuple.
For PEP 440 forms beyond MAJOR.MINOR.PATCH (post-release, dev, local, etc.), use `parse` with the canonical string form.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `major` | `SupportsInt` | Major version number. Schema-altering changes. | required |
| `minor` | `SupportsInt` | Minor version number. Additive changes. | `0` |
| `patch` | `SupportsInt` | Patch number. Small clean-ups and edits. | `0` |
next_release
next_release(
bump: VersionPart,
) -> DatasetVersion
Return the next released `DatasetVersion` after this one.
Applies a release-segment bump to (major, minor, patch) and discards any post-release, dev, or local segments, so the new value is always a clean released version. Higher-order bumps reset lower-order components to zero, matching the standard major.minor.patch convention.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `bump` | `VersionPart` | Which part of the release segment to advance. | required |
Returns:
| Type | Description |
|---|---|
| `DatasetVersion` | A new released `DatasetVersion` with the selected component advanced. |
Example
>>> from deriva_ml.dataset.aux_classes import VersionPart
>>> DatasetVersion(0, 4, 0).next_release(VersionPart.minor)
<DatasetVersion('0.5.0')>
>>> DatasetVersion(0, 4, 7).next_release(VersionPart.major)
<DatasetVersion('1.0.0')>
>>> DatasetVersion.parse("0.4.0.post1.dev3").next_release(
...     VersionPart.minor
... )
<DatasetVersion('0.5.0')>
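The bump rule (advance one component, zero the lower-order ones) can be sketched on a plain tuple. `Part` and `bump_release` are hypothetical stand-ins for `VersionPart` and `next_release`. The sketch models only the release segment, on the assumption that any post-release or dev segment has already been dropped.

```python
from enum import Enum


class Part(str, Enum):
    """Stand-in for VersionPart (a str-mixin Enum, so it also runs before Python 3.11)."""

    major = "major"
    minor = "minor"
    patch = "patch"


def bump_release(release: tuple[int, int, int], part: Part) -> tuple[int, int, int]:
    """Advance one component of (major, minor, patch), zeroing lower-order ones."""
    major, minor, patch = release
    if part == Part.major:
        return (major + 1, 0, 0)
    if part == Part.minor:
        return (major, minor + 1, 0)
    if part == Part.patch:
        return (major, minor, patch + 1)
    raise ValueError(f"unknown version part: {part!r}")


print(bump_release((0, 4, 0), Part.minor))  # (0, 5, 0)
print(bump_release((0, 4, 7), Part.major))  # (1, 0, 0)
print(bump_release((0, 4, 7), Part.patch))  # (0, 4, 8)
```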
parse
classmethod
parse(version: str) -> DatasetVersion
Parse a PEP 440 version string into a DatasetVersion.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `version` | `str` | A PEP 440 version string. Released forms like `"0.4.0"` and dev forms like `"0.4.0.post1.dev3"` are both accepted. | required |
Returns:
| Type | Description |
|---|---|
| `DatasetVersion` | A new `DatasetVersion`. |
Raises:
| Type | Description |
|---|---|
| `InvalidVersion` | If `version` is not a valid PEP 440 version string. |
Example
>>> str(DatasetVersion.parse("0.4.0"))
'0.4.0'
>>> DatasetVersion.parse("0.4.0.post1.dev3").is_devrelease
True
to_dict
to_dict() -> dict[str, int]
Serialise the release segment as a `{major, minor, patch}` dict.
Used by the field serializer of `DatasetSpec` for hydra-zen round-tripping. Pre-release, post-release, dev, and local segments are not preserved in this form; it represents only the release-segment tuple. Use `str(self)` for a lossless serialisation.
Returns:
| Type | Description |
|---|---|
| `dict[str, int]` | A dict with integer `major`, `minor`, and `patch` values. |
Example
>>> DatasetVersion(1, 2, 3).to_dict()
{'major': 1, 'minor': 2, 'patch': 3}
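The lossy nature of the dict form is easiest to see with a released and a dev version side by side. The `release_dict` helper below is a hypothetical, string-based approximation of `to_dict`. It assumes the release segment always has exactly three components and does not use DerivaML.

```python
def release_dict(version: str) -> dict[str, int]:
    """Keep only the first three dot-separated components of a version string."""
    major, minor, patch = (int(p) for p in version.split(".")[:3])
    return {"major": major, "minor": minor, "patch": patch}


released = "0.4.0"
dev = "0.4.0.post1.dev3"

print(release_dict(dev))  # {'major': 0, 'minor': 4, 'patch': 0}
# Both strings collapse to the same dict, so the dict alone cannot tell them apart.
print(release_dict(released) == release_dict(dev))  # True
```

This is why the string form is the one to use when the post-release or dev segment must survive a round trip.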
VersionPart
Bases: StrEnum
Names the component of a dataset version to advance on release.
DerivaML uses a major.minor.patch release segment within the broader PEP 440 version space (see ADR-0004). Picking a `VersionPart` selects which component is incremented when a dev period is promoted to a released version.
Inherits from `enum.StrEnum` so members compare equal to their raw string values. This matters because `VALIDATION_CONFIG` has `use_enum_values=True`, so Pydantic's `@validate_call` coerces members to their string values at the call boundary; the `match`/`case` in `DatasetVersion.next_release` needs the coerced string to still equal the enum member it's being compared to.
Attributes:
| Name | Description |
|---|---|
| `major` | Schema-altering changes that break backward compatibility. |
| `minor` | Additive changes: new members, new feature values, new annotations. |
| `patch` | Small clean-ups and edits that don't change the dataset's shape. |