Dataset Auxiliary Classes
Supporting classes for dataset operations, including version management, dataset specifications, history tracking, and configuration.
This module defines VersionPart, DatasetVersion, DatasetHistory, DatasetMinid, DatasetSpec, and DatasetSpecConfig -- the value objects used throughout DerivaML to represent dataset versions, provenance records, and hydra-zen configuration entries.
DatasetHistory
Bases: BaseModel
Class representing a dataset history.
Attributes:
| Name | Type | Description |
|---|---|---|
| `dataset_version` | `DatasetVersion` | A DatasetVersion object which captures the semantic versioning of the dataset. |
| `dataset_rid` | `RID` | The RID of the dataset. |
| `version_rid` | `RID` | The RID of the version record for the dataset in the Dataset_Version table. |
| `minid` | `str` | The URL that represents the handle of the dataset bag. This will be None if a MINID has not been created yet. |
| `snapshot` | `str` | Catalog snapshot ID of when the version record was created. |
Source code in src/deriva_ml/dataset/aux_classes.py
DatasetMinid
Bases: BaseModel
Represents information about a MINID that refers to a dataset.
Attributes:
| Name | Type | Description |
|---|---|---|
| `dataset_version` | `DatasetVersion` | A DatasetVersion object which captures the semantic versioning of the dataset. |
| `metadata` | `dict` | A dictionary containing metadata from the MINID landing page. |
| `minid` | `str` | The URL that represents the handle of the MINID associated with the dataset. |
| `bag_url` | `str` | The URL to the dataset bag. |
| `identifier` | `str` | The identifier of the MINID in CURI form. |
| `landing_page` | `str` | The URL to the landing page of the MINID. |
| `version_rid` | `str` | RID of the dataset version. |
| `checksum` | `str` | The checksum of the MINID in SHA256 form. |
Source code in src/deriva_ml/dataset/aux_classes.py
DatasetSpec
Bases: BaseModel
Represents a dataset_table entry in the dataset_table list of an execution configuration.
Attributes:
| Name | Type | Description |
|---|---|---|
| `rid` | `RID` | A dataset_table RID. |
| `materialize` | `bool` | If False, do not materialize the dataset: download only table data, not assets. Defaults to True. |
| `version` | `DatasetVersion` | The version of the dataset. Should follow semantic versioning. |
| `exclude_tables` | `set[str] \| None` | Optional set of table names to exclude from FK path traversal during bag export. Tables in this set will not be visited, pruning branches of the FK graph. Useful for avoiding query timeouts on large tables. |
| `timeout` | `tuple[int, int] \| None` | Optional (connect_timeout, read_timeout) in seconds for network requests during bag download. Defaults to (10, 610) if not specified. Increase read_timeout for large datasets with deep FK joins. |
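The pruning effect described for exclude_tables can be illustrated with a minimal sketch over a toy graph. The table names, the `FK_GRAPH` mapping, and the `reachable_tables` helper are all hypothetical and are not part of DerivaML; the actual export traversal may work differently.

```python
from collections import deque

# Hypothetical foreign-key graph: table name -> tables one FK hop away.
FK_GRAPH = {
    "Dataset": ["Subject", "Image"],
    "Subject": ["Observation"],
    "Image": ["Annotation"],
    "Observation": [],
    "Annotation": [],
}


def reachable_tables(graph, start, exclude_tables=None):
    """Breadth-first walk that never visits a table in exclude_tables."""
    excluded = set(exclude_tables or ())
    seen = [start]
    queue = deque([start])
    while queue:
        table = queue.popleft()
        for neighbor in graph.get(table, []):
            if neighbor in excluded or neighbor in seen:
                continue  # pruned branch, or already visited
            seen.append(neighbor)
            queue.append(neighbor)
    return seen


full = reachable_tables(FK_GRAPH, "Dataset")
pruned = reachable_tables(FK_GRAPH, "Dataset", exclude_tables={"Image"})
```

In this toy graph, excluding `Image` also drops `Annotation`, because the only path to it runs through the excluded table. That is the sense in which an exclusion prunes a whole branch rather than a single table.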
Source code in src/deriva_ml/dataset/aux_classes.py
from_shorthand
classmethod
from_shorthand(s: str) -> DatasetSpec
Parse 'RID@version' into a DatasetSpec.
Used by the kwargs form of DerivaML.create_execution so
callers can write datasets=["1-XYZ@1.0.0"] instead of
instantiating a full DatasetSpec by hand. Accepts both
'RID' (bare RID; version defaults to 0.0.0) and
'RID@version' (semantic version string).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `s` | `str` | The shorthand string. Must contain at most one '@'. | required |

Returns:
| Type | Description |
|---|---|
| `DatasetSpec` | The parsed DatasetSpec. |

Raises:
| Type | Description |
|---|---|
| `ValueError` | If the string is empty or contains more than one '@'. |
Example
Parse a shorthand with an explicit version:
>>> spec = DatasetSpec.from_shorthand("1-XYZ@2.0.0")
>>> spec.rid
'1-XYZ'
>>> str(spec.version)
'2.0.0'
Parse a bare RID (version defaults to 0.0.0):
>>> spec = DatasetSpec.from_shorthand("1-XYZ")
>>> spec.rid
'1-XYZ'
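The splitting rule can be sketched with the standard library alone. This is an illustration of the documented behaviour, not the library's implementation: `split_shorthand` is a hypothetical helper that returns plain strings instead of a DatasetSpec, and its handling of a trailing '@' (treated like a bare RID) is an assumption.

```python
def split_shorthand(s: str) -> tuple[str, str]:
    """Split 'RID' or 'RID@version' into (rid, version_string)."""
    if not s:
        raise ValueError("shorthand string is empty")
    if s.count("@") > 1:
        raise ValueError(f"expected at most one '@' in {s!r}")
    rid, _, version = s.partition("@")
    # A bare RID falls back to version 0.0.0, as documented for from_shorthand.
    return rid, version or "0.0.0"


with_version = split_shorthand("1-XYZ@2.0.0")
bare = split_shorthand("1-XYZ")
```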
Source code in src/deriva_ml/dataset/aux_classes.py
DatasetSpecConfig
Hydra-zen configuration dataclass for DatasetSpec.
Use this in hydra-zen store() calls and configuration modules to
specify dataset inputs. When instantiated by hydra-zen, it produces a
DatasetSpec instance.
Attributes:
| Name | Type | Description |
|---|---|---|
| `rid` | `str` | Dataset RID. |
| `version` | `str` | Semantic version string. |
| `materialize` | `bool` | If False, download only table metadata, not asset files. |
| `description` | `str` | Human-readable description of the dataset's role in this config. |
| `exclude_tables` | `list[str] \| None` | Optional table names to exclude from FK path traversal during bag export. |
| `timeout` | `list[int] \| None` | Optional timeout setting (see the timeout attribute of DatasetSpec). |
| `fetch_concurrency` | `int` | Number of concurrent fetch threads for asset download. |
Source code in src/deriva_ml/dataset/aux_classes.py
DatasetVersion
Bases: Version
A PEP 440 version associated with a dataset.
Released versions are written as "MAJOR.MINOR.PATCH" (e.g., "0.4.0").
Dev versions use the setuptools-scm-compatible post-release form
"<last_release>.post1.devN" (e.g., "0.4.0.post1.dev3"). They sort
after the last release and before the next, and can be identified via
the is_devrelease attribute. See ADR-0004 for the rationale behind PEP 440 over
semver pre-release suffixes.
The wire format for released versions is unchanged from the previous
semver-backed implementation: a string like "0.4.0" parses and
serialises identically.
Example
Construct from positional integers (release-segment form):
>>> v = DatasetVersion(0, 4, 0)
>>> str(v)
'0.4.0'
>>> v.is_devrelease
False
Construct from a string (any PEP 440 form):
>>> dev = DatasetVersion.parse("0.4.0.post1.dev3")
>>> dev.is_devrelease
True
>>> DatasetVersion(0, 4, 0) < dev < DatasetVersion(0, 5, 0)
True
Advance the release-segment for a release:
>>> DatasetVersion(0, 4, 0).next_release(VersionPart.minor)
<Version('0.5.0')>
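The ordering claim (a dev version sorts after its base release and before the next release) can be checked with a small standard-library sketch. It handles only the two forms described above and is not how DatasetVersion compares values; the real class inherits its comparison from its Version base. The `sort_key` helper and the `_FORM` pattern are hypothetical.

```python
import re

# Matches only the two forms described above: "X.Y.Z" and "X.Y.Z.post1.devN".
_FORM = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:\.post1\.dev(\d+))?$")


def sort_key(version: str) -> tuple[int, int, int, int, int]:
    """Order by release segment, then place dev versions after their base release."""
    m = _FORM.match(version)
    if m is None:
        raise ValueError(f"unsupported version form: {version!r}")
    major, minor, patch = (int(m.group(i)) for i in (1, 2, 3))
    if m.group(4) is None:
        return (major, minor, patch, 0, 0)  # released version
    return (major, minor, patch, 1, int(m.group(4)))  # dev version, ordered by N


versions = ["0.5.0", "0.4.0.post1.dev3", "0.4.0", "0.4.0.post1.dev1"]
ordered = sorted(versions, key=sort_key)
```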
Source code in src/deriva_ml/dataset/aux_classes.py
patch
property
patch: int
The patch component of the release segment.
packaging.Version exposes this as micro. patch is
kept on DatasetVersion because it matches the VersionPart
vocabulary and the column meaning.
__init__
__init__(
major: SupportsInt,
minor: SupportsInt = 0,
patch: SupportsInt = 0,
) -> None
Construct a released DatasetVersion from a release-segment tuple.
For PEP 440 forms beyond MAJOR.MINOR.PATCH (post-release, dev,
local, etc.), use parse with the canonical string form.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `major` | `SupportsInt` | Major version number. Schema-altering changes. | required |
| `minor` | `SupportsInt` | Minor version number. Additive changes. | `0` |
| `patch` | `SupportsInt` | Patch number. Small clean-ups and edits. | `0` |
Source code in src/deriva_ml/dataset/aux_classes.py
next_release
next_release(
bump: VersionPart,
) -> DatasetVersion
Return the next released DatasetVersion after this one.
Applies a release-segment bump to (major, minor, patch) and
discards any post-release / dev / local segments — the new value
is always a clean released version. Higher-order bumps reset
lower-order components to zero, matching the standard
major.minor.patch convention.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `bump` | `VersionPart` | Which part of the release segment to advance. | required |
Returns:
| Type | Description |
|---|---|
| `DatasetVersion` | A new released DatasetVersion with the selected component advanced. |
Example
>>> DatasetVersion(0, 4, 0).next_release(VersionPart.minor)
<Version('0.5.0')>
>>> DatasetVersion(0, 4, 7).next_release(VersionPart.major)
<Version('1.0.0')>
>>> DatasetVersion.parse("0.4.0.post1.dev3").next_release(
...     VersionPart.minor
... )
<Version('0.5.0')>
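The bump rule (advance one component, reset the lower-order ones to zero) can be sketched on plain tuples. `Part` is a hypothetical stand-in for VersionPart and `bump_release` is an illustrative helper; neither is the library's code, and the real method returns a DatasetVersion, not a tuple.

```python
from enum import Enum


class Part(Enum):
    """Hypothetical stand-in for VersionPart."""

    major = "major"
    minor = "minor"
    patch = "patch"


def bump_release(release: tuple[int, int, int], part: Part) -> tuple[int, int, int]:
    """Advance one component of (major, minor, patch), zeroing lower-order ones."""
    major, minor, patch = release
    if part is Part.major:
        return (major + 1, 0, 0)
    if part is Part.minor:
        return (major, minor + 1, 0)
    return (major, minor, patch + 1)


minor_bump = bump_release((0, 4, 0), Part.minor)
major_bump = bump_release((0, 4, 7), Part.major)
patch_bump = bump_release((0, 4, 7), Part.patch)
```

Because the function receives only the release tuple, post-release and dev segments cannot survive the bump. This mirrors the documented guarantee that the result is always a clean released version.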
Source code in src/deriva_ml/dataset/aux_classes.py
parse
classmethod
parse(version: str) -> DatasetVersion
Parse a PEP 440 version string into a DatasetVersion.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `version` | `str` | A PEP 440 version string. | required |
Returns:
| Type | Description |
|---|---|
| `DatasetVersion` | A new DatasetVersion. |
Raises:
| Type | Description |
|---|---|
| `InvalidVersion` | If version is not a valid PEP 440 version string. |
Example
>>> str(DatasetVersion.parse("0.4.0"))
'0.4.0'
>>> DatasetVersion.parse("0.4.0.post1.dev3").is_devrelease
True
Source code in src/deriva_ml/dataset/aux_classes.py
to_dict
to_dict() -> dict[str, int]
Serialise the release segment as a {major, minor, patch} dict.
Used by DatasetSpec's field serializer for hydra-zen
round-tripping. Pre-release, post-release, dev, and local segments
are not preserved in this form; it represents only the
release-segment tuple. Use str(self) for a lossless serialisation.
Returns:
| Type | Description |
|---|---|
| `dict[str, int]` | A dict with integer major, minor, and patch values. |
Example
>>> DatasetVersion(1, 2, 3).to_dict()
{'major': 1, 'minor': 2, 'patch': 3}
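The lossy nature of this form can be seen in a minimal sketch that works on version strings. `release_to_dict` is a hypothetical helper, not part of DerivaML, and it assumes the string begins with three numeric dot-separated components.

```python
def release_to_dict(version: str) -> dict[str, int]:
    """Keep only the first three numeric components of a version string."""
    major, minor, patch = (int(p) for p in version.split(".")[:3])
    return {"major": major, "minor": minor, "patch": patch}


released = release_to_dict("0.4.0")
dev = release_to_dict("0.4.0.post1.dev3")
```

Both inputs produce the same dict, so a dev version cannot be told apart from its base release once serialised this way. That is why the string form is the one to use when the distinction matters.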
Source code in src/deriva_ml/dataset/aux_classes.py
VersionPart
Bases: Enum
Names the component of a dataset version to advance on release.
DerivaML uses a major.minor.patch release segment within the broader
PEP 440 version space (see ADR-0004). Picking a VersionPart selects
which component is incremented when a dev period is promoted to a
released version.
Attributes:
| Name | Description |
|---|---|
| `major` | Schema-altering changes that break backward compatibility. |
| `minor` | Additive changes: new members, new feature values, new annotations. |
| `patch` | Small clean-ups and edits that don't change the dataset's shape. |
Source code in src/deriva_ml/dataset/aux_classes.py