Dataset Auxiliary Classes

Supporting classes for dataset operations including version management, dataset specifications, and history tracking.

Auxiliary classes for dataset versioning, history, and configuration.

This module defines VersionPart, DatasetVersion, DatasetHistory, DatasetMinid, DatasetSpec, and DatasetSpecConfig -- the value objects used throughout DerivaML to represent dataset versions, provenance records, and hydra-zen configuration entries.

DatasetHistory

Bases: BaseModel

Class representing a dataset history.

Attributes:

Name Type Description
dataset_version DatasetVersion

A DatasetVersion object which captures the semantic versioning of the dataset.

dataset_rid RID

The RID of the dataset.

version_rid RID

The RID of the version record for the dataset in the Dataset_Version table.

minid str | None

The URL that represents the handle of the dataset bag. This will be None if a MINID has not been created yet.

snapshot str | None

Catalog snapshot ID of when the version record was created.

Source code in src/deriva_ml/dataset/aux_classes.py
class DatasetHistory(BaseModel):
    """
    Class representing a dataset history.

    Attributes:
        dataset_version (DatasetVersion): A DatasetVersion object which captures the semantic versioning of the dataset.
        dataset_rid (RID): The RID of the dataset.
        version_rid (RID): The RID of the version record for the dataset in the Dataset_Version table.
        minid (str): The URL that represents the handle of the dataset bag.  This will be None if a MINID has not
                     been created yet.
        snapshot (str): Catalog snapshot ID of when the version record was created.
    """

    dataset_version: DatasetVersion
    dataset_rid: RID
    version_rid: RID
    execution_rid: Optional[RID] = None
    description: str | None = ""
    minid: str | None = None
    spec_hash: str | None = None
    snapshot: str | None = None

    model_config = VALIDATION_CONFIG

    @field_validator("execution_rid", mode="before")
    @classmethod
    def _default_execution_rid(cls, v: str | None) -> str | None:
        return None if v == "" else v

    @field_validator("description", mode="after")
    def _default_description(cls, v: str | None) -> str:
        return v or ""
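
The two validators at the end of the class normalise incoming values: an empty-string execution_rid becomes None, and a missing description becomes an empty string. The following is a minimal standard-library sketch of that normalisation on a plain dict, not the pydantic implementation itself; the function name normalize_history_row and the sample row are hypothetical.

```python
from typing import Any


def normalize_history_row(row: dict[str, Any]) -> dict[str, Any]:
    """Mirror the two DatasetHistory validators on a plain dict (sketch only)."""
    out = dict(row)
    # An empty-string execution RID is treated as "no execution".
    if out.get("execution_rid") == "":
        out["execution_rid"] = None
    # A missing or None description is stored as an empty string.
    out["description"] = out.get("description") or ""
    return out


# Hypothetical row, as it might arrive from a catalog query.
row = {"dataset_rid": "1-XYZ", "execution_rid": "", "description": None}
normalized = normalize_history_row(row)
print(normalized)
```

Keeping the normalisation in one place means downstream code can test `execution_rid is None` without also having to handle the empty string.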

DatasetMinid

Bases: BaseModel

Represent information about a MINID that refers to a dataset

Attributes:

Name Type Description
dataset_version DatasetVersion

A DatasetVersion object which captures the semantic versioning of the dataset.

metadata dict

A dictionary containing metadata from the MINID landing page.

minid str

The URL that represents the handle of the MINID associated with the dataset.

bag_url str

The URL to the dataset bag

identifier str

The identifier of the MINID in CURI form

landing_page str

The URL to the landing page of the MINID

version_rid str

RID of the dataset version.

checksum str

The checksum of the MINID in SHA256 form

Source code in src/deriva_ml/dataset/aux_classes.py
class DatasetMinid(BaseModel):
    """Represent information about a MINID that refers to a dataset

    Attributes:
        dataset_version (DatasetVersion): A DatasetVersion object which captures the semantic versioning of the dataset.
        metadata (dict): A dictionary containing metadata from the MINID landing page.
        minid (str): The URL that represents the handle of the MINID associated with the dataset.
        bag_url (str): The URL to the dataset bag
        identifier (str): The identifier of the MINID in CURI form
        landing_page (str): The URL to the landing page of the MINID
        version_rid (str): RID of the dataset version.
        checksum (str): The checksum of the MINID in SHA256 form

    """

    dataset_version: DatasetVersion
    metadata: dict[str, str | int] = {}
    minid: str | None = Field(alias="compact_uri", default=None)
    bag_url: str = Field(alias="location")
    identifier: Optional[str] = None
    landing_page: Optional[str] = None
    version_rid: RID = Field(alias="RID")
    checksum: str = Field(alias="checksums", default="")

    @computed_field
    @property
    def dataset_rid(self) -> str:
        rid_parts = self.version_rid.split("@")
        return rid_parts[0]

    @computed_field
    @property
    def dataset_snapshot(self) -> str | None:
        # ``version_rid`` is ``{rid}`` or ``{rid}@{snapshot}`` per
        # the validation pattern. The unsnapped form has no
        # ``@`` segment; surface that as ``None`` rather than
        # IndexError-ing on the missing split component.
        parts = self.version_rid.split("@", 1)
        return parts[1] if len(parts) == 2 else None

    @model_validator(mode="before")
    @classmethod
    def insert_metadata(cls, data: dict) -> dict:
        if isinstance(data, dict):
            if "metadata" in data:
                data = data | data["metadata"]
        return data

    @field_validator("bag_url", mode="before")
    @classmethod
    def convert_location_to_str(cls, value: list[str] | str) -> str:
        return value[0] if isinstance(value, list) else value

    @field_validator("checksum", mode="before")
    @classmethod
    def convert_checksum_to_value(cls, checksums: list[dict]) -> str:
        checksum_value = ""
        for checksum in checksums:
            if checksum.get("function") == "sha256":
                checksum_value = checksum.get("value")
                break
        return checksum_value

    model_config = VALIDATION_CONFIG
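
Two pieces of logic in this class are easy to check in isolation: splitting version_rid into its RID and optional snapshot, and picking the SHA-256 entry out of a list of checksum records. The sketch below reproduces both with the standard library only. The helper names and the sample values are hypothetical, and the record layout (dicts with "function" and "value" keys) is taken from the validator shown above.

```python
from typing import Optional


def split_version_rid(version_rid: str) -> tuple[str, Optional[str]]:
    """Split '{rid}' or '{rid}@{snapshot}' into (rid, snapshot)."""
    rid, sep, snapshot = version_rid.partition("@")
    # Without an '@' there is no snapshot segment, so report None.
    return rid, (snapshot if sep else None)


def pick_sha256(checksums: list[dict[str, str]]) -> str:
    """Return the first sha256 value, or '' when none is listed."""
    for entry in checksums:
        if entry.get("function") == "sha256":
            return entry.get("value", "")
    return ""


# Hypothetical inputs for illustration.
snapped = split_version_rid("1-ABC@2X3-Y4Z5")
unsnapped = split_version_rid("1-ABC")
digest = pick_sha256(
    [
        {"function": "md5", "value": "aaa"},
        {"function": "sha256", "value": "bbb"},
    ]
)
print(snapped, unsnapped, digest)
```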

DatasetSpec

Bases: BaseModel

Represent a dataset_table in an execution configuration dataset_table list

Attributes:

Name Type Description
rid RID

A dataset_table RID

materialize bool

If False, do not materialize the dataset: download only table data, not assets. Defaults to True.

version DatasetVersion

The version of the dataset. Should follow semantic versioning.

exclude_tables set[str] | None

Optional set of table names to exclude from FK path traversal during bag export. Tables in this set will not be visited, pruning branches of the FK graph. Useful for avoiding query timeouts on large tables.

timeout tuple[int, int] | None

Optional (connect_timeout, read_timeout) in seconds for network requests during bag download. Defaults to (10, 610) if not specified. Increase read_timeout for large datasets with deep FK joins.

Source code in src/deriva_ml/dataset/aux_classes.py
class DatasetSpec(BaseModel):
    """Represent a dataset_table in an execution configuration dataset_table list

    Attributes:
        rid (RID): A dataset_table RID
        materialize (bool): If False do not materialize datasets, only download table data, no assets.  Defaults to True
        version (DatasetVersion): The version of the dataset.  Should follow semantic versioning.
        exclude_tables (set[str] | None): Optional set of table names to exclude from FK path
            traversal during bag export. Tables in this set will not be visited, pruning branches
            of the FK graph. Useful for avoiding query timeouts on large tables.
        timeout (tuple[int, int] | None): Optional (connect_timeout, read_timeout) in seconds
            for network requests during bag download. Defaults to (10, 610) if not specified.
            Increase read_timeout for large datasets with deep FK joins.
    """

    rid: RID
    version: DatasetVersion | conlist(item_type=int, min_length=3, max_length=3) | tuple[int, int, int] | str
    materialize: bool = True
    description: str = ""
    exclude_tables: set[str] | None = None
    timeout: tuple[int, int] | None = None
    fetch_concurrency: int = 8

    model_config = VALIDATION_CONFIG

    @field_validator("version", mode="before")
    @classmethod
    def version_field_validator(cls, v: Any) -> Any:
        if isinstance(v, dict):
            return DatasetVersion(**v)
        elif isinstance(v, str):
            return DatasetVersion.parse(v)
        elif (isinstance(v, list) or isinstance(v, tuple)) and len(v) == 3:
            return DatasetVersion(int(v[0]), int(v[1]), int(v[2]))
        else:
            return v

    @model_validator(mode="before")
    @classmethod
    def _check_bare_rid(cls, data: Any) -> dict[str, str | bool]:
        # If you are just given a string, assume it's a rid and put into dict for further validation.
        return {"rid": data} if isinstance(data, str) else data

    @field_serializer("version")
    def serialize_version(self, version: DatasetVersion) -> dict[str, Any]:
        return version.to_dict()

    @classmethod
    def from_shorthand(cls, s: str) -> "DatasetSpec":
        """Parse ``'RID@version'`` into a :class:`DatasetSpec`.

        Used by the :meth:`DerivaML.create_execution` kwargs form so
        callers can write ``datasets=["1-XYZ@1.0.0"]`` instead of
        instantiating a full ``DatasetSpec`` by hand. Accepts both
        ``'RID'`` (bare RID; version defaults to ``0.0.0``) and
        ``'RID@version'`` (semantic version string).

        Args:
            s: The shorthand string. Must contain at most one ``'@'``
                separator and a non-empty RID portion.

        Returns:
            A :class:`DatasetSpec` instance.

        Raises:
            ValueError: If the string is empty or contains more than
                one ``'@'`` separator.

        Example:
            Parse a shorthand with explicit version::

                >>> spec = DatasetSpec.from_shorthand("1-XYZ@2.0.0")
                >>> spec.rid
                '1-XYZ'
                >>> str(spec.version)
                '2.0.0'

            Parse a bare RID (version defaults to ``0.0.0``)::

                >>> spec = DatasetSpec.from_shorthand("1-XYZ")
                >>> spec.rid
                '1-XYZ'
        """
        if not s:
            raise ValueError("empty dataset shorthand")
        parts = s.split("@")
        if len(parts) == 1:
            return cls(rid=parts[0], version="0.0.0")
        if len(parts) == 2:
            return cls(rid=parts[0], version=parts[1])
        raise ValueError(f"dataset shorthand has too many '@' separators: {s!r}")
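
The version field accepts several input shapes (a dict, a string, or a three-element list or tuple) and version_field_validator reduces them to one type. The sketch below illustrates the same dispatch with the standard library, returning a plain (major, minor, patch) tuple instead of a DatasetVersion. The name coerce_release is hypothetical, the string branch handles only the MAJOR.MINOR.PATCH form, and unlike the validator above it raises on unsupported input rather than passing it through.

```python
from typing import Any


def coerce_release(v: Any) -> tuple[int, int, int]:
    """Reduce the accepted input shapes to a (major, minor, patch) tuple."""
    if isinstance(v, dict):
        # minor and patch default to 0, as in DatasetVersion.__init__.
        return int(v["major"]), int(v.get("minor", 0)), int(v.get("patch", 0))
    if isinstance(v, str):
        # Sketch limitation: only 'MAJOR.MINOR.PATCH' strings are handled.
        major, minor, patch = (int(p) for p in v.split("."))
        return major, minor, patch
    if isinstance(v, (list, tuple)) and len(v) == 3:
        return int(v[0]), int(v[1]), int(v[2])
    raise TypeError(f"unsupported version value: {v!r}")


forms = [{"major": 1, "minor": 2, "patch": 3}, "1.2.3", [1, 2, 3], ("1", "2", "3")]
coerced = [coerce_release(f) for f in forms]
print(coerced)
```

All four inputs reduce to the same tuple, which is why a configuration file can spell the version in whichever form is most convenient.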

from_shorthand classmethod

from_shorthand(s: str) -> DatasetSpec

Parse 'RID@version' into a DatasetSpec.

Used by the DerivaML.create_execution kwargs form so callers can write datasets=["1-XYZ@1.0.0"] instead of instantiating a full DatasetSpec by hand. Accepts both 'RID' (bare RID; version defaults to 0.0.0) and 'RID@version' (semantic version string).

Parameters:

Name Type Description Default
s str

The shorthand string. Must contain at most one '@' separator and a non-empty RID portion.

required

Returns:

Type Description
DatasetSpec

A DatasetSpec instance.

Raises:

Type Description
ValueError

If the string is empty or contains more than one '@' separator.

Example

Parse a shorthand with explicit version:

>>> spec = DatasetSpec.from_shorthand("1-XYZ@2.0.0")
>>> spec.rid
'1-XYZ'
>>> str(spec.version)
'2.0.0'

Parse a bare RID (version defaults to 0.0.0):

>>> spec = DatasetSpec.from_shorthand("1-XYZ")
>>> spec.rid
'1-XYZ'
Source code in src/deriva_ml/dataset/aux_classes.py
@classmethod
def from_shorthand(cls, s: str) -> "DatasetSpec":
    """Parse ``'RID@version'`` into a :class:`DatasetSpec`.

    Used by the :meth:`DerivaML.create_execution` kwargs form so
    callers can write ``datasets=["1-XYZ@1.0.0"]`` instead of
    instantiating a full ``DatasetSpec`` by hand. Accepts both
    ``'RID'`` (bare RID; version defaults to ``0.0.0``) and
    ``'RID@version'`` (semantic version string).

    Args:
        s: The shorthand string. Must contain at most one ``'@'``
            separator and a non-empty RID portion.

    Returns:
        A :class:`DatasetSpec` instance.

    Raises:
        ValueError: If the string is empty or contains more than
            one ``'@'`` separator.

    Example:
        Parse a shorthand with explicit version::

            >>> spec = DatasetSpec.from_shorthand("1-XYZ@2.0.0")
            >>> spec.rid
            '1-XYZ'
            >>> str(spec.version)
            '2.0.0'

        Parse a bare RID (version defaults to ``0.0.0``)::

            >>> spec = DatasetSpec.from_shorthand("1-XYZ")
            >>> spec.rid
            '1-XYZ'
    """
    if not s:
        raise ValueError("empty dataset shorthand")
    parts = s.split("@")
    if len(parts) == 1:
        return cls(rid=parts[0], version="0.0.0")
    if len(parts) == 2:
        return cls(rid=parts[0], version=parts[1])
    raise ValueError(f"dataset shorthand has too many '@' separators: {s!r}")
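
The error paths of the shorthand parser can be tried without a catalog or pydantic. This is a standard-library sketch of the same splitting rules that returns a (rid, version) pair of strings rather than a DatasetSpec; the function name parse_shorthand is hypothetical.

```python
from typing import Optional


def parse_shorthand(s: str) -> tuple[str, str]:
    """Split 'RID' or 'RID@version' into (rid, version_string)."""
    if not s:
        raise ValueError("empty dataset shorthand")
    parts = s.split("@")
    if len(parts) == 1:
        # A bare RID falls back to the default version.
        return parts[0], "0.0.0"
    if len(parts) == 2:
        return parts[0], parts[1]
    raise ValueError(f"dataset shorthand has too many '@' separators: {s!r}")


results = [parse_shorthand(s) for s in ("1-XYZ", "1-XYZ@2.0.0")]

error: Optional[str] = None
try:
    parse_shorthand("1-XYZ@2.0.0@extra")
except ValueError as exc:
    error = str(exc)

print(results, error)
```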

DatasetSpecConfig

Hydra-zen configuration dataclass for DatasetSpec.

Use this in hydra-zen store() calls and configuration modules to specify dataset inputs. When instantiated by hydra-zen, it produces a DatasetSpec instance.

Attributes:

Name Type Description
rid str

Dataset RID (e.g., "28CT").

version str

Semantic version string (e.g., "0.21.0").

materialize bool

If False, download only table metadata, not asset files.

description str

Human-readable description of the dataset's role in this config.

exclude_tables list[str] | None

Optional table names to exclude from FK path traversal during bag export.

timeout list[int] | None

Optional [connect_timeout, read_timeout] in seconds for network requests during bag download.

fetch_concurrency int

Number of concurrent fetch threads for asset download.

Source code in src/deriva_ml/dataset/aux_classes.py
@hydrated_dataclass(DatasetSpec)
class DatasetSpecConfig:
    """Hydra-zen configuration dataclass for :class:`DatasetSpec`.

    Use this in hydra-zen ``store()`` calls and configuration modules to
    specify dataset inputs.  When instantiated by hydra-zen, it produces a
    :class:`DatasetSpec` instance.

    Attributes:
        rid: Dataset RID (e.g., ``"28CT"``).
        version: Semantic version string (e.g., ``"0.21.0"``).
        materialize: If False, download only table metadata, not asset files.
        description: Human-readable description of the dataset's role in this config.
        exclude_tables: Optional table names to exclude from FK path traversal
            during bag export.
        timeout: Optional ``[connect_timeout, read_timeout]`` in seconds for
            network requests during bag download.
        fetch_concurrency: Number of concurrent fetch threads for asset download.
    """

    rid: str
    version: str
    materialize: bool = True
    description: str = ""
    exclude_tables: list[str] | None = None
    timeout: list[int] | None = None
    fetch_concurrency: int = 8

DatasetVersion

Bases: Version

A PEP 440 version associated with a dataset.

Released versions are written as "MAJOR.MINOR.PATCH" (e.g., "0.4.0"). Dev versions use the setuptools-scm-compatible post-release form "<last_release>.post1.devN" (e.g., "0.4.0.post1.dev3"); they sort after the last release and before the next, and can be detected via the is_devrelease attribute. See ADR-0004 for the rationale behind PEP 440 over semver pre-release suffixes.

The wire format for released versions is unchanged from the previous semver-backed implementation: a string like "0.4.0" parses and serialises identically.

Example

Construct from positional integers (release-segment form):

>>> v = DatasetVersion(0, 4, 0)
>>> str(v)
'0.4.0'
>>> v.is_devrelease
False

Construct from a string (any PEP 440 form):

>>> dev = DatasetVersion.parse("0.4.0.post1.dev3")
>>> dev.is_devrelease
True
>>> DatasetVersion(0, 4, 0) < dev < DatasetVersion(0, 5, 0)
True

Advance the release-segment for a release:

>>> DatasetVersion(0, 4, 0).next_release(VersionPart.minor)
<Version('0.5.0')>
Source code in src/deriva_ml/dataset/aux_classes.py
class DatasetVersion(Version):
    """A PEP 440 version associated with a dataset.

    Released versions are written as ``"MAJOR.MINOR.PATCH"`` (e.g., ``"0.4.0"``).
    Dev versions use the ``setuptools-scm``-compatible post-release form
    ``"<last_release>.post1.devN"`` (e.g., ``"0.4.0.post1.dev3"``) — they sort
    *after* the last release and *before* the next, and are queryable via
    :attr:`is_devrelease`. See ADR-0004 for the rationale behind PEP 440 over
    semver pre-release suffixes.

    The wire format for released versions is unchanged from the previous
    semver-backed implementation: a string like ``"0.4.0"`` parses and
    serialises identically.

    Example:
        Construct from positional integers (release-segment form):

            >>> v = DatasetVersion(0, 4, 0)
            >>> str(v)
            '0.4.0'
            >>> v.is_devrelease
            False

        Construct from a string (any PEP 440 form):

            >>> dev = DatasetVersion.parse("0.4.0.post1.dev3")
            >>> dev.is_devrelease
            True
            >>> DatasetVersion(0, 4, 0) < dev < DatasetVersion(0, 5, 0)
            True

        Advance the release-segment for a release:

            >>> DatasetVersion(0, 4, 0).next_release(VersionPart.minor)
            <Version('0.5.0')>
    """

    def __init__(
        self,
        major: SupportsInt,
        minor: SupportsInt = 0,
        patch: SupportsInt = 0,
    ) -> None:
        """Construct a released ``DatasetVersion`` from a release-segment tuple.

        For PEP 440 forms beyond ``MAJOR.MINOR.PATCH`` (post-release, dev,
        local, etc.), use :meth:`parse` with the canonical string form.

        Args:
            major: Major version number. Schema-altering changes.
            minor: Minor version number. Additive changes.
            patch: Patch number. Small clean-ups and edits.
        """
        super().__init__(f"{int(major)}.{int(minor)}.{int(patch)}")

    @property
    def patch(self) -> int:
        """The patch component of the release segment.

        ``packaging.Version`` exposes this as :attr:`micro`. ``patch`` is
        kept on ``DatasetVersion`` because it matches the ``VersionPart``
        vocabulary and the column meaning.
        """
        return self.micro

    def to_dict(self) -> dict[str, int]:
        """Serialise the release segment as a ``{major, minor, patch}`` dict.

        Used by :class:`DatasetSpec`'s field serializer for hydra-zen
        round-tripping. Pre-release / post-release / dev / local segments
        are *not* preserved in this form — it represents only the
        release-segment tuple. Use ``str(self)`` for a lossless serialisation.

        Returns:
            A dict with integer ``major``, ``minor``, and ``patch`` fields.

        Example:
            >>> DatasetVersion(1, 2, 3).to_dict()
            {'major': 1, 'minor': 2, 'patch': 3}
        """
        return {"major": self.major, "minor": self.minor, "patch": self.patch}

    @classmethod
    def parse(cls, version: str) -> "DatasetVersion":
        """Parse a PEP 440 version string into a ``DatasetVersion``.

        Args:
            version: A PEP 440 version string. Released forms like
                ``"1.2.3"`` and dev forms like ``"1.2.3.post1.dev4"`` are
                both accepted.

        Returns:
            A new ``DatasetVersion`` corresponding to the parsed string.

        Raises:
            packaging.version.InvalidVersion: If *version* is not a valid
                PEP 440 version string.

        Example:
            >>> str(DatasetVersion.parse("0.4.0"))
            '0.4.0'
            >>> DatasetVersion.parse("0.4.0.post1.dev3").is_devrelease
            True
        """
        # Version.__init__ does the parse. DatasetVersion.__init__ only takes
        # integers, so build the instance with __new__ and initialise it from
        # the normalised string via the parent initialiser.
        v = Version(version)
        instance = cls.__new__(cls)
        Version.__init__(instance, str(v))
        return instance

    def next_release(self, bump: "VersionPart") -> "DatasetVersion":
        """Return the next released ``DatasetVersion`` after this one.

        Applies a release-segment bump to ``(major, minor, patch)`` and
        discards any post-release / dev / local segments — the new value
        is always a clean released version. Higher-order bumps reset
        lower-order components to zero, matching the standard
        ``major.minor.patch`` convention.

        Args:
            bump: Which part of the release segment to advance.

        Returns:
            A new released ``DatasetVersion`` with the requested
            component advanced.

        Example:
            >>> DatasetVersion(0, 4, 0).next_release(VersionPart.minor)
            <Version('0.5.0')>
            >>> DatasetVersion(0, 4, 7).next_release(VersionPart.major)
            <Version('1.0.0')>
            >>> DatasetVersion.parse("0.4.0.post1.dev3").next_release(
            ...     VersionPart.minor
            ... )
            <Version('0.5.0')>
        """
        match bump:
            case VersionPart.major:
                return DatasetVersion(self.major + 1, 0, 0)
            case VersionPart.minor:
                return DatasetVersion(self.major, self.minor + 1, 0)
            case VersionPart.patch:
                return DatasetVersion(self.major, self.minor, self.patch + 1)
            case _:  # pragma: no cover - defensive; VersionPart is closed
                raise ValueError(f"unknown VersionPart: {bump!r}")
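
The ordering claim in the class docstring (a dev version sorts after its last release and before the next) can be illustrated without the packaging library. The sketch below builds a sort key for just the two forms described on this page, "MAJOR.MINOR.PATCH" and "MAJOR.MINOR.PATCH.post1.devN". It is a simplified stand-in, not a full PEP 440 implementation, and sort_key is a hypothetical name; the real class inherits its comparison from packaging's Version.

```python
import math
import re

# Covers only release, optional .postN, and optional .devN segments.
_PATTERN = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:\.post(\d+))?(?:\.dev(\d+))?$")


def sort_key(version: str) -> tuple[tuple[int, int, int], int, float]:
    """Build a comparison key for the version forms used on this page."""
    m = _PATTERN.match(version)
    if m is None:
        raise ValueError(f"unsupported version form: {version!r}")
    major, minor, patch, post, dev = m.groups()
    release = (int(major), int(minor), int(patch))
    # A version without a post segment sorts before any post of that release.
    post_key = int(post) if post is not None else -1
    # A dev segment sorts before the same version without one.
    dev_key = float(dev) if dev is not None else math.inf
    return release, post_key, dev_key


versions = ["0.5.0", "0.4.0.post1.dev3", "0.4.0", "0.4.0.post1.dev1"]
ordered = sorted(versions, key=sort_key)
print(ordered)
```

The dev versions land between 0.4.0 and 0.5.0, ordered by their dev counter, which matches the comparison shown in the class example.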

patch property

patch: int

The patch component of the release segment.

packaging.version.Version exposes this as micro. patch is kept on DatasetVersion because it matches the VersionPart vocabulary and the column meaning.

__init__

__init__(
    major: SupportsInt,
    minor: SupportsInt = 0,
    patch: SupportsInt = 0,
) -> None

Construct a released DatasetVersion from a release-segment tuple.

For PEP 440 forms beyond MAJOR.MINOR.PATCH (post-release, dev, local, etc.), use parse with the canonical string form.

Parameters:

Name Type Description Default
major SupportsInt

Major version number. Schema-altering changes.

required
minor SupportsInt

Minor version number. Additive changes.

0
patch SupportsInt

Patch number. Small clean-ups and edits.

0
Source code in src/deriva_ml/dataset/aux_classes.py
def __init__(
    self,
    major: SupportsInt,
    minor: SupportsInt = 0,
    patch: SupportsInt = 0,
) -> None:
    """Construct a released ``DatasetVersion`` from a release-segment tuple.

    For PEP 440 forms beyond ``MAJOR.MINOR.PATCH`` (post-release, dev,
    local, etc.), use :meth:`parse` with the canonical string form.

    Args:
        major: Major version number. Schema-altering changes.
        minor: Minor version number. Additive changes.
        patch: Patch number. Small clean-ups and edits.
    """
    super().__init__(f"{int(major)}.{int(minor)}.{int(patch)}")

next_release

next_release(
    bump: VersionPart,
) -> DatasetVersion

Return the next released DatasetVersion after this one.

Applies a release-segment bump to (major, minor, patch) and discards any post-release / dev / local segments — the new value is always a clean released version. Higher-order bumps reset lower-order components to zero, matching the standard major.minor.patch convention.

Parameters:

Name Type Description Default
bump VersionPart

Which part of the release segment to advance.

required

Returns:

Type Description
DatasetVersion

A new released DatasetVersion with the requested component advanced.

Example

>>> DatasetVersion(0, 4, 0).next_release(VersionPart.minor)
<Version('0.5.0')>
>>> DatasetVersion(0, 4, 7).next_release(VersionPart.major)
<Version('1.0.0')>
>>> DatasetVersion.parse("0.4.0.post1.dev3").next_release(
...     VersionPart.minor
... )
<Version('0.5.0')>

Source code in src/deriva_ml/dataset/aux_classes.py
def next_release(self, bump: "VersionPart") -> "DatasetVersion":
    """Return the next released ``DatasetVersion`` after this one.

    Applies a release-segment bump to ``(major, minor, patch)`` and
    discards any post-release / dev / local segments — the new value
    is always a clean released version. Higher-order bumps reset
    lower-order components to zero, matching the standard
    ``major.minor.patch`` convention.

    Args:
        bump: Which part of the release segment to advance.

    Returns:
        A new released ``DatasetVersion`` with the requested
        component advanced.

    Example:
        >>> DatasetVersion(0, 4, 0).next_release(VersionPart.minor)
        <Version('0.5.0')>
        >>> DatasetVersion(0, 4, 7).next_release(VersionPart.major)
        <Version('1.0.0')>
        >>> DatasetVersion.parse("0.4.0.post1.dev3").next_release(
        ...     VersionPart.minor
        ... )
        <Version('0.5.0')>
    """
    match bump:
        case VersionPart.major:
            return DatasetVersion(self.major + 1, 0, 0)
        case VersionPart.minor:
            return DatasetVersion(self.major, self.minor + 1, 0)
        case VersionPart.patch:
            return DatasetVersion(self.major, self.minor, self.patch + 1)
        case _:  # pragma: no cover - defensive; VersionPart is closed
            raise ValueError(f"unknown VersionPart: {bump!r}")
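
The bump rule (advance one component, reset everything below it to zero) is independent of the version class. The following standard-library sketch applies it to plain tuples; Part is a hypothetical stand-in for VersionPart, and bump is a hypothetical name.

```python
from enum import Enum


class Part(Enum):
    """Hypothetical stand-in for VersionPart."""

    major = "major"
    minor = "minor"
    patch = "patch"


def bump(release: tuple[int, int, int], part: Part) -> tuple[int, int, int]:
    """Advance one release component and zero the lower-order ones."""
    major, minor, patch = release
    if part is Part.major:
        return (major + 1, 0, 0)
    if part is Part.minor:
        return (major, minor + 1, 0)
    return (major, minor, patch + 1)


bumped = {part.name: bump((0, 4, 7), part) for part in Part}
print(bumped)
```

Because only the release tuple is used as input, any post-release or dev suffix on the starting version has no effect on the result, which is the behaviour the docstring describes.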

parse classmethod

parse(version: str) -> DatasetVersion

Parse a PEP 440 version string into a DatasetVersion.

Parameters:

Name Type Description Default
version str

A PEP 440 version string. Released forms like "1.2.3" and dev forms like "1.2.3.post1.dev4" are both accepted.

required

Returns:

Type Description
DatasetVersion

A new DatasetVersion corresponding to the parsed string.

Raises:

Type Description
InvalidVersion

If version is not a valid PEP 440 version string.

Example

>>> str(DatasetVersion.parse("0.4.0"))
'0.4.0'
>>> DatasetVersion.parse("0.4.0.post1.dev3").is_devrelease
True

Source code in src/deriva_ml/dataset/aux_classes.py
@classmethod
def parse(cls, version: str) -> "DatasetVersion":
    """Parse a PEP 440 version string into a ``DatasetVersion``.

    Args:
        version: A PEP 440 version string. Released forms like
            ``"1.2.3"`` and dev forms like ``"1.2.3.post1.dev4"`` are
            both accepted.

    Returns:
        A new ``DatasetVersion`` corresponding to the parsed string.

    Raises:
        packaging.version.InvalidVersion: If *version* is not a valid
            PEP 440 version string.

    Example:
        >>> str(DatasetVersion.parse("0.4.0"))
        '0.4.0'
        >>> DatasetVersion.parse("0.4.0.post1.dev3").is_devrelease
        True
    """
    # Version.__init__ does the parse. DatasetVersion.__init__ only takes
    # integers, so build the instance with __new__ and initialise it from
    # the normalised string via the parent initialiser.
    v = Version(version)
    instance = cls.__new__(cls)
    Version.__init__(instance, str(v))
    return instance

to_dict

to_dict() -> dict[str, int]

Serialise the release segment as a {major, minor, patch} dict.

Used by DatasetSpec's field serializer for hydra-zen round-tripping. Pre-release, post-release, dev, and local segments are not preserved in this form; it represents only the release-segment tuple. Use str(self) for a lossless serialisation.

Returns:

Type Description
dict[str, int]

A dict with integer major, minor, and patch fields.

Example

>>> DatasetVersion(1, 2, 3).to_dict()
{'major': 1, 'minor': 2, 'patch': 3}

Source code in src/deriva_ml/dataset/aux_classes.py
def to_dict(self) -> dict[str, int]:
    """Serialise the release segment as a ``{major, minor, patch}`` dict.

    Used by :class:`DatasetSpec`'s field serializer for hydra-zen
    round-tripping. Pre-release / post-release / dev / local segments
    are *not* preserved in this form — it represents only the
    release-segment tuple. Use ``str(self)`` for a lossless serialisation.

    Returns:
        A dict with integer ``major``, ``minor``, and ``patch`` fields.

    Example:
        >>> DatasetVersion(1, 2, 3).to_dict()
        {'major': 1, 'minor': 2, 'patch': 3}
    """
    return {"major": self.major, "minor": self.minor, "patch": self.patch}

VersionPart

Bases: Enum

Names the component of a dataset version to advance on release.

DerivaML uses a major.minor.patch release segment within the broader PEP 440 version space (see ADR-0004). Picking a VersionPart selects which component is incremented when a dev period is promoted to a released version.

Attributes:

Name Type Description
major

Schema-altering changes that break backward compatibility.

minor

Additive changes — new members, new feature values, new annotations.

patch

Small clean-ups and edits that don't change the dataset's shape.

Source code in src/deriva_ml/dataset/aux_classes.py
class VersionPart(Enum):
    """Names the component of a dataset version to advance on release.

    DerivaML uses a ``major.minor.patch`` release segment within the broader
    PEP 440 version space (see ADR-0004). Picking a ``VersionPart`` selects
    which component is incremented when a dev period is promoted to a
    released version.

    Attributes:
        major: Schema-altering changes that break backward compatibility.
        minor: Additive changes — new members, new feature values, new
            annotations.
        patch: Small clean-ups and edits that don't change the dataset's
            shape.
    """

    major = "major"
    minor = "minor"
    patch = "patch"