datasets
Dataset loading utilities for the CLI.
This module provides the DatasetInfo and DatasetRegistry models for loading datasets from URLs or file paths, either directly or by registered name.
Classes:
| Name | Description |
|---|---|
| DatasetInfo | Entry in the dataset registry. |
| DatasetRegistry | Registry of datasets for easy reference by name. |
DatasetInfo
pydantic-model
Bases: BaseModel
Entry in the dataset registry.
Fields:
- name (str)
- url (str)
- overrides (dict[str, Any] | None)
- load_args (dict[str, Any] | None)
- _registry (DatasetRegistry | None)
name
pydantic-field
Short name of the dataset. Used to fetch the dataset from the registry by name.
url
pydantic-field
URL or path to the dataset. If a relative path, it is joined with the base_url from the registry if present.
overrides = None
pydantic-field
Config overrides for this dataset. These overrides take precedence over the values from the config file in the CLI, but are themselves overridden by any CLI args specifying config parameters.
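The precedence described above (config file, then dataset overrides, then CLI args) can be sketched as a layered dictionary merge. This is a minimal illustration, not the CLI's actual merge logic: `resolve_config` and the config keys are hypothetical, and a flat (non-nested) merge is assumed.

```python
from __future__ import annotations

from typing import Any


def resolve_config(
    file_config: dict[str, Any],
    dataset_overrides: dict[str, Any] | None,
    cli_args: dict[str, Any],
) -> dict[str, Any]:
    """Merge three config layers; later layers win on key conflicts.

    Hypothetical helper: assumes flat dictionaries, not nested configs.
    """
    merged = dict(file_config)               # lowest precedence: config file
    merged.update(dataset_overrides or {})   # dataset overrides beat the file
    merged.update(cli_args)                  # explicit CLI args beat everything
    return merged


# Hypothetical config keys, for illustration only.
file_config = {"epochs": 3, "batch_size": 16}
overrides = {"batch_size": 8, "max_rows": 1000}
cli_args = {"max_rows": 500}

resolved = resolve_config(file_config, overrides, cli_args)
```

Here `batch_size` comes from the dataset overrides, `max_rows` from the CLI args, and `epochs` falls through from the config file.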
load_args = None
pydantic-field
Extra arguments needed by the data reader for this dataset.
get_url()
Get url for the dataset with base_url from registry added if appropriate.
If self.url is a relative path, it is joined with the base_url from the registry if present. Otherwise, self.url is returned as is.
Returns:
| Type | Description |
|---|---|
| str | The realized url for the dataset. |
Source code in src/nemo_safe_synthesizer/cli/datasets.py
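As a rough sketch of the resolution rule described for get_url, the standalone function below joins a relative path onto a base URL and returns everything else unchanged. The name `resolve_url`, the test for "relative" (no URL scheme and no leading slash), and the slash-joining strategy are all assumptions made for illustration; the module's own implementation may differ.

```python
from __future__ import annotations

from urllib.parse import urlparse


def resolve_url(url: str, base_url: str | None) -> str:
    """Return url joined onto base_url when url is relative (hypothetical helper)."""
    # Assumption: "relative" means no scheme (http, s3, ...) and no leading slash.
    is_relative = not urlparse(url).scheme and not url.startswith("/")
    if base_url and is_relative:
        # Join with exactly one slash between base and path.
        return base_url.rstrip("/") + "/" + url
    # Absolute paths, full URLs, or no base_url: return as is.
    return url


# example.com URLs are placeholders.
joined = resolve_url("tabular/adult.csv", "https://example.com/data/")
untouched = resolve_url("https://other.example/x.csv", "https://example.com/data")
```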
fetch()
Fetch the dataset and return a pandas DataFrame.
Infers the file format from the URL extension and merges any load_args on top of per-format defaults.
Returns:
| Type | Description |
|---|---|
| DataFrame | The dataset as a DataFrame. |
Raises:
| Type | Description |
|---|---|
| ValueError | If the file extension is not supported. |
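The format inference and argument merging that fetch describes might look roughly like the sketch below. It only plans the read (which pandas reader to call and with which keyword arguments) so that it runs without pandas installed. The `READERS` table, its default arguments, the set of supported extensions, and the `plan_read` name are assumptions, not the module's actual values.

```python
from __future__ import annotations

from pathlib import PurePosixPath
from typing import Any
from urllib.parse import urlparse

# Hypothetical mapping: extension -> (pandas reader name, per-format defaults).
READERS: dict[str, tuple[str, dict[str, Any]]] = {
    ".csv": ("read_csv", {}),
    ".json": ("read_json", {}),
    ".jsonl": ("read_json", {"lines": True}),
    ".parquet": ("read_parquet", {}),
}


def plan_read(url: str, load_args: dict[str, Any] | None = None) -> tuple[str, dict[str, Any]]:
    """Pick a reader from the URL extension and merge load_args over its defaults."""
    # Use only the path component so query strings do not hide the extension.
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"Unsupported file extension: {suffix!r}")
    reader_name, defaults = READERS[suffix]
    # load_args take precedence over the per-format defaults.
    return reader_name, {**defaults, **(load_args or {})}


reader, kwargs = plan_read("data/train.jsonl")
```

With pandas available, the plan could then be executed as `getattr(pandas, reader)(url, **kwargs)`.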
DatasetRegistry(**data)
pydantic-model
Bases: BaseModel
Registry of datasets for easy reference by name.
Datasets can be looked up by name via get_dataset. If the name is not in the registry, a new DatasetInfo is created on the fly, treating the name as a literal URL or path.
When constructed, the DatasetRegistry automatically adds a back-reference to each entry in self.datasets so that the DatasetInfo instances can resolve base_url.
Fields:
- datasets (list[DatasetInfo])
- base_url (str | None)
datasets
pydantic-field
List of datasets in the registry.
base_url = None
pydantic-field
Base URL for the registry. The base_url is prepended to any relative path before the dataset is loaded. This applies only to datasets in the registry whose url is relative.
get_dataset(url)
Look up a dataset by name, creating an ad-hoc entry if not found.
When url matches a registered name the corresponding entry is
returned. Otherwise a new DatasetInfo is created with the raw
url as both name and path (without a registry back-reference, so
relative paths resolve against the working directory).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| url | str | Dataset name, URL, or file path. | required |
Returns:
| Type | Description |
|---|---|
| DatasetInfo | Matching or newly created DatasetInfo. |
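The lookup-with-fallback behavior, together with the back-references set at construction time, can be illustrated with plain dataclasses standing in for the pydantic models. `Entry` and `Registry` are hypothetical stand-ins that mirror only the behavior described here; they are not the real DatasetInfo and DatasetRegistry classes.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Entry:
    """Stand-in for DatasetInfo (hypothetical, illustration only)."""
    name: str
    url: str
    # Back-reference to the owning registry; excluded from repr and equality.
    registry: Registry | None = field(default=None, repr=False, compare=False)


@dataclass
class Registry:
    """Stand-in for DatasetRegistry (hypothetical, illustration only)."""
    datasets: list[Entry] = field(default_factory=list)
    base_url: str | None = None

    def __post_init__(self) -> None:
        # Give every registered entry a back-reference so it can see base_url.
        for entry in self.datasets:
            entry.registry = self

    def get_dataset(self, url: str) -> Entry:
        # A registered name wins.
        for entry in self.datasets:
            if entry.name == url:
                return entry
        # Otherwise build an ad-hoc entry with no back-reference.
        return Entry(name=url, url=url)


# example.com is a placeholder base URL.
reg = Registry(
    datasets=[Entry(name="adult", url="tabular/adult.csv")],
    base_url="https://example.com/data",
)
hit = reg.get_dataset("adult")            # registered entry
miss = reg.get_dataset("local/file.csv")  # ad-hoc entry
```

Because `miss` carries no registry back-reference, its relative path would not be joined with `base_url`, which matches the working-directory behavior described above.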
from_yaml(path)
classmethod
Load a DatasetRegistry from a YAML file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| Path | Path to the YAML file. | required |
Returns:
| Type | Description |
|---|---|
| Self | Parsed registry with back-references set on each dataset. |
Raises:
| Type | Description |
|---|---|
FileNotFoundError
|
If |