datasets

Dataset loading utilities for the CLI.

This module provides the DatasetInfo and DatasetRegistry pydantic models for loading datasets from URLs or file paths.

Classes:

Name             Description
DatasetInfo      Entry in the dataset registry.
DatasetRegistry  Registry of datasets for easy reference by name.

DatasetInfo pydantic-model

Bases: BaseModel

Entry in the dataset registry.

Fields:

name pydantic-field

Short name of the dataset. Used to fetch the dataset from the registry by name.

url pydantic-field

URL or path to the dataset. If it is a relative path, it is joined with the registry's base_url, if one is set.

overrides = None pydantic-field

Config overrides for this dataset. These overrides take precedence over the values from the config file in the CLI, but are themselves overridden by any CLI args specifying config parameters.

load_args = None pydantic-field

Extra arguments needed by the data reader for this dataset.
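The precedence described for overrides can be pictured as a layered merge: config file values first, then dataset overrides, then CLI arguments. The following is a minimal sketch that assumes flat dictionaries. `resolve_config` and the parameter names (`epochs`, `batch_size`, `seed`) are hypothetical and not part of this module, and the real CLI may merge nested configs differently.

```python
from __future__ import annotations

from typing import Any


def resolve_config(
    file_config: dict[str, Any],
    dataset_overrides: dict[str, Any] | None,
    cli_args: dict[str, Any] | None,
) -> dict[str, Any]:
    """Hypothetical helper: later layers win over earlier ones."""
    merged = dict(file_config)              # lowest precedence: config file
    merged.update(dataset_overrides or {})  # dataset overrides beat the file
    merged.update(cli_args or {})           # CLI args beat everything
    return merged


# Hypothetical parameter names, chosen only for illustration.
file_config = {"epochs": 10, "batch_size": 32, "seed": 0}
dataset_overrides = {"epochs": 5, "batch_size": 16}
cli_args = {"epochs": 1}

resolved = resolve_config(file_config, dataset_overrides, cli_args)
print(resolved)  # {'epochs': 1, 'batch_size': 16, 'seed': 0}
```

Here `epochs` comes from the CLI, `batch_size` from the dataset overrides, and `seed` from the config file.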

get_url()

Get url for the dataset with base_url from registry added if appropriate.

http(s) URLs are always returned unchanged. If self.url is a relative path and the registry defines a base_url, the two are joined. Otherwise, self.url is returned as is.

Returns:

Type  Description
str   The realized url for the dataset.

Source code in src/nemo_safe_synthesizer/cli/datasets.py
def get_url(self) -> str:
    """Get url for the dataset with base_url from registry added if appropriate.

    If self.url is a relative path, it is joined with the base_url from the registry if present.
    Otherwise, self.url is returned as is.

    Returns:
        The realized url for the dataset.
    """
    # pathlib.Path will collapse double slashes, so we need to check for
    # http(s) urls explicitly and not always use Path. For this reason we
    # also return a string instead of a Path object.
    if self.url.startswith(("http://", "https://")):
        # URLs are absolute, so we return them as is.
        return self.url

    url = Path(self.url)
    if self._registry and self._registry.base_url and not url.is_absolute():
        if self._registry.base_url.startswith(("http://", "https://")):
            return self._registry.base_url.rstrip("/") + "/" + str(url)
        else:
            return str(Path(self._registry.base_url) / url)

    return self.url
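To make the resolution rules easier to try out, here is a standalone sketch of the same logic. `resolve_url` is a hypothetical free function that takes `base_url` as an argument instead of reading it from a registry back-reference, and the example URLs are placeholders.

```python
from __future__ import annotations

from pathlib import Path, PurePosixPath


def resolve_url(url: str, base_url: str | None = None) -> str:
    """Hypothetical standalone version of the resolution rules in get_url."""
    if url.startswith(("http://", "https://")):
        return url  # http(s) URLs are absolute, so base_url is ignored
    path = Path(url)
    if base_url and not path.is_absolute():
        if base_url.startswith(("http://", "https://")):
            # Plain string join, so the "//" after the scheme survives.
            return base_url.rstrip("/") + "/" + str(path)
        return str(Path(base_url) / path)
    return url


# Why URLs never go through Path: repeated slashes are collapsed.
print(str(PurePosixPath("https://example.com/a.csv")))  # https:/example.com/a.csv

print(resolve_url("adult.csv", "https://example.com/datasets/"))
# https://example.com/datasets/adult.csv
print(resolve_url("https://other.org/a.csv", "https://example.com/datasets/"))
# https://other.org/a.csv
print(resolve_url("adult.csv"))  # adult.csv
```

With a local `base_url` such as `/mnt/registry`, the relative path is joined using `pathlib`, so the separator in the result follows the host platform.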

fetch()

Fetch the dataset and return a pandas DataFrame.

Infers the file format from the URL extension (csv, txt, json, jsonl, or parquet) and merges any load_args on top of per-format defaults.

Returns:

Type       Description
DataFrame  The dataset as a DataFrame.

Raises:

Type        Description
ValueError  If the file extension is not supported.

Source code in src/nemo_safe_synthesizer/cli/datasets.py
def fetch(self) -> pd.DataFrame:
    """Fetch the dataset and return a pandas DataFrame.

    Infers the file format from the URL extension and merges any ``load_args`` on top of per-format defaults.

    Returns:
        The dataset as a DataFrame.

    Raises:
        ValueError: If the file extension is not supported.
    """
    url = self.get_url()

    logger.info(f"Reading dataset from {url}")

    # Determine the file extension and appropriate reader
    match Path(url).suffix.lstrip("."):
        case "csv" | "txt":
            reader = pd.read_csv
            default_load_args: dict[str, Any] = {}
        case "json":
            reader = pd.read_json
            default_load_args = {}
        case "jsonl":
            reader = pd.read_json
            default_load_args = {"lines": True}
        case "parquet":
            reader = pd.read_parquet
            default_load_args = {}
        case extension:
            if not extension:
                extension = f"<no extension found on url '{url}'>"
            raise ValueError(f"Unsupported file extension: {extension}")

    # Merge load args: user-provided args override defaults
    final_load_args = {**default_load_args, **(self.load_args or {})}

    try:
        return reader(url, **final_load_args)
    except Exception as e:
        logger.error(f"Error reading dataset from {url}: {e}", exc_info=True)
        raise
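The dispatch and merge steps in fetch can be explored without pandas. The sketch below is table-driven rather than using `match`, and `plan_read` and `DEFAULT_LOAD_ARGS` are hypothetical names. It returns the detected extension and the final keyword arguments instead of calling a reader.

```python
from __future__ import annotations

from pathlib import Path
from typing import Any

# Per-format defaults, mirroring the cases in fetch().
DEFAULT_LOAD_ARGS: dict[str, dict[str, Any]] = {
    "csv": {},
    "txt": {},
    "json": {},
    "jsonl": {"lines": True},
    "parquet": {},
}


def plan_read(url: str, load_args: dict[str, Any] | None = None) -> tuple[str, dict[str, Any]]:
    """Hypothetical helper: return (extension, merged reader kwargs)."""
    extension = Path(url).suffix.lstrip(".")
    if extension not in DEFAULT_LOAD_ARGS:
        shown = extension or f"<no extension found on url '{url}'>"
        raise ValueError(f"Unsupported file extension: {shown}")
    # User-provided args override the per-format defaults.
    return extension, {**DEFAULT_LOAD_ARGS[extension], **(load_args or {})}


print(plan_read("data/events.jsonl", {"nrows": 100}))
# ('jsonl', {'lines': True, 'nrows': 100})
print(plan_read("data/events.jsonl", {"lines": False}))
# ('jsonl', {'lines': False})

# The suffix match is literal, so these are rejected:
for bad in ("data/adult.CSV", "https://example.com/a.csv?token=1", "data/adult"):
    try:
        plan_read(bad)
    except ValueError as err:
        print(err)
```

Because the suffix is compared literally, an uppercase extension or a URL with a query string after the file name falls through to the unsupported case.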

DatasetRegistry(**data) pydantic-model

Bases: BaseModel

Registry of datasets for easy reference by name.

Datasets can be looked up by name via get_dataset. If the name is not in the registry, a new DatasetInfo is created on the fly, treating the name as a literal URL or path.

When constructed, the DatasetRegistry automatically adds back-references to each entry in self.datasets so the DatasetInfo instances can resolve base_url.

Fields:

Source code in src/nemo_safe_synthesizer/cli/datasets.py
def __init__(self, **data):
    super().__init__(**data)
    for dataset in self.datasets:
        dataset._registry = self

datasets pydantic-field

List of datasets in the registry.

base_url = None pydantic-field

Base URL for the registry. Relative dataset paths are joined onto base_url before the dataset is loaded. This applies only to registry entries whose url is relative; absolute paths and http(s) URLs are used as is.

get_dataset(url)

Look up a dataset by name, creating an ad-hoc entry if not found.

When url matches a registered name, the corresponding entry is returned. Otherwise, a new DatasetInfo is created with the raw url as both name and path, and it is appended to the registry's datasets list. The new entry gets no registry back-reference, so its relative paths resolve against the working directory rather than base_url.

Parameters:

Name  Type  Description                       Default
url   str   Dataset name, URL, or file path.  required

Returns:

Type         Description
DatasetInfo  Matching or newly created DatasetInfo.

Source code in src/nemo_safe_synthesizer/cli/datasets.py
def get_dataset(self, url: str) -> DatasetInfo:
    """Look up a dataset by name, creating an ad-hoc entry if not found.

    When ``url`` matches a registered name the corresponding entry is
    returned. Otherwise a new ``DatasetInfo`` is created with the raw
    ``url`` as both name and path (without a registry back-reference, so
    relative paths resolve against the working directory).

    Args:
        url: Dataset name, URL, or file path.

    Returns:
        Matching or newly created ``DatasetInfo``.
    """
    for dataset in self.datasets:
        if dataset.name == url:
            return dataset

    # If the dataset is not already in the registry, create and add a new
    # DatasetInfo object to the registry. Deliberately do NOT set the
    # registry reference for this new dataset, this ensures its relative
    # paths resolve against the current working directory, not the
    # registry's base_url (if set).
    new_dataset = DatasetInfo(name=url, url=url, overrides=None, load_args=None)
    self.datasets.append(new_dataset)
    return new_dataset
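The lookup and fallback behavior, together with the back-references set at construction time, can be sketched with plain dataclasses. `Entry` and `Registry` are hypothetical stand-ins for the pydantic models, and the dataset names and URLs are placeholders.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Entry:  # hypothetical stand-in for DatasetInfo
    name: str
    url: str
    # Back-reference to the owning registry; None for ad-hoc entries.
    registry: Registry | None = field(default=None, repr=False, compare=False)


@dataclass
class Registry:  # hypothetical stand-in for DatasetRegistry
    datasets: list[Entry] = field(default_factory=list)
    base_url: str | None = None

    def __post_init__(self) -> None:
        # Entries present at construction time get a back-reference.
        for dataset in self.datasets:
            dataset.registry = self

    def get_dataset(self, url: str) -> Entry:
        for dataset in self.datasets:
            if dataset.name == url:
                return dataset
        # Unknown name: treat it as a literal URL or path. The entry is
        # stored, but deliberately left without a back-reference.
        new_dataset = Entry(name=url, url=url)
        self.datasets.append(new_dataset)
        return new_dataset


registry = Registry(
    datasets=[Entry(name="adult", url="adult.csv")],
    base_url="https://example.com/datasets",
)

known = registry.get_dataset("adult")
adhoc = registry.get_dataset("local/my.csv")

print(known.registry is registry)  # True
print(adhoc.registry is None)      # True
print(len(registry.datasets))      # 2
print(registry.get_dataset("local/my.csv") is adhoc)  # True
```

A second lookup of the same ad-hoc path returns the stored entry, because it was appended under that path as its name.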

from_yaml(path) classmethod

Load a DatasetRegistry from a YAML file.

Parameters:

Name  Type        Description             Default
path  str | Path  Path to the YAML file.  required

Returns:

Type  Description
Self  Parsed registry with back-references set on each dataset.

Raises:

Type               Description
FileNotFoundError  If path does not exist.

Source code in src/nemo_safe_synthesizer/cli/datasets.py
@classmethod
def from_yaml(cls, path: str | Path) -> Self:
    """Load a ``DatasetRegistry`` from a YAML file.

    Args:
        path: Path to the YAML file.

    Returns:
        Parsed registry with back-references set on each dataset.

    Raises:
        FileNotFoundError: If ``path`` does not exist.
    """
    if not Path(path).exists():
        raise FileNotFoundError(f"File {path} does not exist")
    with open(path, "r") as f:
        data = yaml.safe_load(f)
    return cls.model_validate(data)
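As a rough picture of what from_yaml validates, the sketch below shows a hypothetical registry file in a comment, followed by the dictionary that `yaml.safe_load` would return for it. The field names come from the models documented here, while the dataset names, URLs, and the `nrows` load argument are made up for illustration.

```python
# Hypothetical registry file (for example, datasets.yaml):
#
#   base_url: https://example.com/datasets
#   datasets:
#     - name: adult
#       url: adult.csv
#     - name: events
#       url: logs/events.jsonl
#       load_args:
#         nrows: 1000
#
# Parsed form, as it would be handed to model_validate:
data = {
    "base_url": "https://example.com/datasets",
    "datasets": [
        {"name": "adult", "url": "adult.csv"},
        {
            "name": "events",
            "url": "logs/events.jsonl",
            "load_args": {"nrows": 1000},
        },
    ],
}

# Names usable for lookup via get_dataset once the registry is built.
names = [entry["name"] for entry in data["datasets"]]
print(names)  # ['adult', 'events']
```

Both entries use relative urls, so they would resolve against `base_url` once loaded into a registry.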