artifact_structure

`artifact_structure` ¶

Artifact directory structure for Safe Synthesizer.

Defines the on-disk layout produced by each pipeline run using a declarative descriptor pattern. FileNode and DirNode descriptors declare the tree shape on Workdir; at runtime they resolve to Path and BoundDir objects respectively, giving typed access to every artifact path without hard-coding strings throughout the CLI.

Typical directory tree:

<base_path>/<config>---<dataset>/<run_name>/
- train/  ...
- generate/  ...
- dataset/  ...

See Workdir for the full structure.

Classes:

Name	Description
`RunName`	Run name for artifact directories.
`FileNode`	Descriptor for file paths within a directory structure.
`DirNode`	Descriptor for directory paths within a directory structure.
`BoundDir`	Runtime class representing a bound directory path.
`Workdir`	Working directory structure for Safe Synthesizer artifacts.

Attributes:

Name	Type	Description
`RUN_NAME_DATE_FORMAT`		Format string for auto-generated timestamp-based run names.
`PROJECT_NAME_DELIMITER`		Delimiter used to separate config_name and dataset_name in project names.

`RUN_NAME_DATE_FORMAT = '%Y-%m-%dT%H:%M:%S'` `module-attribute` ¶

Format string for auto-generated timestamp-based run names.

`PROJECT_NAME_DELIMITER = '---'` `module-attribute` ¶

Delimiter used to separate config_name and dataset_name in project names.

Uses triple-dash to avoid ambiguity with single dashes that commonly appear in config and dataset filenames (e.g., my-config.yaml, training-data.csv).
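
To illustrate why the triple dash is unambiguous, the sketch below joins and splits a project name. `make_project_name` and `split_project_name` are hypothetical helpers written for this example; the module's own private parser (`_parse_project_name`) is not shown on this page and may behave differently.

```python
PROJECT_NAME_DELIMITER = "---"  # same value as the module attribute


def make_project_name(config_name: str, dataset_name: str) -> str:
    """Hypothetical helper: join the two stems into a project name."""
    return f"{config_name}{PROJECT_NAME_DELIMITER}{dataset_name}"


def split_project_name(project_name: str) -> tuple[str, str]:
    """Hypothetical helper: split on the first triple dash."""
    config_name, sep, dataset_name = project_name.partition(PROJECT_NAME_DELIMITER)
    if not sep:
        raise ValueError(f"No {PROJECT_NAME_DELIMITER!r} in {project_name!r}")
    return config_name, dataset_name


name = make_project_name("my-config", "training-data")
print(name)                      # my-config---training-data
print(split_project_name(name))  # ('my-config', 'training-data')
```

The single dashes inside `my-config` and `training-data` survive the round trip because only the triple dash is treated as a separator.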

`RunName(_value='', _timestamp=None)` `dataclass` ¶

Run name for artifact directories.

Supports two modes: (1) a name auto-generated from a timestamp, or (2) an arbitrary string name provided by the user (from --run-path).

Examples:

Auto-generated: "2026-01-15T12:00:00"
Explicit: "unsloth_adult_0", "my-experiment-run"

Methods:

Name	Description
`to_string`	Convert the run name to a string for use in directory names.
`from_string`	Parse a run name string into a RunName object.

Attributes:

Name	Type	Description
`is_timestamp_based`	`bool`	Whether this run name was generated from or parsed as a timestamp.
`timestamp`	`datetime | None`	Parsed timestamp, or None for non-timestamp-based run names.

`is_timestamp_based` `property` ¶

Whether this run name was generated from or parsed as a timestamp.

`timestamp` `property` ¶

Parsed timestamp, or None for non-timestamp-based run names.

`to_string()` ¶

Convert the run name to a string for use in directory names.

Source code in src/nemo_safe_synthesizer/cli/artifact_structure.py

def to_string(self) -> str:
    """Convert the run name to a string for use in directory names."""
    return self._value

`from_string(name)` `classmethod` ¶

Parse a run name string into a RunName object.

Accepts any valid string. If the string matches the timestamp format, the timestamp is also stored for potential use.

Parameters:

Name	Type	Description	Default
`name`	`str`	Run name string (e.g., "2026-01-15T12:00:00" or "unsloth_adult_0").	required

Returns:

Type	Description
`Self`	RunName with the provided name and optional parsed timestamp.

Source code in src/nemo_safe_synthesizer/cli/artifact_structure.py

@classmethod
def from_string(cls, name: str) -> Self:
    """Parse a run name string into a RunName object.

    Accepts any valid string. If the string matches the timestamp format,
    the timestamp is also stored for potential use.

    Args:
        name: Run name string (e.g., "2026-01-15T12:00:00" or "unsloth_adult_0").

    Returns:
        RunName with the provided name and optional parsed timestamp.
    """
    ts = _try_parse_timestamp(name)
    return cls(_value=name, _timestamp=ts)

`FileNode(name)` ¶

Descriptor for file paths within a directory structure.

When accessed on a class, returns the descriptor itself. When accessed on an instance, returns the full Path to the file.

Parameters:

Name	Type	Description	Default
`name`	`str`	The filename (e.g., "config.json").	required

Source code in src/nemo_safe_synthesizer/cli/artifact_structure.py

def __init__(self, name: str):
    self.name = name
    self._attr_name: str | None = None

`DirNode(name, **children)` ¶

Descriptor for directory paths within a directory structure.

Supports nested children (both FileNode and DirNode). When accessed on a class, returns the descriptor itself. When accessed on an instance, returns a BoundDir with the resolved path.

Parameters:

Name	Type	Description	Default
`name`	`str`	The directory name (e.g., "train").	required
`**children`	`FileNode | DirNode`	Child nodes (FileNode or DirNode instances).	`{}`

Source code in src/nemo_safe_synthesizer/cli/artifact_structure.py

def __init__(self, name: str, **children: FileNode | DirNode):
    self.name = name
    self.children: dict[str, FileNode | DirNode] = children
    self._attr_name: str | None = None

`BoundDir(path, children)` ¶

Bases: PathLike[str]

Runtime class representing a bound directory path.

Provides access to child FileNode and DirNode descriptors as attributes. Implements os.PathLike[str] so instances can be used wherever paths are expected.

Parameters:

Name	Type	Description	Default
`path`	`Path`	The resolved directory path.	required
`children`	`dict[str, FileNode | DirNode]`	Child nodes from the DirNode.	required

Attributes:

Name	Type	Description
`path`	`Path`	The resolved directory path.

Source code in src/nemo_safe_synthesizer/cli/artifact_structure.py

def __init__(self, path: Path, children: dict[str, FileNode | DirNode]):
    self._path = path
    self._children = children
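
Only the constructors are listed above, so the following is a simplified sketch of how the three classes could cooperate through Python's descriptor protocol. It is not the module's implementation: the `__get__` and `__getattr__` bodies, the owner class `Run`, and the assumption that the owner exposes a `path` attribute are all illustrative.

```python
import os
from pathlib import Path


class FileNode:
    """Sketch: descriptor that resolves to <owner path>/<name>."""

    def __init__(self, name: str):
        self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self  # class access returns the descriptor itself
        return Path(instance.path) / self.name


class BoundDir(os.PathLike):
    """Sketch: a resolved directory exposing its children as attributes."""

    def __init__(self, path: Path, children: dict):
        self._path = path
        self._children = children

    @property
    def path(self) -> Path:
        return self._path

    def __fspath__(self) -> str:
        return str(self._path)

    def __getattr__(self, attr: str):
        # Called only when normal lookup fails, so `path` is unaffected.
        if attr.startswith("_") or attr not in self._children:
            raise AttributeError(attr)
        child = self._children[attr]
        child_path = self._path / child.name
        if isinstance(child, DirNode):
            return BoundDir(child_path, child.children)
        return child_path


class DirNode:
    """Sketch: descriptor that resolves to a BoundDir."""

    def __init__(self, name: str, **children):
        self.name = name
        self.children = children

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        return BoundDir(Path(instance.path) / self.name, self.children)


class Run:
    """Hypothetical owner class; it only needs a `path` attribute."""

    config = FileNode("safe-synthesizer-config.json")
    train = DirNode("train", adapter=DirNode("adapter", metadata=FileNode("metadata_v2.json")))

    def __init__(self, path: Path):
        self.path = path


run = Run(Path("/tmp/run"))
print(run.config)                  # <run>/safe-synthesizer-config.json
print(run.train.adapter.metadata)  # <run>/train/adapter/metadata_v2.json
print(os.fspath(run.train))        # usable wherever a path is expected
```

The point of the pattern is that the tree is declared once, as class attributes, and every nested path is derived from it at access time.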

`path` `property` ¶

The resolved directory path.

`Workdir(base_path, config_name, dataset_name, run_name=None, _run_name_obj=RunName(), _current_phase='unknown', _parent_workdir=None, _explicit_run_path=None)` `dataclass` ¶

Working directory structure for Safe Synthesizer artifacts.

This class defines the complete directory layout and provides typed access to all paths within the structure. It uses FileNode and DirNode descriptors for declarative path definitions.

Full directory structure:

<base_path>/<config>---<dataset>/<run_name>/
- train/
  - safe-synthesizer-config.json
  - cache/
  - adapter/                     (trained PEFT adapter)
    - adapter_config.json
    - adapter_model.safetensors
    - metadata_v2.json
    - dataset_schema.json
- generate/
  - logs.jsonl                   (generate-only workflow)
  - info.json                    (generate-only workflow)
  - synthetic_data.csv
  - evaluation_report.html
  - evaluation_metrics.json      (machine-readable metrics)
- dataset/
  - training.csv
  - test.csv
  - validation.csv               (when training.validation_ratio > 0)
  - transformed_training.csv     (when PII replacement transforms the data)
- logs/
  - <phase>.jsonl                (e.g. end_to_end.jsonl or train.jsonl)

Methods:

Name	Description
`phase_dir`	Get the phase directory path.
`ensure_directories`	Create directories based on the current phase.
`new_generation_run`	Create a new Workdir for a generation run from this workdir.
`from_explicit_run_path`	Create Workdir from an explicit run path (no auto-generated nesting).
`from_path`	Load a Workdir from an existing path.

Attributes:

Name	Type	Description
`base_path`	`Path`	Root directory under which project and run directories are created.
`config_name`	`str`	Stem of the config file name, used in the project directory name.
`dataset_name`	`str`	Stem of the dataset file name, used in the project directory name.
`run_name`	`str | None`	Run name (auto-generated timestamp or explicit name from CLI).
`config`		Location for NSS config file.
`wandb_run_id_file`		Location for WandB run ID file.
`train`		Location and contents of train directory structure.
`generate`		Location and contents of generate directory structure.
`dataset`		Location and contents of dataset directory structure.
`project_name`	`str`	Project name in `<config>---<dataset>` format.
`project_dir`	`Path`	Project directory path (`<base_path>/<config>---<dataset>/`).
`run_dir`	`Path`	Run directory path (`<base_path>/<config>---<dataset>/<run_name>/`).
`log_file`	`Path`	Log file path for the current phase.
`adapter_path`	`Path`	Shortcut to train.adapter.path (adapter directory).
`metadata_file`	`Path`	Shortcut to train.adapter.metadata.
`schema_file`	`Path`	Shortcut to train.adapter.schema.
`dataset_schema_file`	`Path`	Alias for schema_file (backwards compatibility).
`output_file`	`Path`	Shortcut to generate.output.
`evaluation_report`	`Path`	Shortcut to generate.report.
`evaluation_metrics`	`Path`	Shortcut to generate.evaluation_metrics.
`source_run_dir`	`Path`	Source run directory (parent's `run_dir` for child generation runs).
`source_config`	`Path`	Source config file path (from parent workdir if available).
`source_adapter_path`	`Path`	Source adapter path (from parent workdir if available).
`source_dataset`	`BoundDir`	Source dataset directory (from parent workdir if available).
`source_schema_file`	`Path`	Source schema file path (from parent workdir if available).

`base_path` `instance-attribute` ¶

Root directory under which project and run directories are created.

`config_name` `instance-attribute` ¶

Stem of the config file name, used in the project directory name.

`dataset_name` `instance-attribute` ¶

Stem of the dataset file name, used in the project directory name.

`run_name = None` `class-attribute` `instance-attribute` ¶

Run name (auto-generated timestamp or explicit name from CLI).

When None, a timestamp-based name is generated in __post_init__.

`config = FileNode('safe-synthesizer-config.json')` `class-attribute` `instance-attribute` ¶

Location for NSS config file.

`wandb_run_id_file = FileNode('wandb_run_id.txt')` `class-attribute` `instance-attribute` ¶

Location for WandB run ID file.

`train = DirNode('train', config=(FileNode('safe-synthesizer-config.json')), cache=(DirNode('cache')), adapter=(DirNode('adapter', adapter_config=(FileNode('adapter_config.json')), metadata=(FileNode('metadata_v2.json')), schema=(FileNode('dataset_schema.json')))))` `class-attribute` `instance-attribute` ¶

Location and contents of train directory structure.

`generate = DirNode('generate', logs=(FileNode('logs.jsonl')), output=(FileNode('synthetic_data.csv')), report=(FileNode('evaluation_report.html')), evaluation_metrics=(FileNode('evaluation_metrics.json')), info=(FileNode('info.json')))` `class-attribute` `instance-attribute` ¶

Location and contents of generate directory structure.

`dataset = DirNode('dataset', training=(FileNode('training.csv')), test=(FileNode('test.csv')), validation=(FileNode('validation.csv')), transformed_training=(FileNode('transformed_training.csv')))` `class-attribute` `instance-attribute` ¶

Location and contents of dataset directory structure.

`project_name` `property` ¶

Project name in <config>---<dataset> format.

`project_dir` `property` ¶

Project directory path (<base_path>/<config>---<dataset>/).

Falls back to the parent of _explicit_run_path when one was provided.

`run_dir` `property` ¶

Run directory path (<base_path>/<config>---<dataset>/<run_name>/).

Uses _explicit_run_path directly when one is provided.

`log_file` `property` ¶

Log file path for the current phase.

`adapter_path` `property` ¶

Shortcut to train.adapter.path (adapter directory).

When this workdir has a parent (e.g., a generation run spawned from training), returns the parent's adapter path since that's where the trained adapter lives.
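
A stripped-down sketch of the parent fallback, assuming the child simply defers to its parent's property. `MiniWorkdir` is a hypothetical stand-in, not the real class.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class MiniWorkdir:
    """Hypothetical, stripped-down stand-in for Workdir."""

    run_dir: Path
    parent: Optional["MiniWorkdir"] = None

    @property
    def adapter_path(self) -> Path:
        # A child generation run has no adapter of its own; defer to the parent.
        if self.parent is not None:
            return self.parent.adapter_path
        return self.run_dir / "train" / "adapter"


trained = MiniWorkdir(Path("artifacts/proj/run_train"))
child = MiniWorkdir(Path("artifacts/proj/run_generate"), parent=trained)
print(child.adapter_path)  # points into run_train, not run_generate
```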

`metadata_file` `property` ¶

Shortcut to train.adapter.metadata.

Uses parent workdir's path when available.

`schema_file` `property` ¶

Shortcut to train.adapter.schema.

Uses parent workdir's path when available.

`dataset_schema_file` `property` ¶

Alias for schema_file (backwards compatibility).

`output_file` `property` ¶

Shortcut to generate.output.

`evaluation_report` `property` ¶

Shortcut to generate.report.

`evaluation_metrics` `property` ¶

Shortcut to generate.evaluation_metrics.

`source_run_dir` `property` ¶

Source run directory (parent's run_dir for child generation runs).

`source_config` `property` ¶

Source config file path (from parent workdir if available).

Checks multiple locations for backwards compatibility:

1. Root-level config: <run_dir>/safe-synthesizer-config.json
2. Train config: <run_dir>/train/safe-synthesizer-config.json
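
The lookup order above can be sketched as a first-match search. `find_source_config` is a hypothetical helper, and what it returns when neither file exists is an assumption, since this page does not document that case.

```python
import tempfile
from pathlib import Path


def find_source_config(run_dir: Path) -> Path:
    """Hypothetical helper: return the first config location that exists."""
    candidates = [
        run_dir / "safe-synthesizer-config.json",            # 1. root-level config
        run_dir / "train" / "safe-synthesizer-config.json",  # 2. train config
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    # Assumption: fall back to the first candidate when neither file exists.
    return candidates[0]


with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    (run_dir / "train").mkdir()
    (run_dir / "train" / "safe-synthesizer-config.json").write_text("{}")
    found = find_source_config(run_dir).relative_to(run_dir).as_posix()

print(found)  # train/safe-synthesizer-config.json
```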

`source_adapter_path` `property` ¶

Source adapter path (from parent workdir if available).

`source_dataset` `property` ¶

Source dataset directory (from parent workdir if available).

`source_schema_file` `property` ¶

Source schema file path (from parent workdir if available).

`phase_dir(phase=None)` ¶

Get the phase directory path.

Parameters:

Name	Type	Description	Default
`phase`	`str | None`	Phase name (train, generate, etc.). Defaults to _current_phase.	`None`

Returns:

Type	Description
`Path`	Path to the phase directory

Source code in src/nemo_safe_synthesizer/cli/artifact_structure.py

def phase_dir(self, phase: str | None = None) -> Path:
    """Get the phase directory path.

    Args:
        phase: Phase name (train, generate, etc.). Defaults to _current_phase.

    Returns:
        Path to the phase directory
    """
    phase = phase or self._current_phase
    return self.run_dir / phase

`ensure_directories()` ¶

Create directories based on the current phase.

- For training runs: creates the train/, generate/, and dataset/ directories.
- For generation-only runs: creates only the generate/ directory and writes info.json.

Returns:

Type	Description
`Self`	self for method chaining

Source code in src/nemo_safe_synthesizer/cli/artifact_structure.py

def ensure_directories(self) -> Self:
    """Create directories based on the current phase.

    For training runs: creates ``train/``, ``generate/``, and ``dataset/`` directories
    For generation-only runs: creates only ``generate/`` directory and writes info.json

    Returns:
        self for method chaining
    """
    self.run_dir.mkdir(parents=True, exist_ok=True)

    if self._current_phase == "generate" and self._parent_workdir is not None:
        # Generation-only run - only create generate directory
        # Train and dataset are in the parent workdir
        self.generate.path.mkdir(parents=True, exist_ok=True)  # type: ignore[union-attr]
        self._write_generation_info()
    else:
        # Training run or end-to-end - create all directories
        self.train.cache.path.mkdir(parents=True, exist_ok=True)  # type: ignore[union-attr]
        self.train.adapter.path.mkdir(parents=True, exist_ok=True)  # type: ignore[union-attr]
        self.generate.path.mkdir(parents=True, exist_ok=True)  # type: ignore[union-attr]
        self.dataset.path.mkdir(parents=True, exist_ok=True)  # type: ignore[union-attr]

    return self
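
The directory-creation logic can be tried in isolation with a temporary directory. This is a simplified sketch: `ensure_run_directories` is a hypothetical function, the `generation_only` flag stands in for the phase and parent check above, and writing the info file is omitted.

```python
import tempfile
from pathlib import Path


def ensure_run_directories(run_dir: Path, generation_only: bool) -> list[str]:
    """Hypothetical helper: create the phase directories and list what exists."""
    if generation_only:
        subdirs = ["generate"]
    else:
        subdirs = ["train/cache", "train/adapter", "generate", "dataset"]
    for sub in subdirs:
        # parents=True creates train/ on the way; exist_ok=True makes reruns harmless.
        (run_dir / sub).mkdir(parents=True, exist_ok=True)
    return sorted(p.relative_to(run_dir).as_posix() for p in run_dir.rglob("*") if p.is_dir())


with tempfile.TemporaryDirectory() as tmp:
    full = ensure_run_directories(Path(tmp) / "full", generation_only=False)
    full_again = ensure_run_directories(Path(tmp) / "full", generation_only=False)
    gen = ensure_run_directories(Path(tmp) / "gen", generation_only=True)

print(full)  # ['dataset', 'generate', 'train', 'train/adapter', 'train/cache']
print(gen)   # ['generate']
```

Calling the function twice on the same run directory leaves the tree unchanged, which is what `exist_ok=True` buys.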

`new_generation_run()` ¶

Create a new Workdir for a generation run from this workdir.

This method is used when resuming from a trained model to run generation. The new Workdir shares the same project but gets a new run_name, and references this workdir as the parent for loading config/data/adapter.

Returns:

Type	Description
`Self`	New Workdir instance with a fresh timestamp-based run_name and this workdir as parent

Source code in src/nemo_safe_synthesizer/cli/artifact_structure.py

def new_generation_run(self) -> Self:
    """Create a new Workdir for a generation run from this workdir.

    This method is used when resuming from a trained model to run generation.
    The new Workdir shares the same project but gets a new run_name, and
    references this workdir as the parent for loading config/data/adapter.

    Returns:
        New Workdir instance with a fresh timestamp-based run_name and this workdir as parent
    """
    # Create a new RunName with a fresh timestamp (auto-generated)
    new_run_name = RunName()
    logger.info(f"Created new generation run: {new_run_name.to_string()}")
    logger.info(f"Parent workdir (for adapter/config/data): {self.run_dir}")

    return self.__class__(
        base_path=self.base_path,
        config_name=self.config_name,
        dataset_name=self.dataset_name,
        run_name=new_run_name.to_string(),
        _current_phase="generate",
        _parent_workdir=self,
    )

`from_explicit_run_path(run_path, config_name, dataset_name, current_phase='unknown')` `classmethod` ¶

Create Workdir from an explicit run path (no auto-generated nesting).

Used when --run-path is provided on the CLI. The path is used directly as the run directory, without the normal <project>/<timestamp> nesting.

Parameters:

Name	Type	Description	Default
`run_path`	`Path`	Explicit path to use as the run directory	required
`config_name`	`str`	Name of the config (used for project naming)	required
`dataset_name`	`str`	Name of the dataset (used for project naming)	required
`current_phase`	`str`	The current phase (train, generate, end_to_end)	`'unknown'`

Returns:

Type	Description
`Workdir`	Workdir with run_dir set to run_path

Raises:

Type	Description
`ValueError`	If run_path already contains a trained adapter

Source code in src/nemo_safe_synthesizer/cli/artifact_structure.py

@classmethod
def from_explicit_run_path(
    cls,
    run_path: Path,
    config_name: str,
    dataset_name: str,
    current_phase: str = "unknown",
) -> Workdir:
    """Create Workdir from an explicit run path (no auto-generated nesting).

    Used when --run-path is provided on the CLI. The path is used directly
    as the run directory, without the normal <project>/<timestamp> nesting.

    Args:
        run_path: Explicit path to use as the run directory
        config_name: Name of the config (used for project naming)
        dataset_name: Name of the dataset (used for project naming)
        current_phase: The current phase (train, generate, end_to_end)

    Returns:
        Workdir with run_dir set to run_path

    Raises:
        ValueError: If run_path already contains a trained adapter
    """
    run_path = Path(run_path).resolve()

    # Check if path already contains a previous run (Option A: error)
    adapter_dir = run_path / "train" / "adapter"
    if adapter_dir.is_dir():
        adapter_files = list(adapter_dir.glob("*.safetensors"))
        if adapter_files:
            raise ValueError(
                f"--run-path '{run_path}' already contains a training run.\n"
                f"Use a different path or delete the existing run."
            )

    # For explicit paths, we store the path directly and use _explicit_run_path
    # to override the normal run_dir calculation. The base_path and run_name are
    # set for metadata purposes but won't affect the actual directory location.
    run_name = run_path.name
    base_path = run_path.parent

    logger.info(f"Using explicit run path: {run_path}")

    return cls(
        base_path=base_path,
        config_name=config_name,
        dataset_name=dataset_name,
        run_name=run_name,
        _current_phase=current_phase,
        _explicit_run_path=run_path,
    )
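
The overwrite guard at the top of this method can be exercised on its own. The sketch below repeats the same check (any `*.safetensors` file under `train/adapter/`) in a hypothetical helper, `has_trained_adapter`, and runs it against a temporary directory.

```python
import tempfile
from pathlib import Path


def has_trained_adapter(run_path: Path) -> bool:
    """Hypothetical helper: True if train/adapter/ holds a *.safetensors file."""
    adapter_dir = run_path / "train" / "adapter"
    return adapter_dir.is_dir() and any(adapter_dir.glob("*.safetensors"))


with tempfile.TemporaryDirectory() as tmp:
    run_path = Path(tmp) / "my-run"
    fresh = has_trained_adapter(run_path)  # directory does not exist yet
    (run_path / "train" / "adapter").mkdir(parents=True)
    empty = has_trained_adapter(run_path)  # adapter dir exists but is empty
    (run_path / "train" / "adapter" / "adapter_model.safetensors").touch()
    trained = has_trained_adapter(run_path)

print(fresh, empty, trained)  # False False True
```

Only the last state would trigger the ValueError, so an empty or partially created run directory can be reused.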

`from_path(path)` `classmethod` ¶

Load a Workdir from an existing path.

This method handles three scenarios:

1. Path is a run_dir (contains train/adapter/ with safetensors) - use it directly
2. Path is a project_dir - find the latest run within that project
3. Path is a base_path - find the latest run across all projects

Parameters:

Name	Type	Description	Default
`path`	`Path`	Path to run_dir, project_dir, or base_path	required

Returns:

Type	Description
`Workdir`	Workdir pointing to the existing run

Raises:

Type	Description
`ValueError`	If path doesn't exist or no valid run found

Source code in src/nemo_safe_synthesizer/cli/artifact_structure.py

@classmethod
def from_path(cls, path: Path) -> Workdir:
    """Load a Workdir from an existing path.

    This method handles three scenarios:
    1. Path is a run_dir (contains train/adapter/ with safetensors) - use it directly
    2. Path is a project_dir - find the latest run within that project
    3. Path is a base_path - find the latest run across all projects

    Args:
        path: Path to run_dir, project_dir, or base_path

    Returns:
        Workdir pointing to the existing run

    Raises:
        ValueError: If path doesn't exist or no valid run found
    """
    if not path.is_dir():
        raise ValueError(f"Invalid path: {path}")

    # Check if this is a run_dir (has train/adapter/ subdirectory with safetensors)
    train_dir = path / "train"
    adapter_dir = train_dir / "adapter" if train_dir.is_dir() else path / "adapter"

    if adapter_dir.is_dir():
        adapter_files = list(adapter_dir.glob("*.safetensors"))
        if adapter_files:
            # This is a run_dir - parse structure from path
            # Path structure: base_path/<config>---<dataset>/<run_name>
            run_name = path.name
            project_dir = path.parent
            base_path = project_dir.parent

            # Parse project name using pattern matching helper
            config_name, dataset_name = _parse_project_name(project_dir.name)

            logger.info(f"Found existing workdir at {path}")
            logger.info(f"Adapter files: {adapter_files}")

            return cls(
                base_path=base_path,
                config_name=config_name,
                dataset_name=dataset_name,
                run_name=run_name,
            )

    # Check if this is a project_dir - find the latest run with an adapter
    adapter_files = list(path.glob("*/train/adapter/*.safetensors"))
    if adapter_files:
        adapter_files.sort(key=lambda f: f.stat().st_mtime, reverse=True)
        latest_adapter = adapter_files[0]
        run_dir = latest_adapter.parent.parent.parent  # adapter file -> adapter -> train -> run_dir

        # Parse project name using pattern matching helper
        config_name, dataset_name = _parse_project_name(path.name)

        logger.info(f"Found {len(adapter_files)} runs with adapters in {path}")
        logger.info(f"Using most recent run: {run_dir}")

        return cls(
            base_path=path.parent,
            config_name=config_name,
            dataset_name=dataset_name,
            run_name=run_dir.name,
        )

    # Check if this is a base_path - find the latest run across all projects
    adapter_files = list(path.glob("*/*/train/adapter/*.safetensors"))
    if adapter_files:
        adapter_files.sort(key=lambda f: f.stat().st_mtime, reverse=True)
        latest_adapter = adapter_files[0]
        # adapter file -> adapter -> train -> run_dir -> project_dir
        run_dir = latest_adapter.parent.parent.parent
        project_dir = run_dir.parent

        # Parse project name using pattern matching helper
        config_name, dataset_name = _parse_project_name(project_dir.name)

        logger.info(f"Found {len(adapter_files)} runs with adapters across all projects in {path}")
        logger.info(f"Using most recent run: {run_dir}")

        return cls(
            base_path=path,
            config_name=config_name,
            dataset_name=dataset_name,
            run_name=run_dir.name,
        )

    raise ValueError(f"No valid run found in {path}")
