artifact_structure

Artifact directory structure for Safe Synthesizer.

Defines the on-disk layout produced by each pipeline run using a declarative descriptor pattern. FileNode and DirNode descriptors declare the tree shape on Workdir; at runtime they resolve to Path and BoundDir objects respectively, giving typed access to every artifact path without hard-coding strings throughout the CLI.

Typical directory tree:

<base_path>/<config>---<dataset>/<run_name>/
- train/  ...
- generate/  ...
- dataset/  ...

See Workdir for the full structure.

Classes:

Name Description
RunName

Run name for artifact directories.

FileNode

Descriptor for file paths within a directory structure.

DirNode

Descriptor for directory paths within a directory structure.

BoundDir

Runtime class representing a bound directory path.

Workdir

Working directory structure for Safe Synthesizer artifacts.

Attributes:

Name Type Description
RUN_NAME_DATE_FORMAT

Format string for auto-generated timestamp-based run names.

PROJECT_NAME_DELIMITER

Delimiter used to separate config_name and dataset_name in project names.

RUN_NAME_DATE_FORMAT = '%Y-%m-%dT%H:%M:%S' module-attribute

Format string for auto-generated timestamp-based run names.

PROJECT_NAME_DELIMITER = '---' module-attribute

Delimiter used to separate config_name and dataset_name in project names.

Uses triple-dash to avoid ambiguity with single dashes that commonly appear in config and dataset filenames (e.g., my-config.yaml, training-data.csv).
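
As a rough illustration of why the triple-dash is unambiguous, the sketch below joins and splits a project name. The helper names `make_project_name` and `split_project_name` are hypothetical and only stand in for whatever the module does internally; the actual parser may handle edge cases differently.

```python
PROJECT_NAME_DELIMITER = "---"  # value documented above


def make_project_name(config_name: str, dataset_name: str) -> str:
    """Join the two stems into a <config>---<dataset> project name."""
    return f"{config_name}{PROJECT_NAME_DELIMITER}{dataset_name}"


def split_project_name(project_name: str) -> tuple[str, str]:
    """Hypothetical inverse; the module's own parser may behave differently."""
    config_name, sep, dataset_name = project_name.partition(PROJECT_NAME_DELIMITER)
    if not sep:
        raise ValueError(f"Missing {PROJECT_NAME_DELIMITER!r} in {project_name!r}")
    return config_name, dataset_name


# Single dashes inside either stem survive the round trip.
name = make_project_name("my-config", "training-data")
parts = split_project_name(name)
```

Here `name` is `my-config---training-data`, and splitting it recovers both stems with their single dashes intact.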

RunName(_value='', _timestamp=None) dataclass

Run name for artifact directories.

Supports two modes: (1) a name auto-generated from a timestamp, or (2) an arbitrary name provided by the user (from --run-path).

Examples:

  • Auto-generated: "2026-01-15T12:00:00"
  • Explicit: "unsloth_adult_0", "my-experiment-run"

Methods:

Name Description
to_string

Convert the run name to a string for use in directory names.

from_string

Parse a run name string into a RunName object.

Attributes:

Name Type Description
is_timestamp_based bool

Whether this run name was generated from or parsed as a timestamp.

timestamp datetime | None

Parsed timestamp, or None for non-timestamp-based run names.

is_timestamp_based property

Whether this run name was generated from or parsed as a timestamp.

timestamp property

Parsed timestamp, or None for non-timestamp-based run names.

to_string()

Convert the run name to a string for use in directory names.

Source code in src/nemo_safe_synthesizer/cli/artifact_structure.py
def to_string(self) -> str:
    """Convert the run name to a string for use in directory names."""
    return self._value

from_string(name) classmethod

Parse a run name string into a RunName object.

Accepts any valid string. If the string matches the timestamp format, the timestamp is also stored for potential use.

Parameters:

Name Type Description Default
name str

Run name string (e.g., "2026-01-15T12:00:00" or "unsloth_adult_0").

required

Returns:

Type Description
Self

RunName with the provided name and optional parsed timestamp.

Source code in src/nemo_safe_synthesizer/cli/artifact_structure.py
@classmethod
def from_string(cls, name: str) -> Self:
    """Parse a run name string into a RunName object.

    Accepts any valid string. If the string matches the timestamp format,
    the timestamp is also stored for potential use.

    Args:
        name: Run name string (e.g., "2026-01-15T12:00:00" or "unsloth_adult_0").

    Returns:
        RunName with the provided name and optional parsed timestamp.
    """
    ts = _try_parse_timestamp(name)
    return cls(_value=name, _timestamp=ts)
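
The two modes can be illustrated with a small stand-in. This is a sketch under assumptions: `RunNameSketch` and `try_parse_timestamp` are hypothetical names, and the private `_try_parse_timestamp` helper is assumed to behave roughly like a `strptime` call against `RUN_NAME_DATE_FORMAT`.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

RUN_NAME_DATE_FORMAT = "%Y-%m-%dT%H:%M:%S"  # value documented above


def try_parse_timestamp(name: str) -> Optional[datetime]:
    """Hypothetical stand-in for the private _try_parse_timestamp helper."""
    try:
        return datetime.strptime(name, RUN_NAME_DATE_FORMAT)
    except ValueError:
        return None


@dataclass
class RunNameSketch:
    """Illustrative only; not the module's RunName class."""

    value: str
    timestamp: Optional[datetime] = None

    @classmethod
    def from_string(cls, name: str) -> "RunNameSketch":
        # Any string is accepted; the timestamp is kept only when it parses.
        return cls(value=name, timestamp=try_parse_timestamp(name))

    @property
    def is_timestamp_based(self) -> bool:
        return self.timestamp is not None


auto = RunNameSketch.from_string("2026-01-15T12:00:00")
explicit = RunNameSketch.from_string("unsloth_adult_0")
```

In this sketch `auto.is_timestamp_based` is true and `explicit.timestamp` is `None`, mirroring the attribute descriptions above.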

FileNode(name)

Descriptor for file paths within a directory structure.

When accessed on a class, returns the descriptor itself. When accessed on an instance, returns the full Path to the file.

Parameters:

Name Type Description Default
name str

The filename (e.g., "config.json").

required

Source code in src/nemo_safe_synthesizer/cli/artifact_structure.py
def __init__(self, name: str):
    self.name = name
    self._attr_name: str | None = None
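
The class-versus-instance behaviour described above follows Python's descriptor protocol. The sketch below is a minimal illustration, not the module's implementation: `FileNodeSketch` and `RunDirSketch` are hypothetical, and it assumes the owning instance exposes its directory as a `path` attribute.

```python
from pathlib import Path


class FileNodeSketch:
    """Illustrative file descriptor; not the module's actual FileNode."""

    def __init__(self, name: str):
        self.name = name
        self._attr_name = None

    def __set_name__(self, owner, attr_name):
        # Python calls this once, when the owning class is created.
        self._attr_name = attr_name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self  # class access: hand back the descriptor itself
        # Assumption: the owning instance exposes its directory as `path`.
        return Path(instance.path) / self.name


class RunDirSketch:
    config = FileNodeSketch("config.json")

    def __init__(self, path: Path):
        self.path = path


run = RunDirSketch(Path("artifacts") / "run-1")
config_path = run.config  # instance access resolves to a full Path
```

Accessing `RunDirSketch.config` returns the descriptor, while `run.config` returns `artifacts/run-1/config.json` as a `Path`.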

DirNode(name, **children)

Descriptor for directory paths within a directory structure.

Supports nested children (both FileNode and DirNode). When accessed on a class, returns the descriptor itself. When accessed on an instance, returns a BoundDir with the resolved path.

Parameters:

Name Type Description Default
name str

The directory name (e.g., "train").

required
**children FileNode | DirNode

Child nodes (FileNode or DirNode instances).

{}

Source code in src/nemo_safe_synthesizer/cli/artifact_structure.py
def __init__(self, name: str, **children: FileNode | DirNode):
    self.name = name
    self.children: dict[str, FileNode | DirNode] = children
    self._attr_name: str | None = None

BoundDir(path, children)

Bases: PathLike[str]

Runtime class representing a bound directory path.

Provides access to child FileNode and DirNode descriptors as attributes. Implements os.PathLike[str] so instances can be used wherever paths are expected.

Parameters:

Name Type Description Default
path Path

The resolved directory path.

required
children dict[str, FileNode | DirNode]

Child nodes from the DirNode.

required

Attributes:

Name Type Description
path Path

The resolved directory path.

Source code in src/nemo_safe_synthesizer/cli/artifact_structure.py
def __init__(self, path: Path, children: dict[str, FileNode | DirNode]):
    self._path = path
    self._children = children

path property

The resolved directory path.
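
One way a bound directory could expose its children as attributes while still acting as a path is sketched below. All class names here (`FileSketch`, `DirSketch`, `BoundDirSketch`) are hypothetical, and the attribute lookup via `__getattr__` is an assumption about the approach rather than a description of the real BoundDir.

```python
import os
from pathlib import Path


class FileSketch:
    def __init__(self, name: str):
        self.name = name


class DirSketch:
    def __init__(self, name: str, **children):
        self.name = name
        self.children = children


class BoundDirSketch(os.PathLike):
    """Illustrative only; resolves child nodes relative to a concrete path."""

    def __init__(self, path: Path, children: dict):
        self._path = path
        self._children = children

    @property
    def path(self) -> Path:
        return self._path

    def __fspath__(self) -> str:
        # Lets os.fspath(), open(), os.path.join() etc. accept this object.
        return str(self._path)

    def __getattr__(self, attr: str):
        # Only reached when normal attribute lookup fails.
        child = self.__dict__.get("_children", {}).get(attr)
        if child is None:
            raise AttributeError(attr)
        child_path = self._path / child.name
        if isinstance(child, DirSketch):
            return BoundDirSketch(child_path, child.children)
        return child_path


adapter = DirSketch("adapter", metadata=FileSketch("metadata_v2.json"))
train = BoundDirSketch(Path("run") / "train", {"adapter": adapter})
metadata_path = train.adapter.metadata
train_as_str = os.fspath(train)
```

Chained access such as `train.adapter.metadata` yields a plain `Path`, and `os.fspath(train)` works because the class implements `__fspath__`.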

Workdir(base_path, config_name, dataset_name, run_name=None, _run_name_obj=RunName(), _current_phase='unknown', _parent_workdir=None, _explicit_run_path=None) dataclass

Working directory structure for Safe Synthesizer artifacts.

This class defines the complete directory layout and provides typed access to all paths within the structure. It uses FileNode and DirNode descriptors for declarative path definitions.

Full directory structure:

<base_path>/<config>---<dataset>/<run_name>/
- train/
  - safe-synthesizer-config.json
  - cache/
  - adapter/                     (trained PEFT adapter)
    - adapter_config.json
    - adapter_model.safetensors
    - metadata_v2.json
    - dataset_schema.json
- generate/
  - logs.jsonl                   (generate-only workflow)
  - info.json                    (generate-only workflow)
  - synthetic_data.csv
  - evaluation_report.html
  - evaluation_metrics.json      (machine-readable metrics)
- dataset/
  - training.csv
  - test.csv
  - validation.csv               (when training.validation_ratio > 0)
  - transformed_training.csv     (when PII replacement transforms the data)
- logs/
  - <phase>.jsonl                (e.g. end_to_end.jsonl or train.jsonl)

Methods:

Name Description
phase_dir

Get the phase directory path.

ensure_directories

Create directories based on the current phase.

new_generation_run

Create a new Workdir for a generation run from this workdir.

from_explicit_run_path

Create Workdir from an explicit run path (no auto-generated nesting).

from_path

Load a Workdir from an existing path.

Attributes:

Name Type Description
base_path Path

Root directory under which project and run directories are created.

config_name str

Stem of the config file name, used in the project directory name.

dataset_name str

Stem of the dataset file name, used in the project directory name.

run_name str | None

Run name (auto-generated timestamp or explicit name from CLI).

config

Location for NSS config file.

wandb_run_id_file

Location for WandB run ID file.

train

Location and contents of train directory structure.

generate

Location and contents of generate directory structure.

dataset

Location and contents of dataset directory structure.

project_name str

Project name in <config>---<dataset> format.

project_dir Path

Project directory path (<base_path>/<config>---<dataset>/).

run_dir Path

Run directory path (<base_path>/<config>---<dataset>/<run_name>/).

log_file Path

Log file path for the current phase.

adapter_path Path

Shortcut to train.adapter.path (adapter directory).

metadata_file Path

Shortcut to train.adapter.metadata.

schema_file Path

Shortcut to train.adapter.schema.

dataset_schema_file Path

Alias for schema_file (backwards compatibility).

output_file Path

Shortcut to generate.output.

evaluation_report Path

Shortcut to generate.report.

evaluation_metrics Path

Shortcut to generate.evaluation_metrics.

source_run_dir Path

Source run directory (parent's run_dir for child generation runs).

source_config Path

Source config file path (from parent workdir if available).

source_adapter_path Path

Source adapter path (from parent workdir if available).

source_dataset BoundDir

Source dataset directory (from parent workdir if available).

source_schema_file Path

Source schema file path (from parent workdir if available).

base_path instance-attribute

Root directory under which project and run directories are created.

config_name instance-attribute

Stem of the config file name, used in the project directory name.

dataset_name instance-attribute

Stem of the dataset file name, used in the project directory name.

run_name = None class-attribute instance-attribute

Run name (auto-generated timestamp or explicit name from CLI).

When None, a timestamp-based name is generated in __post_init__.

config = FileNode('safe-synthesizer-config.json') class-attribute instance-attribute

Location for NSS config file.

wandb_run_id_file = FileNode('wandb_run_id.txt') class-attribute instance-attribute

Location for WandB run ID file.

train = DirNode('train', config=(FileNode('safe-synthesizer-config.json')), cache=(DirNode('cache')), adapter=(DirNode('adapter', adapter_config=(FileNode('adapter_config.json')), metadata=(FileNode('metadata_v2.json')), schema=(FileNode('dataset_schema.json'))))) class-attribute instance-attribute

Location and contents of train directory structure.

generate = DirNode('generate', logs=(FileNode('logs.jsonl')), output=(FileNode('synthetic_data.csv')), report=(FileNode('evaluation_report.html')), evaluation_metrics=(FileNode('evaluation_metrics.json')), info=(FileNode('info.json'))) class-attribute instance-attribute

Location and contents of generate directory structure.

dataset = DirNode('dataset', training=(FileNode('training.csv')), test=(FileNode('test.csv')), validation=(FileNode('validation.csv')), transformed_training=(FileNode('transformed_training.csv'))) class-attribute instance-attribute

Location and contents of dataset directory structure.

project_name property

Project name in <config>---<dataset> format.

project_dir property

Project directory path (<base_path>/<config>---<dataset>/).

Falls back to the parent of _explicit_run_path when one was provided.

run_dir property

Run directory path (<base_path>/<config>---<dataset>/<run_name>/).

Uses _explicit_run_path directly when one is provided.
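
The resolution rules for project_dir and run_dir can be summarised in a few lines. The function below is a hypothetical sketch of those rules as described here, not the property implementations themselves.

```python
from pathlib import Path
from typing import Optional

PROJECT_NAME_DELIMITER = "---"  # value documented above


def resolve_dirs(
    base_path: Path,
    config_name: str,
    dataset_name: str,
    run_name: str,
    explicit_run_path: Optional[Path] = None,
) -> tuple[Path, Path]:
    """Return (project_dir, run_dir) following the rules described above."""
    if explicit_run_path is not None:
        # Explicit path wins: it is the run dir, and its parent is the project dir.
        return explicit_run_path.parent, explicit_run_path
    project_dir = base_path / f"{config_name}{PROJECT_NAME_DELIMITER}{dataset_name}"
    return project_dir, project_dir / run_name


nested = resolve_dirs(Path("artifacts"), "my-config", "adult", "2026-01-15T12:00:00")
explicit = resolve_dirs(
    Path("ignored"), "my-config", "adult", "exp-1",
    explicit_run_path=Path("scratch") / "exp-1",
)
```

With no explicit path the run lands under `artifacts/my-config---adult/2026-01-15T12:00:00`; with one, `base_path` plays no part in the result.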

log_file property

Log file path for the current phase.

adapter_path property

Shortcut to train.adapter.path (adapter directory).

When this workdir has a parent (e.g., a generation run spawned from training), returns the parent's adapter path since that's where the trained adapter lives.
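
The parent fallback can be pictured with a small stand-in. `WorkdirSketch` is hypothetical, and whether the real property reads the parent's path directly or delegates recursively is not stated here, so the sketch assumes a single level of delegation.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class WorkdirSketch:
    """Illustrative only; models just the parent fallback for adapter_path."""

    run_dir: Path
    parent: Optional["WorkdirSketch"] = None

    @property
    def adapter_path(self) -> Path:
        # A child generation run has no adapter of its own, so look in the parent.
        owner = self.parent if self.parent is not None else self
        return owner.run_dir / "train" / "adapter"


training = WorkdirSketch(Path("base") / "proj" / "run-a")
generation = WorkdirSketch(Path("base") / "proj" / "run-b", parent=training)
```

Both `training.adapter_path` and `generation.adapter_path` point at `base/proj/run-a/train/adapter` in this sketch.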

metadata_file property

Shortcut to train.adapter.metadata.

Uses parent workdir's path when available.

schema_file property

Shortcut to train.adapter.schema.

Uses parent workdir's path when available.

dataset_schema_file property

Alias for schema_file (backwards compatibility).

output_file property

Shortcut to generate.output.

evaluation_report property

Shortcut to generate.report.

evaluation_metrics property

Shortcut to generate.evaluation_metrics.

source_run_dir property

Source run directory (parent's run_dir for child generation runs).

source_config property

Source config file path (from parent workdir if available).

Checks multiple locations for backwards compatibility:

1. Root level config: <run_dir>/safe-synthesizer-config.json
2. Train config: <run_dir>/train/safe-synthesizer-config.json
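
A lookup over the two candidate locations might look like the sketch below. The function name `find_source_config` is hypothetical, the search order simply follows the order listed above, and returning the root-level path when neither file exists is an assumption.

```python
import tempfile
from pathlib import Path

CONFIG_FILENAME = "safe-synthesizer-config.json"


def find_source_config(run_dir: Path) -> Path:
    """Return the first existing config among the documented locations."""
    candidates = [
        run_dir / CONFIG_FILENAME,            # root level config
        run_dir / "train" / CONFIG_FILENAME,  # train config
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    # Assumption: default to the root-level location when nothing exists yet.
    return candidates[0]


with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    (run_dir / "train").mkdir()
    (run_dir / "train" / CONFIG_FILENAME).write_text("{}")
    found_train_only = find_source_config(run_dir).relative_to(run_dir)

    (run_dir / CONFIG_FILENAME).write_text("{}")
    found_both = find_source_config(run_dir).relative_to(run_dir)
```

When only the train-level file exists it is returned; once a root-level file is present, that one takes precedence in this sketch.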

source_adapter_path property

Source adapter path (from parent workdir if available).

source_dataset property

Source dataset directory (from parent workdir if available).

source_schema_file property

Source schema file path (from parent workdir if available).

phase_dir(phase=None)

Get the phase directory path.

Parameters:

Name Type Description Default
phase str | None

Phase name (train, generate, etc.). Defaults to _current_phase.

None

Returns:

Type Description
Path

Path to the phase directory

Source code in src/nemo_safe_synthesizer/cli/artifact_structure.py
def phase_dir(self, phase: str | None = None) -> Path:
    """Get the phase directory path.

    Args:
        phase: Phase name (train, generate, etc.). Defaults to _current_phase.

    Returns:
        Path to the phase directory
    """
    phase = phase or self._current_phase
    return self.run_dir / phase

ensure_directories()

Create directories based on the current phase.

For training runs: creates the train/, generate/, and dataset/ directories.
For generation-only runs: creates only the generate/ directory and writes info.json.

Returns:

Type Description
Self

self for method chaining

Source code in src/nemo_safe_synthesizer/cli/artifact_structure.py
def ensure_directories(self) -> Self:
    """Create directories based on the current phase.

    For training runs: creates the ``train/``, ``generate/``, and ``dataset/`` directories.
    For generation-only runs: creates only the ``generate/`` directory and writes info.json.

    Returns:
        self for method chaining
    """
    self.run_dir.mkdir(parents=True, exist_ok=True)

    if self._current_phase == "generate" and self._parent_workdir is not None:
        # Generation-only run - only create generate directory
        # Train and dataset are in the parent workdir
        self.generate.path.mkdir(parents=True, exist_ok=True)  # type: ignore[union-attr]
        self._write_generation_info()
    else:
        # Training run or end-to-end - create all directories
        self.train.cache.path.mkdir(parents=True, exist_ok=True)  # type: ignore[union-attr]
        self.train.adapter.path.mkdir(parents=True, exist_ok=True)  # type: ignore[union-attr]
        self.generate.path.mkdir(parents=True, exist_ok=True)  # type: ignore[union-attr]
        self.dataset.path.mkdir(parents=True, exist_ok=True)  # type: ignore[union-attr]

    return self
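
To see what each branch leaves on disk, the following standalone sketch reproduces only the mkdir calls in a temporary directory. `create_run_tree` is a hypothetical helper; it does not write the generation info file and is not a substitute for the method above.

```python
import tempfile
from pathlib import Path


def create_run_tree(run_dir: Path, generation_only: bool) -> list[str]:
    """Create the phase directories and list every directory under run_dir."""
    if generation_only:
        subdirs = ["generate"]
    else:
        subdirs = ["train/cache", "train/adapter", "generate", "dataset"]
    for sub in subdirs:
        # parents=True also creates run_dir; exist_ok=True makes reruns harmless.
        (run_dir / sub).mkdir(parents=True, exist_ok=True)
    return sorted(
        p.relative_to(run_dir).as_posix() for p in run_dir.rglob("*") if p.is_dir()
    )


with tempfile.TemporaryDirectory() as tmp:
    full = create_run_tree(Path(tmp) / "train-run", generation_only=False)
    gen_only = create_run_tree(Path(tmp) / "gen-run", generation_only=True)
    again = create_run_tree(Path(tmp) / "gen-run", generation_only=True)
```

The second call on `gen-run` changes nothing, which is the idempotence that `exist_ok=True` provides.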

new_generation_run()

Create a new Workdir for a generation run from this workdir.

This method is used when resuming from a trained model to run generation. The new Workdir shares the same project but gets a new run_name, and references this workdir as the parent for loading config/data/adapter.

Returns:

Type Description
Self

New Workdir instance with a fresh timestamp-based run_name and this workdir as parent

Source code in src/nemo_safe_synthesizer/cli/artifact_structure.py
def new_generation_run(self) -> Self:
    """Create a new Workdir for a generation run from this workdir.

    This method is used when resuming from a trained model to run generation.
    The new Workdir shares the same project but gets a new run_name, and
    references this workdir as the parent for loading config/data/adapter.

    Returns:
        New Workdir instance with a fresh timestamp-based run_name and this workdir as parent
    """
    # Create a new RunName with a fresh timestamp (auto-generated)
    new_run_name = RunName()
    logger.info(f"Created new generation run: {new_run_name.to_string()}")
    logger.info(f"Parent workdir (for adapter/config/data): {self.run_dir}")

    return self.__class__(
        base_path=self.base_path,
        config_name=self.config_name,
        dataset_name=self.dataset_name,
        run_name=new_run_name.to_string(),
        _current_phase="generate",
        _parent_workdir=self,
    )

from_explicit_run_path(run_path, config_name, dataset_name, current_phase='unknown') classmethod

Create Workdir from an explicit run path (no auto-generated nesting).

Used when --run-path is provided on the CLI. The path is used directly as the run directory, without the normal <project>/<timestamp> nesting.

Parameters:

Name Type Description Default
run_path Path

Explicit path to use as the run directory

required
config_name str

Name of the config (used for project naming)

required
dataset_name str

Name of the dataset (used for project naming)

required
current_phase str

The current phase (train, generate, end_to_end)

'unknown'

Returns:

Type Description
Workdir

Workdir with run_dir set to run_path

Raises:

Type Description
ValueError

If run_path already contains a trained adapter

Source code in src/nemo_safe_synthesizer/cli/artifact_structure.py
@classmethod
def from_explicit_run_path(
    cls,
    run_path: Path,
    config_name: str,
    dataset_name: str,
    current_phase: str = "unknown",
) -> Workdir:
    """Create Workdir from an explicit run path (no auto-generated nesting).

    Used when --run-path is provided on the CLI. The path is used directly
    as the run directory, without the normal <project>/<timestamp> nesting.

    Args:
        run_path: Explicit path to use as the run directory
        config_name: Name of the config (used for project naming)
        dataset_name: Name of the dataset (used for project naming)
        current_phase: The current phase (train, generate, end_to_end)

    Returns:
        Workdir with run_dir set to run_path

    Raises:
        ValueError: If run_path already contains a trained adapter
    """
    run_path = Path(run_path).resolve()

    # Check if path already contains a previous run (Option A: error)
    adapter_dir = run_path / "train" / "adapter"
    if adapter_dir.is_dir():
        adapter_files = list(adapter_dir.glob("*.safetensors"))
        if adapter_files:
            raise ValueError(
                f"--run-path '{run_path}' already contains a training run.\n"
                f"Use a different path or delete the existing run."
            )

    # For explicit paths, we store the path directly and use _explicit_run_path
    # to override the normal run_dir calculation. The base_path and run_name are
    # set for metadata purposes but won't affect the actual directory location.
    run_name = run_path.name
    base_path = run_path.parent

    logger.info(f"Using explicit run path: {run_path}")

    return cls(
        base_path=base_path,
        config_name=config_name,
        dataset_name=dataset_name,
        run_name=run_name,
        _current_phase=current_phase,
        _explicit_run_path=run_path,
    )
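
The guard at the top of this method can be isolated as a small predicate. `has_trained_adapter` is a hypothetical name used here to show when the ValueError condition would trigger; the real method performs the check inline.

```python
import tempfile
from pathlib import Path


def has_trained_adapter(run_path: Path) -> bool:
    """True when run_path/train/adapter holds at least one .safetensors file."""
    adapter_dir = run_path / "train" / "adapter"
    return adapter_dir.is_dir() and any(adapter_dir.glob("*.safetensors"))


with tempfile.TemporaryDirectory() as tmp:
    run_path = Path(tmp) / "my-run"
    fresh = has_trained_adapter(run_path)  # directory does not exist yet

    adapter_dir = run_path / "train" / "adapter"
    adapter_dir.mkdir(parents=True)
    empty_dir = has_trained_adapter(run_path)  # adapter dir exists but is empty

    (adapter_dir / "adapter_model.safetensors").touch()
    trained = has_trained_adapter(run_path)  # now counts as a previous run
```

Only the last case corresponds to the situation in which the method above raises; an empty adapter directory is still accepted.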

from_path(path) classmethod

Load a Workdir from an existing path.

This method handles three scenarios:

1. Path is a run_dir (contains train/adapter/ with safetensors) - use it directly
2. Path is a project_dir - find the latest run within that project
3. Path is a base_path - find the latest run across all projects

Parameters:

Name Type Description Default
path Path

Path to run_dir, project_dir, or base_path

required

Returns:

Type Description
Workdir

Workdir pointing to the existing run

Raises:

Type Description
ValueError

If path doesn't exist or no valid run found

Source code in src/nemo_safe_synthesizer/cli/artifact_structure.py
@classmethod
def from_path(cls, path: Path) -> Workdir:
    """Load a Workdir from an existing path.

    This method handles three scenarios:
    1. Path is a run_dir (contains train/adapter/ with safetensors) - use it directly
    2. Path is a project_dir - find the latest run within that project
    3. Path is a base_path - find the latest run across all projects

    Args:
        path: Path to run_dir, project_dir, or base_path

    Returns:
        Workdir pointing to the existing run

    Raises:
        ValueError: If path doesn't exist or no valid run found
    """
    if not path.is_dir():
        raise ValueError(f"Invalid path: {path}")

    # Check if this is a run_dir (has train/adapter/ subdirectory with safetensors)
    train_dir = path / "train"
    adapter_dir = train_dir / "adapter" if train_dir.is_dir() else path / "adapter"

    if adapter_dir.is_dir():
        adapter_files = list(adapter_dir.glob("*.safetensors"))
        if adapter_files:
            # This is a run_dir - parse structure from path
            # Path structure: base_path/<config>---<dataset>/<run_name>
            run_name = path.name
            project_dir = path.parent
            base_path = project_dir.parent

            # Parse project name using pattern matching helper
            config_name, dataset_name = _parse_project_name(project_dir.name)

            logger.info(f"Found existing workdir at {path}")
            logger.info(f"Adapter files: {adapter_files}")

            return cls(
                base_path=base_path,
                config_name=config_name,
                dataset_name=dataset_name,
                run_name=run_name,
            )

    # Check if this is a project_dir - find the latest run with an adapter
    adapter_files = list(path.glob("*/train/adapter/*.safetensors"))
    if adapter_files:
        adapter_files.sort(key=lambda f: f.stat().st_mtime, reverse=True)
        latest_adapter = adapter_files[0]
        run_dir = latest_adapter.parent.parent.parent  # adapter file -> adapter -> train -> run_dir

        # Parse project name using pattern matching helper
        config_name, dataset_name = _parse_project_name(path.name)

        logger.info(f"Found {len(adapter_files)} runs with adapters in {path}")
        logger.info(f"Using most recent run: {run_dir}")

        return cls(
            base_path=path.parent,
            config_name=config_name,
            dataset_name=dataset_name,
            run_name=run_dir.name,
        )

    # Check if this is a base_path - find the latest run across all projects
    adapter_files = list(path.glob("*/*/train/adapter/*.safetensors"))
    if adapter_files:
        adapter_files.sort(key=lambda f: f.stat().st_mtime, reverse=True)
        latest_adapter = adapter_files[0]
        # adapter file -> adapter -> train -> run_dir -> project_dir
        run_dir = latest_adapter.parent.parent.parent
        project_dir = run_dir.parent

        # Parse project name using pattern matching helper
        config_name, dataset_name = _parse_project_name(project_dir.name)

        logger.info(f"Found {len(adapter_files)} runs with adapters across all projects in {path}")
        logger.info(f"Using most recent run: {run_dir}")

        return cls(
            base_path=path,
            config_name=config_name,
            dataset_name=dataset_name,
            run_name=run_dir.name,
        )

    raise ValueError(f"No valid run found in {path}")
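
The project-directory branch boils down to a glob followed by a modification-time comparison. The sketch below isolates that step with a hypothetical `latest_run_dir` helper and uses `max` in place of the sort-and-take-first shown above; file times are set explicitly so the outcome does not depend on creation order.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def latest_run_dir(project_dir: Path) -> Optional[Path]:
    """Return the run dir whose adapter file was modified most recently."""
    adapters = list(project_dir.glob("*/train/adapter/*.safetensors"))
    if not adapters:
        return None
    newest = max(adapters, key=lambda f: f.stat().st_mtime)
    # adapter file -> adapter -> train -> run_dir
    return newest.parents[2]


with tempfile.TemporaryDirectory() as tmp:
    project_dir = Path(tmp) / "my-config---training-data"
    for run_name, mtime in [("older-run", 1_000), ("newer-run", 2_000)]:
        adapter = project_dir / run_name / "train" / "adapter" / "adapter_model.safetensors"
        adapter.parent.mkdir(parents=True)
        adapter.touch()
        os.utime(adapter, (mtime, mtime))  # pin access and modification times

    latest_name = latest_run_dir(project_dir).name

    empty_project = Path(tmp) / "empty"
    empty_project.mkdir()
    none_found = latest_run_dir(empty_project)
```

Note that "latest" here means the most recently modified adapter file, not the lexically greatest run name, so touching an older adapter would change which run is selected.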