utils

`utils`

CLI utility functions for Safe Synthesizer.

This module provides utility functions for CLI commands including:

Logging initialization
Dataset loading
Configuration merging
Result saving

Functions:

| Name | Description |
| --- | --- |
| `common_setup` | Common setup for all run commands using unified CLISettings. |
| `merge_overrides` | Merge overrides into a SafeSynthesizerParameters object. |

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `CLI_NESTED_FIELD_SEPARATOR` | | Separator used to denote nested fields in CLI options. |
| `VERBOSITY_TO_LOG_LEVEL` | `dict[int, Literal['INFO', 'DEBUG', 'DEBUG_DEPENDENCIES']]` | Mapping from CLI verbosity level to log level. |

`CLI_NESTED_FIELD_SEPARATOR = '__'` `module-attribute`

Separator used to denote nested fields in CLI options.

This must match the field_separator passed to pydantic_options and the field_sep used by parse_overrides; otherwise a Click option such as --data__holdout=0.1 will not become {"data": {"holdout": 0.1}}.
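
As a rough illustration of why the separator has to agree on both sides, the sketch below expands flat option names into nested dictionaries. `nest_overrides` is a hypothetical stand-in written for this example; it is not the library's `parse_overrides`, whose actual behavior may differ.

```python
from typing import Any

SEPARATOR = "__"  # mirrors CLI_NESTED_FIELD_SEPARATOR


def nest_overrides(flat: dict[str, Any], sep: str = SEPARATOR) -> dict[str, Any]:
    """Hypothetical helper: expand {"data__holdout": 0.1} into {"data": {"holdout": 0.1}}."""
    nested: dict[str, Any] = {}
    for key, value in flat.items():
        # Everything before the last separator names a parent dict; the rest is the leaf key.
        *parents, leaf = key.split(sep)
        node = nested
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


# Matching separator: the option name is split into a nested path.
print(nest_overrides({"data__holdout": 0.1}))
# Mismatched separator: the key contains no "." so it stays flat.
print(nest_overrides({"data__holdout": 0.1}, sep="."))
```

With a mismatched separator the key is never split, so a nested field such as `holdout` would not be reached.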

`VERBOSITY_TO_LOG_LEVEL = {0: 'INFO', 1: 'DEBUG', 2: 'DEBUG_DEPENDENCIES'}` `module-attribute`

Mapping from CLI verbosity level to log level.
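
A minimal sketch of how such a mapping might be consumed, assuming verbosity is a count of repeated `-v` flags. `resolve_log_level` is a hypothetical helper, and clamping out-of-range values is an assumption of this example rather than documented behavior.

```python
from typing import Literal

LogLevel = Literal["INFO", "DEBUG", "DEBUG_DEPENDENCIES"]

# Same contents as the module attribute documented above.
VERBOSITY_TO_LOG_LEVEL: dict[int, LogLevel] = {0: "INFO", 1: "DEBUG", 2: "DEBUG_DEPENDENCIES"}


def resolve_log_level(verbosity: int) -> LogLevel:
    """Hypothetical helper: clamp verbosity to the known keys, then look it up."""
    lowest, highest = min(VERBOSITY_TO_LOG_LEVEL), max(VERBOSITY_TO_LOG_LEVEL)
    clamped = max(lowest, min(verbosity, highest))
    return VERBOSITY_TO_LOG_LEVEL[clamped]


for count in (0, 1, 2, 5):
    print(count, resolve_log_level(count))
```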

`common_setup(settings, resume=False, phase=None, auto_discover_adapter=False, wandb_resume_job_id=None, skip_wandb=False, quiet=False, run_name=None)`

Common setup for all run commands using unified CLISettings.

The setup order is:

1. Create Workdir (establishes artifact paths)
2. Initialize logging (using workdir.log_file)
3. Create DatasetRegistry from settings.dataset_registry if present, otherwise create an empty registry
4. Load dataset from registry if settings.data_source is a known name, otherwise from data_source
5. Load config with overrides from dataset overrides and command line overrides
6. Initialize wandb

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `settings` | `'CLISettings'` | Unified CLI settings (includes all config from env vars and CLI args) | required |
| `resume` | `bool` | If True, attempt to resume from an existing workdir | `False` |
| `phase` | `str \| None` | The current phase (train, generate, end_to_end) | `None` |
| `auto_discover_adapter` | `bool` | If True and resume=True, auto-discover the latest trained adapter | `False` |
| `wandb_resume_job_id` | `str \| None` | Optional wandb run ID or path to file containing the ID to resume | `None` |
| `skip_wandb` | `bool` | If `True`, skip wandb initialization (used by --validate / preflight) | `False` |
| `quiet` | `bool` | If `True`, suppress console log output (file logging is unaffected) | `False` |
| `run_name` | `str \| None` | Explicit run name for the artifact directory (e.g. `"validate"`). When set, replaces the auto-generated timestamp so repeated runs reuse the same path. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `tuple['CategoryLogger', SafeSynthesizerParameters, pd.DataFrame \| None, Workdir]` | Tuple of (logger, config, dataframe, workdir). For generate-only runs with cached datasets, dataframe may be None (loaded from cached files by SafeSynthesizer). |

Source code in src/nemo_safe_synthesizer/cli/utils.py

def common_setup(
    settings: "CLISettings",
    resume: bool = False,
    phase: str | None = None,
    auto_discover_adapter: bool = False,
    wandb_resume_job_id: str | None = None,
    skip_wandb: bool = False,
    quiet: bool = False,
    run_name: str | None = None,
) -> tuple["CategoryLogger", SafeSynthesizerParameters, pd.DataFrame | None, Workdir]:
    """Common setup for all run commands using unified CLISettings.

    The setup order is:
    1. Create Workdir (establishes artifact paths)
    2. Initialize logging (using workdir.log_file)
    3. Create DatasetRegistry from settings.dataset_registry if present, otherwise create an empty registry
    4. Load dataset from registry if settings.data_source is a known name, otherwise from data_source
    5. Load config with overrides from dataset overrides and command line overrides
    6. Initialize wandb

    Args:
        settings: Unified CLI settings (includes all config from env vars and CLI args)
        resume: If True, attempt to resume from an existing workdir
        phase: The current phase (train, generate, end_to_end)
        auto_discover_adapter: If True and resume=True, auto-discover the latest trained adapter
        wandb_resume_job_id: Optional wandb run ID or path to file containing the ID to resume
        skip_wandb: If ``True``, skip wandb initialization (used by --validate / preflight)
        quiet: If ``True``, suppress console log output (file logging is unaffected)
        run_name: Explicit run name for the artifact directory (e.g. ``"validate"``).
            When set, replaces the auto-generated timestamp so repeated runs reuse the same path.

    Returns:
        Tuple of (logger, config, dataframe, workdir). For generate-only runs with
        cached datasets, dataframe may be None (loaded from cached files by SafeSynthesizer).
    """
    # 0. Propagate CLI-resolved runtime settings back to os.environ. This must
    # run before any deferred pii_replacer imports so that module-level reads
    # of NSS_INFERENCE_*, HF_HUB_OFFLINE/TRANSFORMERS_OFFLINE, and
    # NSS_PII_REPLACER_CPU_COUNT see the CLI-overridden values.
    _propagate_runtime_settings_to_env(settings)

    # 1. Create workdir FIRST - this establishes all artifact paths
    workdir = _create_workdir(
        settings.artifact_path,
        settings.run_path,
        settings.data_source,
        settings.config_path,
        resume=resume,
        phase=phase,
        auto_discover_adapter=auto_discover_adapter,
        run_name=run_name if run_name is not None else "",
    )

    # Ensure directories exist
    workdir.ensure_directories()

    # 2. Initialize logging using the workdir structure and settings
    run_logger = _initialize_logging_for_cli_from_settings(
        settings=settings,
        workdir=workdir,
        quiet=quiet,
    )

    # 3. Create DatasetRegistry
    if settings.dataset_registry:
        dataset_registry = DatasetRegistry.from_yaml(settings.dataset_registry)
    else:
        dataset_registry = DatasetRegistry()

    # 4. Load dataset (or check for cached dataset in resume mode)

    # synthesis_overrides collects config overrides from dataset registry and
    # CLI, which is then combined with the config file when calling
    # merge_overrides(). CLI takes top precedence, then dataset registry, and
    # finally the config file. See test_utils.py and especially
    # test_overrides_config_registry_and_cli for examples of how the resolution
    # is expected to work.
    synthesis_overrides: dict[str, Any] | None = dict()
    df: pd.DataFrame | None = None
    if settings.data_source:
        dataset_info = dataset_registry.get_dataset(settings.data_source)
        if not resume:
            synthesis_overrides = merge_dicts(synthesis_overrides, dataset_info.overrides or dict())
        df = dataset_info.fetch()
    elif resume:
        # For generate-only runs without --data-source, verify cached dataset exists.
        # test.csv may legitimately be absent when holdout=0.
        cached_training = workdir.source_dataset.training
        assert isinstance(cached_training, Path)
        if not cached_training.exists():
            raise click.ClickException(
                f"No cached dataset found in workdir: {workdir.source_dataset.path}\n\n"
                "Either provide --data-source to load a dataset, or ensure the workdir "
                "contains cached training data from a previous run."
            )
        run_logger.info(f"Using cached dataset from: {workdir.source_dataset.path}")
        # df is None - SafeSynthesizer.load_from_save_path() will load from cached files
    else:
        # Should not happen - _create_workdir already validates this
        raise click.ClickException("--data-source is required for new runs")

    # 5. Load config with overrides from settings
    synthesis_overrides = merge_dicts(synthesis_overrides, settings.synthesis_overrides)
    config = merge_overrides(settings.config_path, synthesis_overrides)

    # 6. Initialize wandb (uses workdir for run ID files)
    if not skip_wandb:
        initialize_wandb_run(workdir, resume_job_id=wandb_resume_job_id, cfg=config)

    return run_logger, config, df, workdir
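
The precedence described in the step 4 comment (CLI overrides first, then dataset registry overrides, then the config file) can be sketched with a recursive dictionary merge. `deep_merge` below is a stand-in written for this example; it assumes `merge_dicts` merges nested dictionaries recursively with the second argument winning, which the listing above does not confirm. The `seed` field and all values are hypothetical.

```python
from copy import deepcopy
from typing import Any


def deep_merge(base: dict[str, Any], update: dict[str, Any]) -> dict[str, Any]:
    """Stand-in for merge_dicts: values in `update` win; nested dicts merge recursively."""
    merged = deepcopy(base)
    for key, value in update.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical inputs, lowest precedence first.
file_config = {"data": {"holdout": 0.05, "seed": 1}}
registry_overrides = {"data": {"holdout": 0.1}}
cli_overrides = {"data": {"holdout": 0.2}}

# Same order as common_setup: registry overrides, then CLI on top, then onto the file.
overrides = deep_merge(registry_overrides, cli_overrides)
resolved = deep_merge(file_config, overrides)
print(resolved)  # {'data': {'holdout': 0.2, 'seed': 1}}
```

The CLI value replaces the registry value, and keys set only in the config file are left untouched.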

`merge_overrides(config_path, overrides)`

Merge overrides into a SafeSynthesizerParameters object.

If config_path is None, use the overrides to create a new SafeSynthesizerParameters object. Otherwise, merge the overrides into the config file.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `config_path` | `str \| Path \| None` | Path to config file (YAML) | required |
| `overrides` | `dict` | Dictionary of override values | required |

Returns:

| Type | Description |
| --- | --- |
| `SafeSynthesizerParameters` | Merged SafeSynthesizerParameters |

Source code in src/nemo_safe_synthesizer/cli/utils.py

def merge_overrides(config_path: str | Path | None, overrides: dict) -> SafeSynthesizerParameters:
    """Merge overrides into a SafeSynthesizerParameters object.

    If config_path is None, use the overrides to create a new SafeSynthesizerParameters object.
    Otherwise, merge the overrides into the config file.

    Args:
        config_path: Path to config file (YAML)
        overrides: Dictionary of override values

    Returns:
        Merged SafeSynthesizerParameters
    """
    try:
        if config_path is None:
            my_config = SafeSynthesizerParameters.model_validate(overrides)
        else:
            file_config = SafeSynthesizerParameters.from_yaml(config_path).model_dump(exclude_unset=True)
            params = merge_dicts(file_config, overrides)
            my_config = SafeSynthesizerParameters.model_validate(params)
    except ValidationError as e:
        click.echo(f"{config_path} is invalid:\n{e}")
        sys.exit(1)
    return my_config
