utils

CLI utility functions for Safe Synthesizer.

This module provides utility functions for CLI commands including:

  • Logging initialization
  • Dataset loading
  • Configuration merging
  • Result saving

Functions:

  • common_setup: Common setup for all run commands using unified CLISettings.
  • merge_overrides: Merge overrides into a SafeSynthesizerParameters object.

Attributes:

  • CLI_NESTED_FIELD_SEPARATOR: Separator used to denote nested fields in CLI options.
  • VERBOSITY_TO_LOG_LEVEL (dict[int, Literal['INFO', 'DEBUG', 'DEBUG_DEPENDENCIES']]): Mapping from CLI verbosity level to log level.

CLI_NESTED_FIELD_SEPARATOR = '__' module-attribute

Separator used to denote nested fields in CLI options.

This must match the field_separator passed to pydantic_options and the field_sep used by parse_overrides; otherwise a Click option such as --data__holdout=0.1 will not become {"data": {"holdout": 0.1}}.
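The role of the separator can be sketched as follows. This is a minimal illustration, not the library's code: `nest_overrides` is a hypothetical helper, and `SEPARATOR` is assumed to mirror `CLI_NESTED_FIELD_SEPARATOR`. It also shows why a mismatched separator leaves the key flat.

```python
from typing import Any

# Assumed to mirror CLI_NESTED_FIELD_SEPARATOR from this module.
SEPARATOR = "__"


def nest_overrides(flat: dict[str, Any], sep: str = SEPARATOR) -> dict[str, Any]:
    """Hypothetical helper: expand {"data__holdout": 0.1} into nested dicts."""
    nested: dict[str, Any] = {}
    for key, value in flat.items():
        # Everything before the last separator names a parent; the rest is the leaf.
        *parents, leaf = key.split(sep)
        node = nested
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


print(nest_overrides({"data__holdout": 0.1}))           # {'data': {'holdout': 0.1}}
print(nest_overrides({"data__holdout": 0.1}, sep="."))  # {'data__holdout': 0.1}
```

With the wrong separator the key is never split, so the value would not reach the nested `data.holdout` field.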

VERBOSITY_TO_LOG_LEVEL = {0: 'INFO', 1: 'DEBUG', 2: 'DEBUG_DEPENDENCIES'} module-attribute

Mapping from CLI verbosity level to log level.
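One way such a mapping might be consumed is sketched below. The `log_level_for` function is hypothetical, and the clamping of out-of-range values is an assumption made for illustration; the source does not state how the CLI handles verbosity values outside 0 to 2.

```python
# Copied from the documented value of the module attribute.
VERBOSITY_TO_LOG_LEVEL = {0: "INFO", 1: "DEBUG", 2: "DEBUG_DEPENDENCIES"}


def log_level_for(verbosity: int) -> str:
    """Hypothetical lookup that clamps out-of-range values to the nearest key."""
    lowest = min(VERBOSITY_TO_LOG_LEVEL)
    highest = max(VERBOSITY_TO_LOG_LEVEL)
    clamped = max(lowest, min(verbosity, highest))
    return VERBOSITY_TO_LOG_LEVEL[clamped]


print(log_level_for(0))  # INFO
print(log_level_for(5))  # DEBUG_DEPENDENCIES (clamped; an assumption of this sketch)
```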

common_setup(settings, resume=False, phase=None, auto_discover_adapter=False, wandb_resume_job_id=None, skip_wandb=False, quiet=False, run_name=None)

Common setup for all run commands using unified CLISettings.

The setup order is:

  1. Create Workdir (establishes artifact paths)
  2. Initialize logging (using workdir.log_file)
  3. Create DatasetRegistry from settings.dataset_registry if present, otherwise create an empty registry
  4. Load dataset from registry if settings.data_source is a known name, otherwise from data_source
  5. Load config with overrides from dataset overrides and command line overrides
  6. Initialize wandb

Parameters:

  • settings ('CLISettings', required): Unified CLI settings (includes all config from env vars and CLI args)
  • resume (bool, default False): If True, attempt to resume from an existing workdir
  • phase (str | None, default None): The current phase (train, generate, end_to_end)
  • auto_discover_adapter (bool, default False): If True and resume=True, auto-discover the latest trained adapter
  • wandb_resume_job_id (str | None, default None): Optional wandb run ID or path to file containing the ID to resume
  • skip_wandb (bool, default False): If True, skip wandb initialization (used by --validate / preflight)
  • quiet (bool, default False): If True, suppress console log output (file logging is unaffected)
  • run_name (str | None, default None): Explicit run name for the artifact directory (e.g. "validate"). When set, replaces the auto-generated timestamp so repeated runs reuse the same path.

Returns:

  • tuple['CategoryLogger', SafeSynthesizerParameters, pd.DataFrame | None, Workdir]: Tuple of (logger, config, dataframe, workdir). For generate-only runs with cached datasets, dataframe may be None (loaded from cached files by SafeSynthesizer).

Source code in src/nemo_safe_synthesizer/cli/utils.py
def common_setup(
    settings: "CLISettings",
    resume: bool = False,
    phase: str | None = None,
    auto_discover_adapter: bool = False,
    wandb_resume_job_id: str | None = None,
    skip_wandb: bool = False,
    quiet: bool = False,
    run_name: str | None = None,
) -> tuple["CategoryLogger", SafeSynthesizerParameters, pd.DataFrame | None, Workdir]:
    """Common setup for all run commands using unified CLISettings.

    The setup order is:
    1. Create Workdir (establishes artifact paths)
    2. Initialize logging (using workdir.log_file)
    3. Create DatasetRegistry from settings.dataset_registry if present, otherwise create an empty registry
    4. Load dataset from registry if settings.data_source is a known name, otherwise from data_source
    5. Load config with overrides from dataset overrides and command line overrides
    6. Initialize wandb

    Args:
        settings: Unified CLI settings (includes all config from env vars and CLI args)
        resume: If True, attempt to resume from an existing workdir
        phase: The current phase (train, generate, end_to_end)
        auto_discover_adapter: If True and resume=True, auto-discover the latest trained adapter
        wandb_resume_job_id: Optional wandb run ID or path to file containing the ID to resume
        skip_wandb: If ``True``, skip wandb initialization (used by --validate / preflight)
        quiet: If ``True``, suppress console log output (file logging is unaffected)
        run_name: Explicit run name for the artifact directory (e.g. ``"validate"``).
            When set, replaces the auto-generated timestamp so repeated runs reuse the same path.

    Returns:
        Tuple of (logger, config, dataframe, workdir). For generate-only runs with
        cached datasets, dataframe may be None (loaded from cached files by SafeSynthesizer).
    """
    # 1. Create workdir FIRST - this establishes all artifact paths
    workdir = _create_workdir(
        settings.artifact_path,
        settings.run_path,
        settings.data_source,
        settings.config_path,
        resume=resume,
        phase=phase,
        auto_discover_adapter=auto_discover_adapter,
        run_name=run_name if run_name is not None else "",
    )

    # Ensure directories exist
    workdir.ensure_directories()

    # 2. Initialize logging using the workdir structure and settings
    run_logger = _initialize_logging_for_cli_from_settings(
        settings=settings,
        workdir=workdir,
        quiet=quiet,
    )

    # 3. Create DatasetRegistry
    if settings.dataset_registry:
        dataset_registry = DatasetRegistry.from_yaml(settings.dataset_registry)
    else:
        dataset_registry = DatasetRegistry()

    # 4. Load dataset (or check for cached dataset in resume mode)

    # synthesis_overrides collects config overrides from dataset registry and
    # CLI, which is then combined with the config file when calling
    # merge_overrides(). CLI takes top precedence, then dataset registry, and
    # finally the config file. See test_utils.py and especially
    # test_overrides_config_registry_and_cli for examples of how the resolution
    # is expected to work.
    synthesis_overrides: dict[str, Any] | None = dict()
    df: pd.DataFrame | None = None
    if settings.data_source:
        dataset_info = dataset_registry.get_dataset(settings.data_source)
        synthesis_overrides = merge_dicts(synthesis_overrides, dataset_info.overrides or dict())
        df = dataset_info.fetch()
    elif resume:
        # For generate-only runs without --data-source, verify cached dataset exists.
        # test.csv may legitimately be absent when holdout=0.
        cached_training = workdir.source_dataset.training
        assert isinstance(cached_training, Path)
        if not cached_training.exists():
            raise click.ClickException(
                f"No cached dataset found in workdir: {workdir.source_dataset.path}\n\n"
                "Either provide --data-source to load a dataset, or ensure the workdir "
                "contains cached training data from a previous run."
            )
        run_logger.info(f"Using cached dataset from: {workdir.source_dataset.path}")
        # df is None - SafeSynthesizer.load_from_save_path() will load from cached files
    else:
        # Should not happen - _create_workdir already validates this
        raise click.ClickException("--data-source is required for new runs")

    # 5. Load config with overrides from settings
    synthesis_overrides = merge_dicts(synthesis_overrides, settings.synthesis_overrides)
    config = merge_overrides(settings.config_path, synthesis_overrides)

    # 6. Initialize wandb (uses workdir for run ID files)
    if not skip_wandb:
        initialize_wandb_run(workdir, resume_job_id=wandb_resume_job_id, cfg=config)

    return run_logger, config, df, workdir

merge_overrides(config_path, overrides)

Merge overrides into a SafeSynthesizerParameters object.

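If config_path is None, the overrides alone are validated into a new SafeSynthesizerParameters object. Otherwise, the config file is loaded and the overrides are merged on top of its values before validation. If validation fails, the error is printed and the process exits with status 1.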
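The precedence described in common_setup (CLI overrides first, then dataset registry overrides, then the config file) can be illustrated with a small sketch. Here `deep_merge` is a hypothetical stand-in for `merge_dicts`, assumed to merge nested dictionaries recursively with the second argument winning; the real helper may differ. The `max_rows` field is made up for illustration and is not taken from the real schema.

```python
from typing import Any


def deep_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical stand-in for merge_dicts: recursive merge, override wins."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge them key by key instead of replacing.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Illustrative values only.
config_file = {"data": {"holdout": 0.05, "max_rows": 1000}}
registry_overrides = {"data": {"holdout": 0.1}}
cli_overrides = {"data": {"holdout": 0.2}}

# Registry overrides are applied first, then CLI overrides on top of them.
overrides = deep_merge(registry_overrides, cli_overrides)
# The combined overrides are then layered over the config file values.
final = deep_merge(config_file, overrides)
print(final)  # {'data': {'holdout': 0.2, 'max_rows': 1000}}
```

In the real function the merged dictionary is then passed to SafeSynthesizerParameters.model_validate, so an invalid combination surfaces as a validation error rather than a silently wrong value.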

Parameters:

  • config_path (str | Path | None, required): Path to config file (YAML)
  • overrides (dict, required): Dictionary of override values

Returns:

  • SafeSynthesizerParameters: Merged SafeSynthesizerParameters

Source code in src/nemo_safe_synthesizer/cli/utils.py
def merge_overrides(config_path: str | Path | None, overrides: dict) -> SafeSynthesizerParameters:
    """Merge overrides into a SafeSynthesizerParameters object.

    If config_path is None, use the overrides to create a new SafeSynthesizerParameters object.
    Otherwise, merge the overrides into the config file.

    Args:
        config_path: Path to config file (YAML)
        overrides: Dictionary of override values

    Returns:
        Merged SafeSynthesizerParameters
    """
    try:
        if config_path is None:
            my_config = SafeSynthesizerParameters.model_validate(overrides)
        else:
            params = merge_dicts(
                SafeSynthesizerParameters.from_yaml(config_path).model_dump(exclude_unset=False), overrides
            )
            my_config = SafeSynthesizerParameters.model_validate(params)
    except ValidationError as e:
        click.echo(f"{config_path} is invalid:\n{e}")
        sys.exit(1)
    return my_config