utils

CLI utility functions for Safe Synthesizer.

This module provides utility functions for CLI commands including:

  • Logging initialization
  • Dataset loading
  • Configuration merging
  • Result saving

Functions:

| Name | Description |
| --- | --- |
| `common_setup` | Common setup for all run commands using unified CLISettings. |
| `merge_overrides` | Merge overrides into a SafeSynthesizerParameters object. |

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `CLI_NESTED_FIELD_SEPARATOR` | `str` | Separator used to denote nested fields in CLI options, e.g. `--data__holdout=0.1`. |
| `VERBOSITY_TO_LOG_LEVEL` | `dict[int, Literal['INFO', 'DEBUG', 'DEBUG_DEPENDENCIES']]` | Mapping from CLI verbosity level to log level. |

`CLI_NESTED_FIELD_SEPARATOR = '__'` (module attribute)

Separator used to denote nested fields in CLI options, e.g. `--data__holdout=0.1`.

`VERBOSITY_TO_LOG_LEVEL = {0: 'INFO', 1: 'DEBUG', 2: 'DEBUG_DEPENDENCIES'}` (module attribute)

Mapping from CLI verbosity level to log level.
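As a hedged illustration of how the separator is meant to be read (the actual CLI parsing helper is not shown in this module, so `expand_nested_option` below is a hypothetical stand-in), a flat option name like `data__holdout` can be expanded into a nested override dict:

```python
CLI_NESTED_FIELD_SEPARATOR = "__"


def expand_nested_option(name: str, value: object) -> dict:
    """Turn a flat CLI option name like 'data__holdout' into a nested
    override dict: {'data': {'holdout': value}}.

    Hypothetical helper for illustration; not part of this module.
    """
    nested: object = value
    # Wrap the value from the innermost key outward.
    for part in reversed(name.split(CLI_NESTED_FIELD_SEPARATOR)):
        nested = {part: nested}
    return nested  # type: ignore[return-value]


print(expand_nested_option("data__holdout", 0.1))
# {'data': {'holdout': 0.1}}
```

A name without the separator simply becomes a one-key dict, so flat and nested options can share one code path.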

common_setup(settings, resume=False, phase=None, auto_discover_adapter=False, wandb_resume_job_id=None)

Common setup for all run commands using unified CLISettings.

The setup order is:

1. Create Workdir (establishes artifact paths)
2. Initialize logging (using workdir.log_file)
3. Create DatasetRegistry from settings.dataset_registry if present, otherwise create an empty registry
4. Load dataset from the registry if settings.data_source is a known name, otherwise from data_source
5. Load config with overrides from dataset overrides and command-line overrides
6. Initialize wandb

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `settings` | `'CLISettings'` | Unified CLI settings (includes all config from env vars and CLI args) | *required* |
| `resume` | `bool` | If True, attempt to resume from an existing workdir | `False` |
| `phase` | `str \| None` | The current phase (`train`, `generate`, `end_to_end`) | `None` |
| `auto_discover_adapter` | `bool` | If True and `resume=True`, auto-discover the latest trained adapter | `False` |
| `wandb_resume_job_id` | `str \| None` | Optional wandb run ID, or path to a file containing the ID, to resume | `None` |

Returns:

| Type | Description |
| --- | --- |
| `tuple['CategoryLogger', SafeSynthesizerParameters, pd.DataFrame \| None, Workdir]` | Tuple of (logger, config, dataframe, workdir). For generate-only runs with cached datasets, dataframe may be None (loaded from cached files by SafeSynthesizer). |

Source code in src/nemo_safe_synthesizer/cli/utils.py
def common_setup(
    settings: "CLISettings",
    resume: bool = False,
    phase: str | None = None,
    auto_discover_adapter: bool = False,
    wandb_resume_job_id: str | None = None,
) -> tuple["CategoryLogger", SafeSynthesizerParameters, pd.DataFrame | None, Workdir]:
    """Common setup for all run commands using unified CLISettings.

    The setup order is:
    1. Create Workdir (establishes artifact paths)
    2. Initialize logging (using workdir.log_file)
    3. Create DatasetRegistry from settings.dataset_registry if present, otherwise create an empty registry
    4. Load dataset from registry if settings.data_source is a known name, otherwise from data_source
    5. Load config with overrides from dataset overrides and command line overrides
    6. Initialize wandb

    Args:
        settings: Unified CLI settings (includes all config from env vars and CLI args)
        resume: If True, attempt to resume from an existing workdir
        phase: The current phase (train, generate, end_to_end)
        auto_discover_adapter: If True and resume=True, auto-discover the latest trained adapter
        wandb_resume_job_id: Optional wandb run ID or path to file containing the ID to resume

    Returns:
        Tuple of (logger, config, dataframe, workdir). For generate-only runs with
        cached datasets, dataframe may be None (loaded from cached files by SafeSynthesizer).
    """
    # 1. Create workdir FIRST - this establishes all artifact paths
    workdir = _create_workdir(
        settings.artifact_path,
        settings.run_path,
        settings.data_source,
        settings.config_path,
        resume=resume,
        phase=phase,
        auto_discover_adapter=auto_discover_adapter,
    )

    # Ensure directories exist
    workdir.ensure_directories()

    # 2. Initialize logging using the workdir structure and settings
    run_logger = _initialize_logging_for_cli_from_settings(
        settings=settings,
        workdir=workdir,
    )

    # 3. Create DatasetRegistry
    if settings.dataset_registry:
        dataset_registry = DatasetRegistry.from_yaml(settings.dataset_registry)
    else:
        dataset_registry = DatasetRegistry()

    # 4. Load dataset (or check for cached dataset in resume mode)

    # synthesis_overrides collects config overrides from dataset registry and
    # CLI, which is then combined with the config file when calling
    # merge_overrides(). CLI takes top precedence, then dataset registry, and
    # finally the config file. See test_utils.py and especially
    # test_overrides_config_registry_and_cli for examples of how the resolution
    # is expected to work.
    synthesis_overrides: dict[str, Any] | None = dict()
    df: pd.DataFrame | None = None
    if settings.data_source:
        dataset_info = dataset_registry.get_dataset(settings.data_source)
        synthesis_overrides = merge_dicts(synthesis_overrides, dataset_info.overrides or dict())
        df = dataset_info.fetch()
    elif resume:
        # For generate-only runs without --data-source, verify cached dataset exists.
        # test.csv may legitimately be absent when holdout=0.
        cached_training: Path = workdir.source_dataset.training  # type: ignore[assignment]
        if not cached_training.exists():
            raise click.ClickException(
                f"No cached dataset found in workdir: {workdir.source_dataset.path}\n\n"
                "Either provide --data-source to load a dataset, or ensure the workdir "
                "contains cached training data from a previous run."
            )
        run_logger.info(f"Using cached dataset from: {workdir.source_dataset.path}")
        # df is None - SafeSynthesizer.load_from_save_path() will load from cached files
    else:
        # Should not happen - _create_workdir already validates this
        raise click.ClickException("--data-source is required for new runs")

    # 5. Load config with overrides from settings
    synthesis_overrides = merge_dicts(synthesis_overrides, settings.synthesis_overrides)
    config = merge_overrides(settings.config_path, synthesis_overrides)

    # 6. Initialize wandb (uses workdir for run ID files)
    initialize_wandb_run(workdir, resume_job_id=wandb_resume_job_id, cfg=config)

    return run_logger, config, df, workdir
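The override precedence described in the source comment (CLI beats dataset registry, which beats the config file) can be sketched with plain dicts. `merge_dicts` itself is not shown in this module, so the recursive deep merge below is an assumption about its behavior, not its actual implementation:

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base; override wins on conflicts.

    Hypothetical stand-in for this module's merge_dicts helper.
    """
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Illustrative values only; 'data.holdout' mirrors the documented
# --data__holdout option, 'seed' is a made-up field.
config_file = {"data": {"holdout": 0.2, "seed": 7}}
registry_overrides = {"data": {"holdout": 0.1}}
cli_overrides = {"data": {"seed": 42}}

# Registry overrides are applied first, then CLI overrides on top,
# matching the order in which common_setup accumulates them.
resolved = deep_merge(deep_merge(config_file, registry_overrides), cli_overrides)
print(resolved)
# {'data': {'holdout': 0.1, 'seed': 42}}
```

Because later merges win key-by-key, a CLI override replaces only the leaf it names while unrelated config-file and registry values survive.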

merge_overrides(config_path, overrides)

Merge overrides into a SafeSynthesizerParameters object.

If config_path is None, use the overrides to create a new SafeSynthesizerParameters object. Otherwise, merge the overrides into the config file.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `config_path` | `str \| Path \| None` | Path to config file (YAML) | *required* |
| `overrides` | `dict` | Dictionary of override values | *required* |

Returns:

| Type | Description |
| --- | --- |
| `SafeSynthesizerParameters` | Merged SafeSynthesizerParameters |

Source code in src/nemo_safe_synthesizer/cli/utils.py
def merge_overrides(config_path: str | Path | None, overrides: dict) -> SafeSynthesizerParameters:
    """Merge overrides into a SafeSynthesizerParameters object.

    If config_path is None, use the overrides to create a new SafeSynthesizerParameters object.
    Otherwise, merge the overrides into the config file.

    Args:
        config_path: Path to config file (YAML)
        overrides: Dictionary of override values

    Returns:
        Merged SafeSynthesizerParameters
    """
    try:
        if config_path is None:
            my_config = SafeSynthesizerParameters.model_validate(overrides)
        else:
            params = merge_dicts(
                SafeSynthesizerParameters.from_yaml(config_path).model_dump(exclude_unset=False), overrides
            )
            my_config = SafeSynthesizerParameters.model_validate(params)
    except ValidationError as e:
        click.echo(f"{config_path} is invalid:\n{e}")
        sys.exit(1)
    return my_config