orchestrator

Preflight execution entry point.

run_preflight is the single public entry point; _run_registry handles per-check gating, failure isolation, and result aggregation.
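The gating and failure-isolation behavior described above can be illustrated with a minimal sketch. This is not the library's `_run_registry`: `CheckResult`, the three check classes, `run_registry_sketch`, and the `"passed"`, `"failed"`, and `"crashed"` status strings are hypothetical stand-ins. Only the `CRASH_CODE` value, the `"skipped"` status, and the `enabled()`/`run()` method names come from this page.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Value copied from the module attribute documented on this page.
CRASH_CODE = "preflight.check_crash"


@dataclass
class CheckResult:
    """Hypothetical result record; the real result type is not shown here."""

    name: str
    status: str  # "skipped" appears in the source; other values are assumed
    issue_codes: list[str] = field(default_factory=list)


class PassingCheck:
    name = "passing"

    def enabled(self, ctx: dict) -> bool:
        return True

    def run(self, ctx: dict) -> list[str]:
        return []  # no issues found


class DisabledCheck(PassingCheck):
    name = "disabled"

    def enabled(self, ctx: dict) -> bool:
        return False


class CrashingCheck(PassingCheck):
    name = "crashing"

    def run(self, ctx: dict) -> list[str]:
        raise RuntimeError("boom")


def run_registry_sketch(ctx: dict, checks: list) -> list[CheckResult]:
    results = []
    for check in checks:
        try:
            # Gating: a check that reports itself disabled is recorded as skipped.
            if not check.enabled(ctx):
                results.append(CheckResult(check.name, "skipped"))
                continue
            codes = check.run(ctx)
        except Exception:
            # Failure isolation: record the crash and move on to the next check.
            results.append(CheckResult(check.name, "crashed", [CRASH_CODE]))
            continue
        status = "failed" if codes else "passed"
        results.append(CheckResult(check.name, status, list(codes)))
    return results


results = run_registry_sketch({}, [PassingCheck(), DisabledCheck(), CrashingCheck()])
for r in results:
    print(r.name, r.status, r.issue_codes)
```

Wrapping both `enabled()` and `run()` in the same `try` block means one misbehaving check cannot abort the checks that follow it, which is the property the crash code exists to report.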

Functions:

| Name | Description |
| --- | --- |
| run_preflight | Execute all pre-flight checks against the training split. |

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| CRASH_CODE | | Issue code used when a check raises from enabled() or run(). |

CRASH_CODE = 'preflight.check_crash' module-attribute

Issue code used when a check raises from enabled() or run().

run_preflight(data, config, metadata, *, registry=None, stages=None)

Execute all pre-flight checks against the training split.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data | DataFrame | The training split produced by Holdout.train_test_split. On a full run this is also post-PII replacement; on --validate PII replacement is skipped. Row counts, group sizes, and column statistics reflect this partition, not the original input dataset. | required |
| config | SafeSynthesizerParameters | Resolved configuration (AutoConfigResolver already ran). | required |
| metadata | ModelMetadata | Model metadata (tokenizer and context length). | required |
| registry | PreflightRegistry \| None | Optional check registry. When None, the registry returned by _registry.get_registry() is used. | None |
| stages | frozenset[PreflightStage] \| None | Optional subset of stages to execute. Used when callers need early DataFrame validation before later processing has produced the final training split. | None |
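One way to read the `stages` parameter is as a filter over the registered checks. The sketch below illustrates that reading under stated assumptions: `Stage`, its member names, `Check`, and `select_checks` are hypothetical stand-ins, since the real `PreflightStage` members are not listed on this page. Treating `None` as "run every stage" is inferred from the default value and is not confirmed by the source shown here.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    # Hypothetical stage names, used only for illustration.
    EARLY = "early"
    FINAL = "final"


@dataclass(frozen=True)
class Check:
    """Hypothetical check descriptor carrying the stage it belongs to."""

    name: str
    stage: Stage


def select_checks(
    checks: list[Check],
    stages: frozenset[Stage] | None = None,
) -> list[Check]:
    """Return every check when stages is None, else only those in the requested stages."""
    if stages is None:
        return list(checks)
    return [c for c in checks if c.stage in stages]


checks = [Check("schema", Stage.EARLY), Check("token_budget", Stage.FINAL)]
all_names = [c.name for c in select_checks(checks)]
early_names = [c.name for c in select_checks(checks, frozenset({Stage.EARLY}))]
print(all_names, early_names)
```

A `frozenset` is hashable and immutable, so callers can pass the same stage selection around without risk of it being modified along the way.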

Returns:

| Type | Description |
| --- | --- |
| PreflightReport | A structured PreflightReport. |

Source code in src/nemo_safe_synthesizer/preflight/orchestrator.py
@traced("preflight", category=LogCategory.USER)
def run_preflight(
    data: pd.DataFrame,
    config: SafeSynthesizerParameters,
    metadata: ModelMetadata,
    *,
    registry: PreflightRegistry | None = None,
    stages: frozenset[PreflightStage] | None = None,
) -> PreflightReport:
    """Execute all pre-flight checks against the training split.

    Args:
        data: The training split produced by ``Holdout.train_test_split``.
            On a full run this is also post-PII replacement; on
            ``--validate`` PII replacement is skipped. Row counts, group
            sizes, and column statistics reflect this partition, not the
            original input dataset.
        config: Resolved configuration (``AutoConfigResolver`` already ran).
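        metadata: Model metadata (tokenizer and context length).
        registry: Optional check registry. When ``None``, the registry
            returned by ``_registry.get_registry()`` is used.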
        stages: Optional subset of stages to execute. Used when callers
            need early DataFrame validation before later processing has
            produced the final training split.

    Returns:
        A structured ``PreflightReport``.
    """
    effective_registry = _registry.get_registry() if registry is None else registry
    _warn_unknown_disabled_checks(config, effective_registry)

    ctx = PreflightContext(data=data, config=config, metadata=metadata)
    report = PreflightReport(checks=_run_registry(ctx, effective_registry, stages=stages))
    n_checks = len(report.checks)
    n_skipped = sum(1 for c in report.checks if c.status == "skipped")
    n_errors = len(report.errors)
    n_warns = len(report.warnings)
    logger.user.info(
        f"Preflight: {n_checks - n_skipped} check(s) ran, {n_skipped} skipped — "
        f"{n_errors} error(s), {n_warns} warning(s)",
    )
    logger.runtime.debug(
        "Preflight complete",
        extra={"errors": n_errors, "warnings": n_warns},
    )
    return report