CLI Framework#
The nemotron.kit CLI framework is built on Typer and provides tools for building hierarchical command-line interfaces for training recipes, with native integration with NeMo-Run for remote execution.
```console
$ uv run nemotron nano3 sft --help
Usage: nemotron nano3 sft [OPTIONS]

Run supervised fine-tuning with Megatron-Bridge (stage1).

╭─ Options ────────────────────────────────────────────────────────────────╮
│ --help -h    Show this message and exit.                                 │
╰──────────────────────────────────────────────────────────────────────────╯
╭─ Global Options ─────────────────────────────────────────────────────────╮
│ -c, --config NAME    Config name or path                                 │
│ -r, --run PROFILE    Submit to cluster (attached)                        │
│ -b, --batch PROFILE  Submit to cluster (detached)                        │
│ -d, --dry-run        Preview config without execution                    │
│ --stage              Stage files for interactive debugging               │
╰──────────────────────────────────────────────────────────────────────────╯
╭─ Configs (-c/--config) ──────────────────────────────────────────────────╮
│ Built-in: default, tiny                                                  │
│ Custom: -c /path/to/your/config.yaml                                     │
╰──────────────────────────────────────────────────────────────────────────╯
╭─ Artifact Overrides (W&B artifact references) ───────────────────────────╮
│ run.data     SFT data artifact (packed .npy)                             │
│ run.model    Base model checkpoint artifact                              │
╰──────────────────────────────────────────────────────────────────────────╯
╭─ Run Overrides (override env.toml settings) ─────────────────────────────╮
│ run.env.nodes            Number of nodes                                 │
│ run.env.nproc_per_node   GPUs per node                                   │
│ run.env.partition        Slurm partition                                 │
│ run.env.account          Slurm account                                   │
│ run.env.time             Job time limit (e.g., 04:00:00)                 │
│ run.env.container_image  Override container image                        │
╰──────────────────────────────────────────────────────────────────────────╯
╭─ env.toml Profiles ──────────────────────────────────────────────────────╮
│ Available profiles: my-cluster, my-cluster-large                         │
│ Usage: --run PROFILE or --batch PROFILE                                  │
╰──────────────────────────────────────────────────────────────────────────╯
╭─ Examples ───────────────────────────────────────────────────────────────╮
│ $ ... sft -c tiny                       Local execution                  │
│ $ ... sft -c tiny --dry-run             Preview config                   │
│ $ ... sft -c tiny --run my-cluster      Submit to cluster                │
│ $ ... sft -c tiny -r cluster run.env.nodes=4                             │
╰──────────────────────────────────────────────────────────────────────────╯
```
Overview#
The CLI framework enables:
- **Nested Commands** — Build hierarchical CLIs like `uv run nemotron nano3 data prep pretrain`
- **Config Integration** — Automatic YAML config loading with dotlist overrides
- **Artifact Resolution** — Map W&B artifacts to config fields automatically
- **Remote Execution** — Submit jobs to Slurm via NeMo-Run with `--run`/`--batch`
For artifacts and configuration, see Nemotron Kit. For execution profiles, see Execution through NeMo-Run.
Architecture#
```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'primaryTextColor': '#333333', 'clusterBkg': '#f5f5f5', 'clusterBorder': '#666666'}}}%%
flowchart LR
    subgraph cli["CLI Layer (Typer)"]
        Root["nemotron"]
        Recipe["nano3"]
        Commands["pretrain/sft/rl"]
    end
    subgraph config["Configuration"]
        YAML["YAML Config"]
        Dotlist["Dotlist Overrides"]
        Artifacts["Artifact Resolution"]
    end
    subgraph execution["Execution Modes"]
        Local["Local (torchrun)"]
        NemoRun["NeMo-Run"]
        Ray["Ray Jobs"]
    end
    Root --> Recipe --> Commands
    Commands --> config
    config --> execution
```
The @recipe Decorator#
Commands are defined using the @recipe decorator, which wraps Typer commands with standardized config loading and execution logic:
```python
from nemotron.kit.cli.recipe import recipe
import typer


@recipe(
    name="nano3/pretrain",
    script_path="src/nemotron/recipes/nano3/stage0_pretrain/train.py",
    config_dir="src/nemotron/recipes/nano3/stage0_pretrain/config",
    default_config="default",
    packager="self_contained",
    torchrun=True,
    ray=False,
    artifacts={
        "data": {
            "default": "PretrainBlendsArtifact-default",
            "mappings": {"path": "recipe.per_split_data_args_path"},
        },
    },
)
def pretrain(ctx: typer.Context) -> None:
    """Run pretraining with Megatron-Bridge."""
    pass  # Execution handled by decorator
```
Decorator Parameters#
| Parameter | Type | Description |
|---|---|---|
| `name` | `str` | Recipe identifier (e.g., `nano3/pretrain`) |
| `script_path` | `str` | Path to the training script |
| `config_dir` | `str` | Directory containing YAML configs |
| `default_config` | `str` | Default config name (without `.yaml` extension) |
| `packager` | `str` | Code packaging strategy (e.g., `self_contained`) |
| `torchrun` | `bool` | Use `torchrun` for the distributed launch |
| `ray` | `bool` | Submit as Ray job (for data prep, RL) |
| `artifacts` | `dict` | Artifact-to-config mappings |
| | | Custom command template for Ray jobs |
Registering Commands#
Commands are registered on Typer apps with specific context settings:
```python
nano3_app = typer.Typer(name="nano3", help="Nano3 training recipe")

nano3_app.command(
    name="pretrain",
    context_settings={
        "allow_extra_args": True,  # Capture dotlist overrides
        "ignore_unknown_options": True,  # Pass through unknown flags
    },
)(pretrain)
```
The allow_extra_args=True setting is critical—it allows commands to capture Hydra-style key=value overrides.
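As a minimal sketch, the captured extra args can be split into dotlist overrides versus passthrough flags like this (the helper name `split_extra_args` is illustrative, not the actual nemotron API):

```python
def split_extra_args(extra_args: list[str]) -> tuple[list[str], list[str]]:
    """Split captured extra args into dotlist overrides and passthrough args."""
    dotlist, passthrough = [], []
    for arg in extra_args:
        # Hydra-style overrides look like key.path=value (no leading dashes)
        if "=" in arg and not arg.startswith("-"):
            dotlist.append(arg)
        else:
            passthrough.append(arg)
    return dotlist, passthrough


# Example: args left over after Typer consumes the known options
dotlist, passthrough = split_extra_args(
    ["train.train_iters=1000", "--verbose", "run.env.nodes=4"]
)
```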
Global Options#
All recipe commands automatically receive these global options:
| Option | Short | Description |
|---|---|---|
| `--config` | `-c` | Config name or path (from the recipe's config directory) |
| `--run` | `-r` | Attached NeMo-Run execution (waits, streams logs) |
| `--batch` | `-b` | Detached NeMo-Run execution (submits, exits) |
| `--dry-run` | `-d` | Preview config without executing |
| `--stage` | | Stage files to remote for debugging |
| `key=value` | | Dotlist overrides (any position) |
GlobalContext#
Global options are captured in a GlobalContext dataclass:
```python
from dataclasses import dataclass, field


@dataclass
class GlobalContext:
    config: str | None = None  # -c/--config value
    run: str | None = None     # --run profile name
    batch: str | None = None   # --batch profile name
    dry_run: bool = False      # --dry-run flag
    stage: bool = False        # --stage flag
    dotlist: list[str] = field(default_factory=list)      # key=value overrides
    passthrough: list[str] = field(default_factory=list)  # Unknown args for script
```
Key properties:

- `mode` → `"run"`, `"batch"`, or `"local"`
- `profile` → Environment profile name (from `--run` or `--batch`)
Configuration Pipeline#
The ConfigBuilder class orchestrates config loading:
```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'primaryTextColor': '#333333'}}}%%
flowchart LR
    Default["default.yaml"] --> Merge
    Config["--config"] --> Merge
    Dotlist["key=value"] --> Merge
    Merge --> JobConfig["job.yaml"]
    Merge --> TrainConfig["train.yaml"]
```
Two-Config System#
The CLI generates two config files:
| File | Purpose |
|---|---|
| `job.yaml` | Full provenance: config + CLI args + env profile |
| `train.yaml` | Clean config for script (paths rewritten for remote) |
job.yaml structure:
```yaml
recipe:
  _target_: megatron.bridge.recipes...
  per_split_data_args_path: /data/blend.json
train:
  train_iters: 1000
run:
  mode: "run"
  profile: "my-cluster"
  env:
    executor: "slurm"
    nodes: 4
    gpus_per_node: 8
cli:
  argv: ["nemotron", "nano3", "pretrain", "-c", "tiny", "--run", "my-cluster"]
  dotlist: ["train.train_iters=1000"]
wandb:
  entity: "nvidia"
  project: "nemotron"
```
Dotlist Overrides#
Override any config value with key.path=value syntax:
```shell
# Override nested values
uv run nemotron nano3 pretrain train.train_iters=5000

# Multiple overrides
uv run nemotron nano3 pretrain \
    train.train_iters=5000 \
    train.micro_batch_size=2 \
    run.data=PretrainBlendsArtifact-v2:latest
```
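The CLI merges these overrides into the loaded config. A pure-Python sketch of applying one `key.path=value` override to a nested config dict (illustrative only; the framework itself uses OmegaConf for merging):

```python
def apply_override(cfg: dict, override: str) -> None:
    """Apply a single key.path=value override to a nested dict in place."""
    key_path, _, raw = override.partition("=")
    keys = key_path.split(".")
    node = cfg
    for key in keys[:-1]:
        node = node.setdefault(key, {})  # create intermediate levels as needed
    # Naive scalar parsing: integer if possible, otherwise keep the string
    node[keys[-1]] = int(raw) if raw.lstrip("-").isdigit() else raw


cfg = {"train": {"train_iters": 1000}}
apply_override(cfg, "train.train_iters=5000")
apply_override(cfg, "train.micro_batch_size=2")
```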
Execution Modes#
Local Execution#
Without --run or --batch, commands execute locally:
```shell
# Local execution (no NeMo-Run)
uv run nemotron nano3 pretrain -c tiny

# Equivalent to:
python -m torch.distributed.run \
    --nproc_per_node=1 \
    src/nemotron/recipes/nano3/stage0_pretrain/train.py \
    --config train.yaml
```
NeMo-Run Attached (--run)#
Submit job and wait for completion, streaming logs:
```shell
uv run nemotron nano3 pretrain -c tiny --run MY-CLUSTER
```
NeMo-Run Detached (--batch)#
Submit job and exit immediately:
```shell
uv run nemotron nano3 pretrain -c tiny --batch MY-CLUSTER
```
Ray Jobs#
For recipes with ray=True (data prep, RL), jobs are submitted via Ray:
```shell
# Data prep uses Ray for distributed processing
uv run nemotron nano3 data prep pretrain --run MY-CLUSTER

# RL uses Ray for actor orchestration
uv run nemotron nano3 rl -c tiny --run MY-CLUSTER
```
Artifact Inputs#
Map W&B artifacts to config fields:
```python
@recipe(
    ...,
    artifacts={
        "data": {
            "default": "PretrainBlendsArtifact-default:latest",
            "mappings": {"path": "recipe.per_split_data_args_path"},
        },
        "model": {
            "default": "ModelArtifact-sft:latest",
            "mappings": {"path": "model.init_from_path"},
        },
    },
)
```
CLI Override#
Override artifacts via dotlist:
```shell
uv run nemotron nano3 sft --run MY-CLUSTER \
    run.data=PretrainBlendsArtifact-v2:latest \
    run.model=ModelArtifact-pretrain:v3
```
Config Resolver#
Use ${art:...} in YAML configs:
```yaml
run:
  data: PretrainBlendsArtifact-default:latest
recipe:
  per_split_data_args_path: ${art:data,path}/blend.json
```
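A sketch of what such a resolver does: substitute the resolved artifact field into the config string. This regex-based version is an illustration (the real resolver is wired into the config system, and the artifact path below is hypothetical):

```python
import re


def resolve_art(value: str, artifacts: dict[str, dict[str, str]]) -> str:
    """Replace ${art:name,field} references with resolved artifact fields."""
    def _sub(match: re.Match) -> str:
        name, field = match.group(1), match.group(2)
        return artifacts[name][field]

    return re.sub(r"\$\{art:([^,}]+),([^}]+)\}", _sub, value)


# Hypothetical resolved artifact metadata
artifacts = {"data": {"path": "/artifacts/PretrainBlendsArtifact-default"}}
resolved = resolve_art("${art:data,path}/blend.json", artifacts)
```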
Packager Types#
Control how code is synced to remote:
| Packager | Description | Use Case |
|---|---|---|
| | Minimal sync | Default |
| | Full codebase with exclusions | Ray jobs needing imports |
| | Inline all | Isolated scripts |
CLI Examples#
```shell
# Preview config without executing
uv run nemotron nano3 pretrain -c tiny --dry-run

# Submit to cluster (attached)
uv run nemotron nano3 pretrain -c tiny --run MY-CLUSTER

# Submit to cluster (detached)
uv run nemotron nano3 pretrain -c tiny --batch MY-CLUSTER

# Override training iterations
uv run nemotron nano3 pretrain -c tiny --run MY-CLUSTER train.train_iters=5000

# Stage files for interactive debugging
uv run nemotron nano3 pretrain -c tiny --run MY-CLUSTER --stage

# Data preparation (Ray job)
uv run nemotron nano3 data prep pretrain --run MY-CLUSTER

# RL training (Ray job)
uv run nemotron nano3 rl -c tiny --run MY-CLUSTER
```
Building a Recipe#
Step 1: Create Config Directory#
```text
src/nemotron/recipes/myrecipe/
├── config/
│   ├── default.yaml
│   └── tiny.yaml
├── train.py
└── data_prep.py
```
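For example, a minimal `config/tiny.yaml` might contain just a few fields; the keys below mirror the `job.yaml` example earlier, but the exact schema depends on your recipe:

```yaml
# config/tiny.yaml (illustrative minimal config)
train:
  train_iters: 10
  micro_batch_size: 1
```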
Step 2: Define Training Script#
```python
# train.py
import argparse
from pathlib import Path

from omegaconf import OmegaConf


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", type=Path, required=True)
    args, unknown = parser.parse_known_args()

    # Load config
    cfg = OmegaConf.load(args.config)

    # Apply any remaining overrides
    if unknown:
        overrides = OmegaConf.from_dotlist(unknown)
        cfg = OmegaConf.merge(cfg, overrides)

    # Run training...
    print(f"Training with {cfg.train.train_iters} iterations")


if __name__ == "__main__":
    main()
```
Step 3: Create CLI Command#
```python
# src/nemotron/cli/myrecipe/train.py
from nemotron.kit.cli.recipe import recipe
import typer


@recipe(
    name="myrecipe/train",
    script_path="src/nemotron/recipes/myrecipe/train.py",
    config_dir="src/nemotron/recipes/myrecipe/config",
    default_config="default",
    torchrun=True,
)
def train(ctx: typer.Context) -> None:
    """Run training for my recipe."""
    pass
```
Step 4: Register in CLI#
```python
# src/nemotron/cli/myrecipe/__init__.py
import typer

from .train import train

app = typer.Typer(name="myrecipe", help="My training recipe")
app.command(
    name="train",
    context_settings={"allow_extra_args": True, "ignore_unknown_options": True},
)(train)
```
Step 5: Add to Main CLI#
```python
# src/nemotron/cli/bin/nemotron.py
from nemotron.cli.myrecipe import app as myrecipe_app

main_app.add_typer(myrecipe_app, name="myrecipe")
```
Step 6: Run#
```shell
# Test locally
uv run nemotron myrecipe train -c tiny

# Run on cluster
uv run nemotron myrecipe train -c tiny --run MY-CLUSTER
```
API Reference#
Recipe Decorator#
| Export | Description |
|---|---|
| `recipe` | Decorator for training commands |
| `ConfigBuilder` | Config loading and merging |
| `GlobalContext` | Shared CLI state |
| | Parse dotlist vs passthrough args |
Execution#
| Export | Description |
|---|---|
| | Create NeMo-Run executor from profile |
| | Load profile from `env.toml` |
Further Reading#
Nemotron Kit — Artifacts, configuration, lineage tracking
Execution through NeMo-Run — Execution profiles and env.toml
Data Preparation — Data preparation module
Artifact Lineage — W&B artifact system and lineage tracking
W&B Integration — Credentials and configuration
Nano3 Recipe — Complete training recipe example