Stage 3: Evaluation#
This stage evaluates trained models using NeMo Evaluator, running standard NLP benchmarks to measure model capabilities.
Quick Start#
// Run evaluation on the RL model (default)
$ uv run nemotron nano3 eval --run YOUR-CLUSTER
// Evaluate a specific model stage
$ uv run nemotron nano3 eval --run YOUR-CLUSTER run.model=sft:latest
// Run specific benchmarks only
$ uv run nemotron nano3 eval --run YOUR-CLUSTER -t adlr_mmlu -t hellaswag
// Preview config without executing
$ uv run nemotron nano3 eval --dry-run
Note: The
--run YOUR-CLUSTERflag submits jobs via NeMo-Run. See Execution through NeMo-Run for setup.
CLI Command#
uv run nemotron nano3 eval [options] [overrides...]
Option |
Short |
Description |
|---|---|---|
|
|
Submit to cluster (attached—waits, streams logs) |
|
|
Submit to cluster (detached—submits and exits) |
|
|
Preview config without executing |
|
|
Filter to specific task(s), can be repeated |
|
Force re-squash of container image |
|
|
Override config values |
Task Filtering#
Run specific benchmarks using the -t flag:
# Single task
uv run nemotron nano3 eval --run YOUR-CLUSTER -t adlr_mmlu
# Multiple tasks
uv run nemotron nano3 eval --run YOUR-CLUSTER -t adlr_mmlu -t hellaswag -t arc_challenge
Model Selection#
By default, evaluation runs on the RL stage output (run.model=rl:latest). Override to evaluate other stages:
# Evaluate SFT model
uv run nemotron nano3 eval --run YOUR-CLUSTER run.model=sft:latest
# Evaluate pretrained model
uv run nemotron nano3 eval --run YOUR-CLUSTER run.model=pretrain:latest
# Evaluate specific version
uv run nemotron nano3 eval --run YOUR-CLUSTER run.model=rl:v2
Available Benchmarks#
The default configuration includes these tasks:
Task |
Description |
|---|---|
|
Massive Multitask Language Understanding |
|
Commonsense reasoning |
|
AI2 Reasoning Challenge |
Additional tasks available in NeMo Evaluator:
Task |
Description |
|---|---|
|
ARC Challenge (25-shot) |
|
Winograd Schema Challenge |
|
Open-domain question answering |
|
Truthfulness evaluation |
|
Grade school math |
See NeMo Evaluator for the full list of available tasks.
Configuration#
Evaluation configs define how to deploy your model and which benchmarks to run.
File |
Purpose |
|---|---|
|
Production configuration with vLLM deployment |
Key Configuration Sections#
# Model to evaluate (W&B artifact reference)
run:
model: rl:latest # Options: pretrain, sft, rl
# Deployment (model serving)
deployment:
type: vllm
tensor_parallel_size: 4
data_parallel_size: 1
extra_args: "--max-model-len 32768"
# Tasks to run
evaluation:
tasks:
- name: adlr_mmlu
- name: hellaswag
- name: arc_challenge
# W&B export for results
export:
wandb:
entity: ${run.wandb.entity}
project: ${run.wandb.project}
Override Examples#
# Different tensor parallelism
uv run nemotron nano3 eval --run YOUR-CLUSTER deployment.tensor_parallel_size=8
# Limit samples for quick testing
uv run nemotron nano3 eval --run YOUR-CLUSTER \
evaluation.nemo_evaluator_config.config.params.limit_samples=10
Running with NeMo-Run#
The evaluator uses the same env.toml profiles as training stages, providing a unified experience across the pipeline.
[wandb]
project = "nemotron"
entity = "YOUR-TEAM"
[YOUR-CLUSTER]
executor = "slurm"
account = "YOUR-ACCOUNT"
partition = "batch"
tunnel = "ssh"
host = "cluster.example.com"
user = "myuser"
remote_job_dir = "/lustre/fsw/users/myuser/.nemotron"
See Execution through NeMo-Run for complete configuration options.
W&B Integration#
Results are automatically exported to Weights & Biases when:
You’re logged in locally (
wandb login)[wandb]section is configured inenv.toml
# Verify W&B login
wandb login
# Run evaluation—results auto-export to W&B
uv run nemotron nano3 eval --run YOUR-CLUSTER
# [info] Detected W&B login, setting WANDB_API_KEY
Artifact Lineage#
Evaluation connects to the training pipeline through W&B Artifacts:
%%{init: {'theme': 'base', 'themeVariables': { 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'primaryTextColor': '#333333'}}}%%
flowchart LR
model0["ModelArtifact-pretrain"] --> eval0["eval"]
model1["ModelArtifact-sft"] --> eval1["eval"]
model2["ModelArtifact-rl"] --> eval2["eval"]
eval0 --> results0["Benchmark Results"]
eval1 --> results1["Benchmark Results"]
eval2 --> results2["Benchmark Results"]
results0 --> wandb["W&B Dashboard"]
results1 --> wandb
results2 --> wandb
style model0 fill:#e1f5fe,stroke:#2196f3
style model1 fill:#f3e5f5,stroke:#9c27b0
style model2 fill:#e8f5e9,stroke:#4caf50
style wandb fill:#fff3e0,stroke:#ff9800
Monitoring Jobs#
Check Status#
# Using nemo-evaluator-launcher
nemo-evaluator-launcher status INVOCATION_ID
# Check Slurm queue
ssh cluster squeue -u $USER
Stream Logs#
nemo-evaluator-launcher logs INVOCATION_ID
Troubleshooting#
W&B Credentials Not Detected#
Verify you’re logged in:
wandb loginCheck env.toml has
[wandb]sectionLook for
[info] Detected W&B loginmessage
Model Artifact Not Found#
Verify the artifact exists in W&B:
# Check available artifacts
wandb artifact ls YOUR-ENTITY/YOUR-PROJECT
Evaluation Times Out#
Increase the timeout in your config:
uv run nemotron nano3 eval --run YOUR-CLUSTER \
evaluation.nemo_evaluator_config.config.params.request_timeout=7200
Reference#
Evaluation Framework — Full evaluator documentation
NeMo Evaluator Documentation — Launcher reference
Artifact Lineage — W&B artifact system
Execution through NeMo-Run — Execution profiles
W&B Integration — Credentials and configuration
Recipe Source:
src/nemotron/recipes/nano3/stage3_eval/— Implementation details