Stage 3: Evaluation#

This stage evaluates trained models using NeMo Evaluator, running standard NLP benchmarks to measure model capabilities.


Quick Start#

# Run evaluation on the RL model (default)
uv run nemotron nano3 eval --run YOUR-CLUSTER

# Evaluate a specific model stage
uv run nemotron nano3 eval --run YOUR-CLUSTER run.model=sft:latest

# Run specific benchmarks only
uv run nemotron nano3 eval --run YOUR-CLUSTER -t adlr_mmlu -t hellaswag

# Preview config without executing
uv run nemotron nano3 eval --dry-run

Note: The --run YOUR-CLUSTER flag submits jobs via NeMo-Run. See Execution through NeMo-Run for setup.


CLI Command#

uv run nemotron nano3 eval [options] [overrides...]

| Option | Short | Description |
|---|---|---|
| --run <profile> | -r | Submit to cluster (attached: waits and streams logs) |
| --batch <profile> | -b | Submit to cluster (detached: submits and exits) |
| --dry-run | -d | Preview config without executing |
| --task <name> | -t | Filter to specific task(s); can be repeated |
| --force-squash | | Force re-squash of the container image |
| key=value | | Override config values |
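
Flags compose with config overrides. For example, a detached run of a single benchmark at higher parallelism (illustrative values):

# Submit detached, run only gsm8k, and raise tensor parallelism
uv run nemotron nano3 eval --batch YOUR-CLUSTER -t gsm8k deployment.tensor_parallel_size=8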

Task Filtering#

Run specific benchmarks using the -t flag:

# Single task
uv run nemotron nano3 eval --run YOUR-CLUSTER -t adlr_mmlu

# Multiple tasks
uv run nemotron nano3 eval --run YOUR-CLUSTER -t adlr_mmlu -t hellaswag -t arc_challenge

Model Selection#

By default, evaluation runs on the RL stage output (run.model=rl:latest). Override run.model to evaluate other stages:

# Evaluate SFT model
uv run nemotron nano3 eval --run YOUR-CLUSTER run.model=sft:latest

# Evaluate pretrained model
uv run nemotron nano3 eval --run YOUR-CLUSTER run.model=pretrain:latest

# Evaluate specific version
uv run nemotron nano3 eval --run YOUR-CLUSTER run.model=rl:v2

Available Benchmarks#

The default configuration includes these tasks:

| Task | Description |
|---|---|
| adlr_mmlu | Massive Multitask Language Understanding |
| hellaswag | Commonsense reasoning |
| arc_challenge | AI2 Reasoning Challenge |

Additional tasks available in NeMo Evaluator:

| Task | Description |
|---|---|
| adlr_arc_challenge_llama_25_shot | ARC Challenge (25-shot) |
| adlr_winogrande_5_shot | WinoGrande Winograd-style commonsense reasoning (5-shot) |
| openbookqa | Open-book elementary science question answering |
| truthfulqa | Truthfulness evaluation |
| gsm8k | Grade school math word problems |

See NeMo Evaluator for the full list of available tasks.
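
As a sketch, enabling one of these additional tasks means adding it to the evaluation.tasks list, assuming it follows the same schema shown under Key Configuration Sections below:

# config/default.yaml (excerpt), assuming additional tasks use the same name schema
evaluation:
  tasks:
    - name: adlr_mmlu
    - name: hellaswag
    - name: arc_challenge
    - name: gsm8k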


Configuration#

Evaluation configs define how to deploy your model and which benchmarks to run.

| File | Purpose |
|---|---|
| config/default.yaml | Production configuration with vLLM deployment |
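
To inspect the resolved configuration before submitting, preview it with the dry-run flag from Quick Start:

# Preview the config without executing anything
uv run nemotron nano3 eval --dry-run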

Key Configuration Sections#

# Model to evaluate (W&B artifact reference)
run:
  model: rl:latest  # Options: pretrain, sft, rl

# Deployment (model serving)
deployment:
  type: vllm
  tensor_parallel_size: 4
  data_parallel_size: 1
  extra_args: "--max-model-len 32768"

# Tasks to run
evaluation:
  tasks:
    - name: adlr_mmlu
    - name: hellaswag
    - name: arc_challenge

# W&B export for results
export:
  wandb:
    entity: ${run.wandb.entity}
    project: ${run.wandb.project}

Override Examples#

# Different tensor parallelism
uv run nemotron nano3 eval --run YOUR-CLUSTER deployment.tensor_parallel_size=8

# Limit samples for quick testing
uv run nemotron nano3 eval --run YOUR-CLUSTER \
    evaluation.nemo_evaluator_config.config.params.limit_samples=10

Running with NeMo-Run#

The evaluator uses the same env.toml profiles as the training stages, so a single profile configures every stage of the pipeline.

[wandb]
project = "nemotron"
entity = "YOUR-TEAM"

[YOUR-CLUSTER]
executor = "slurm"
account = "YOUR-ACCOUNT"
partition = "batch"
tunnel = "ssh"
host = "cluster.example.com"
user = "myuser"
remote_job_dir = "/lustre/fsw/users/myuser/.nemotron"
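
The bracketed profile name is what --run and --batch take. For example, with the profile above:

# Select the [YOUR-CLUSTER] profile defined in env.toml
uv run nemotron nano3 eval --run YOUR-CLUSTER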

See Execution through NeMo-Run for complete configuration options.

W&B Integration#

Results are automatically exported to Weights & Biases when:

  1. You’re logged in locally (wandb login)

  2. [wandb] section is configured in env.toml

# Verify W&B login
wandb login

# Run evaluation—results auto-export to W&B
uv run nemotron nano3 eval --run YOUR-CLUSTER
# [info] Detected W&B login, setting WANDB_API_KEY

Artifact Lineage#

Evaluation connects to the training pipeline through W&B Artifacts:

%%{init: {'theme': 'base', 'themeVariables': { 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'primaryTextColor': '#333333'}}}%%
flowchart LR
    model0["ModelArtifact-pretrain"] --> eval0["eval"]
    model1["ModelArtifact-sft"] --> eval1["eval"]
    model2["ModelArtifact-rl"] --> eval2["eval"]

    eval0 --> results0["Benchmark Results"]
    eval1 --> results1["Benchmark Results"]
    eval2 --> results2["Benchmark Results"]

    results0 --> wandb["W&B Dashboard"]
    results1 --> wandb
    results2 --> wandb

    style model0 fill:#e1f5fe,stroke:#2196f3
    style model1 fill:#f3e5f5,stroke:#9c27b0
    style model2 fill:#e8f5e9,stroke:#4caf50
    style wandb fill:#fff3e0,stroke:#ff9800
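
In practice, the run.model value selects which artifact in this graph feeds the eval job, so each stage can be benchmarked with the same command:

# One eval per stage; all results export to the same W&B project
uv run nemotron nano3 eval --run YOUR-CLUSTER run.model=pretrain:latest
uv run nemotron nano3 eval --run YOUR-CLUSTER run.model=sft:latest
uv run nemotron nano3 eval --run YOUR-CLUSTER run.model=rl:latest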

Monitoring Jobs#

Check Status#

# Using nemo-evaluator-launcher
nemo-evaluator-launcher status INVOCATION_ID

# Check Slurm queue
ssh cluster squeue -u $USER

Stream Logs#

nemo-evaluator-launcher logs INVOCATION_ID

Troubleshooting#

W&B Credentials Not Detected#

  1. Verify you’re logged in: wandb login

  2. Check env.toml has [wandb] section

  3. Look for the [info] Detected W&B login message when the job launches

Model Artifact Not Found#

Verify the artifact exists in W&B:

# Check available artifacts
wandb artifact ls YOUR-ENTITY/YOUR-PROJECT
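
Once you find the artifact, point run.model at an existing alias or version, as in Model Selection:

# Use an explicit version instead of :latest
uv run nemotron nano3 eval --run YOUR-CLUSTER run.model=rl:v2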

Evaluation Times Out#

Increase the timeout in your config:

uv run nemotron nano3 eval --run YOUR-CLUSTER \
    evaluation.nemo_evaluator_config.config.params.request_timeout=7200

Reference#