Evaluation Framework#
The Nemotron evaluation framework uses NeMo Evaluator to benchmark trained models on standard NLP tasks.
```console
$ uv run nemotron evaluate -c nemotron-3-nano-nemo-ray --run MY-CLUSTER

Compiled Configuration
╭──────────────────────────────────── run ─────────────────────────────────────╮
│ wandb:                                                                       │
│   project: nemotron                                                          │
│   entity: my-team                                                            │
╰──────────────────────────────────────────────────────────────────────────────╯

[info] Detected W&B login, setting WANDB_API_KEY
Starting evaluation...

✓ Evaluation submitted: 480d3c89bfe4a55c
  Check status: nemo-evaluator-launcher status 480d3c89bfe4a55c
```
Overview#
The evaluation framework enables:

- **Benchmark Testing** — Run standard benchmarks (MMLU, ARC, HellaSwag, etc.) on your models
- **W&B Integration** — Auto-export results to Weights & Biases for tracking
- **Slurm Execution** — Submit evaluation jobs to HPC clusters
- **Auto-Squash** — Automatically converts Docker images to squashfs for Slurm clusters
- **Credential Auto-Propagation** — Automatically passes W&B tokens to remote jobs
The evaluator uses the same env.toml execution profiles as training recipes, providing a unified experience across all stages.
Quick Start#
```shell
# Run evaluation on a cluster
uv run nemotron evaluate -c nemotron-3-nano-nemo-ray --run MY-CLUSTER

# Preview config without executing
uv run nemotron evaluate -c nemotron-3-nano-nemo-ray --dry-run

# Filter to specific tasks
uv run nemotron evaluate -c nemotron-3-nano-nemo-ray --run MY-CLUSTER -t adlr_mmlu

# Override checkpoint path
uv run nemotron evaluate -c nemotron-3-nano-nemo-ray --run MY-CLUSTER \
  deployment.checkpoint_path=/path/to/your/checkpoint
```
CLI Options#
| Option | Short | Description |
|---|---|---|
| | `-c` | Config name or path |
| `--run` | | Submit to cluster (attached, streams logs) |
| | | Submit to cluster (detached, exits immediately) |
| `--dry-run` | | Preview config without executing |
| | `-t` | Filter to specific task(s); can be repeated |
| `--force-squash` | | Force re-squash even if cached |
Task Filtering#
Run specific benchmarks using the `-t` flag:

```shell
# Single task
uv run nemotron evaluate -c config --run MY-CLUSTER -t adlr_mmlu

# Multiple tasks
uv run nemotron evaluate -c config --run MY-CLUSTER -t adlr_mmlu -t hellaswag
```
Available Tasks#
Common evaluation tasks include:
| Task | Description |
|---|---|
| `adlr_mmlu` | Massive Multitask Language Understanding |
| | AI2 Reasoning Challenge |
| | Winograd Schema Challenge |
| `hellaswag` | Commonsense reasoning |
| `openbookqa` | Open-domain question answering |
Execution Profiles#
The evaluator uses the same env.toml profiles as training recipes. See Execution through NeMo-Run for full documentation.
Basic Profile#
```toml
# env.toml
[wandb]
project = "nemotron"
entity = "my-team"

[MY-CLUSTER]
executor = "slurm"
account = "my-account"
partition = "batch"
tunnel = "ssh"
host = "cluster.example.com"
user = "myuser"
remote_job_dir = "/lustre/fsw/users/myuser/.nemotron"
```
Profile with Auto-Squash#
Slurm clusters use Pyxis with enroot for container execution. While you can use Docker references directly, pre-squashed .sqsh files significantly speed up job startup by avoiding container pulls on each run.
With SSH tunnel settings, the CLI can automatically create squash files from Docker references:
```toml
[MY-CLUSTER]
executor = "slurm"
account = "my-account"
partition = "batch"

# SSH settings (enables auto-squash)
tunnel = "ssh"
host = "cluster.example.com"
user = "myuser"
remote_job_dir = "/lustre/fsw/users/myuser/.nemotron"

# Container settings - use Docker ref, auto-squashed on first run
container_image = "nvcr.io/nvidia/nemo:25.01"
```
When you run with `--run MY-CLUSTER`, the CLI will:

1. Detect that `deployment.image` is a Docker reference (not a `.sqsh` path)
2. SSH to the cluster and run `enroot import` on a compute node
3. Cache the `.sqsh` file in `${remote_job_dir}/containers/` for reuse
4. Update the config to use the squashed path
Subsequent runs reuse the cached squash file, eliminating container pull overhead.
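The cache-then-import flow can be sketched roughly as follows. The cache path and filename-mangling scheme here are illustrative assumptions, not the CLI's exact implementation:

```shell
# Illustrative sketch of the squash-cache check (assumed naming scheme,
# not the CLI's verbatim logic).
IMAGE="nvcr.io/nvidia/nemo:25.01"
CACHE_DIR="${REMOTE_JOB_DIR:-/tmp}/containers"

# Derive a cache filename from the Docker reference
SQSH="${CACHE_DIR}/$(echo "$IMAGE" | tr '/:' '__').sqsh"

if [ ! -f "$SQSH" ]; then
    mkdir -p "$CACHE_DIR"
    # On the cluster, the import runs on a compute node, roughly:
    #   srun enroot import -o "$SQSH" "docker://${IMAGE}"
    :
fi
echo "$SQSH"
```

On a cache hit the `enroot import` step is skipped entirely, which is what makes subsequent submissions start quickly.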
Configuration#
Evaluation configs define how to deploy your model and which benchmarks to run.
Example Config#
```yaml
# Execution (Slurm settings)
execution:
  type: slurm
  hostname: ${run.env.host}
  account: ${run.env.account}
  partition: ${run.env.partition}
  num_nodes: 1
  gres: gpu:8

  # Auto-export to W&B after evaluation
  auto_export:
    enabled: true
    destinations:
      - wandb

# Deployment (Model serving)
deployment:
  type: generic
  image: ${run.env.container}  # Docker image or .sqsh path
  checkpoint_path: /path/to/checkpoint
  command: >-
    python deploy_ray_inframework.py
    --megatron_checkpoint /checkpoint/
    --num_gpus 8

# Evaluation (Tasks to run)
evaluation:
  tasks:
    - name: adlr_mmlu
    - name: hellaswag
    - name: openbookqa

# Export (W&B settings)
export:
  wandb:
    entity: ${run.wandb.entity}
    project: ${run.wandb.project}
```
Key Sections#
| Section | Purpose |
|---|---|
| `run.env` | Environment settings from env.toml (cluster, container) |
| `run.wandb` | W&B settings from env.toml |
| `execution` | Slurm executor configuration (nodes, GPUs, account) |
| `deployment` | Model deployment (container, checkpoint, command) |
| `evaluation` | Tasks and evaluation parameters |
| `export` | Result export destinations (W&B) |
Auto-Squash#
For Slurm clusters that require squashfs containers, the evaluator automatically converts Docker images to .sqsh files—the same behavior as training recipes.
How It Works#
1. **Detection** — CLI checks if `deployment.image` is a Docker reference (not already `.sqsh`)
2. **SSH Connection** — Connects to the cluster via SSH tunnel (using `host` and `user` from env.toml)
3. **Squash** — Runs `enroot import` on a compute node to create the `.sqsh` file
4. **Cache** — Stores the squash file in `${remote_job_dir}/containers/` for reuse
5. **Config Update** — Rewrites `deployment.image` to use the squashed path
Usage#
```shell
# Auto-squash happens automatically for Docker refs
uv run nemotron evaluate -c config --run MY-CLUSTER

# Force re-squash (ignores cache)
uv run nemotron evaluate -c config --run MY-CLUSTER --force-squash

# Already-squashed paths skip the step
# (if deployment.image ends in .sqsh, no squashing needed)
```
Requirements#
Auto-squash requires these settings in your env.toml profile:
| Field | Required | Description |
|---|---|---|
| `executor` | Yes | Must be `"slurm"` |
| `tunnel` | Yes | Must be `"ssh"` |
| `host` | Yes | SSH hostname (e.g., `cluster.example.com`) |
| `user` | No | SSH username (defaults to current user) |
| `remote_job_dir` | Yes | Remote directory for job files and squash cache |
W&B Integration#
The evaluator automatically propagates W&B credentials when you’re logged in locally—the same behavior as training recipes.
Setup#
1. **Log in to W&B locally:**

   ```shell
   wandb login
   ```

2. **Configure env.toml** (same `[wandb]` section used by all recipes):

   ```toml
   [wandb]
   project = "nemotron"
   entity = "my-team"
   ```

3. **Run evaluation** — credentials are passed automatically:

   ```shell
   uv run nemotron evaluate -c config --run MY-CLUSTER
   # [info] Detected W&B login, setting WANDB_API_KEY
   ```
What Gets Propagated#
| Variable | Source | Description |
|---|---|---|
| `WANDB_API_KEY` | Local `wandb login` | Auto-detected from your local login |
| `WANDB_PROJECT` | env.toml `[wandb]` section | Project name for result tracking |
| `WANDB_ENTITY` | env.toml `[wandb]` section | Team/user entity |
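As a rough sketch of how such detection can work (this is an assumption about the mechanism, not the CLI's verbatim logic): `wandb login` stores the API key in `~/.netrc` under the `api.wandb.ai` machine entry, so it can be read back and exported to the job environment:

```shell
# Hypothetical sketch: recover the W&B API key from where `wandb login`
# writes it (~/.netrc). The real CLI's detection may differ.
detect_wandb_key() {  # usage: detect_wandb_key [netrc_path]
    awk '$1 == "machine" && $2 == "api.wandb.ai" { found = 1 }
         found && $1 == "password" { print $2; exit }' \
        "${1:-$HOME/.netrc}" 2>/dev/null
}

key="$(detect_wandb_key)"
if [ -n "$key" ]; then
    echo "[info] Detected W&B login, setting WANDB_API_KEY"
    export WANDB_API_KEY="$key"
fi
```

If no key is found, the job simply runs without W&B export rather than failing.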
Monitoring Jobs#
Check Status#
```shell
# Using nemo-evaluator-launcher directly
nemo-evaluator-launcher status INVOCATION_ID

# Check Slurm queue
ssh cluster squeue -u $USER
```
Stream Logs#
```shell
nemo-evaluator-launcher logs INVOCATION_ID
```
Cancel Jobs#
```shell
# Cancel via Slurm
ssh cluster scancel JOB_ID

# Or multiple jobs
ssh cluster "scancel JOB_ID1 JOB_ID2 JOB_ID3"
```
Creating Custom Configs#
Step 1: Create Config File#
```yaml
# src/nemotron/recipes/evaluator/config/my-model.yaml
defaults:
  - execution: slurm/default
  - deployment: generic
  - _self_

run:
  env:
    container: nvcr.io/nvidia/nemo:25.01  # Docker ref (auto-squashed)
    # OR: container: /path/to/container.sqsh  # Pre-squashed
  wandb:
    entity: null  # Populated from env.toml
    project: null

execution:
  type: slurm
  hostname: ${run.env.host}
  account: ${run.env.account}
  num_nodes: 1
  gres: gpu:8
  auto_export:
    enabled: true
    destinations:
      - wandb

deployment:
  type: generic
  image: ${run.env.container}
  checkpoint_path: /path/to/your/model/checkpoint
  command: >-
    python deploy_script.py --checkpoint /checkpoint/

evaluation:
  tasks:
    - name: adlr_mmlu
    - name: hellaswag

export:
  wandb:
    entity: ${run.wandb.entity}
    project: ${run.wandb.project}
```
Step 2: Run Evaluation#
```shell
uv run nemotron evaluate -c my-model --run MY-CLUSTER
```
Troubleshooting#
“Missing key type” Error#
Ensure your config has all required Slurm fields:
```yaml
execution:
  type: slurm          # Required
  ntasks_per_node: 1   # Required
  gres: gpu:8          # Required
```
W&B Credentials Not Detected#
1. Verify you’re logged in: `wandb login`
2. Check that env.toml has a `[wandb]` section
3. Look for the `[info] Detected W&B login` message
Auto-Squash Not Working#
1. Verify `tunnel = "ssh"` is set in your env.toml profile
2. Check that `host` and `remote_job_dir` are set
3. Ensure `nemo-run` is installed: `pip install nemo-run`
Jobs Stuck in PENDING#
Check queue status:
```shell
ssh cluster "squeue -p batch | head"
```
Common reasons:
- `(Priority)` — Waiting for resources
- `(Resources)` — Insufficient available nodes
- `(QOSMaxJobsPerUserLimit)` — User job limit reached
Further Reading#
- **Execution through NeMo-Run** — Execution profiles and env.toml
- **W&B Integration** — Credentials and artifact tracking
- **NeMo Evaluator Documentation** — Launcher reference