Skip to content

Other benchmarks

More details are coming soon!

Supported benchmarks

arena-hard

  • Benchmark is defined in nemo_skills/dataset/arena-hard/__init__.py
  • Original benchmark source is here.
  • Uses gpt-4.1 as the default judge model for evaluation.
  • Uses gpt-4-0314 model's responses as reference answers (baseline answers) for comparison.

Data Preparation

First, prepare the dataset by running the ns prepare_data command.

ns prepare_data arena-hard

Running the Evaluation

Once the data is prepared, you can run the evaluation. Replace <...> placeholders with your cluster and directory paths.

export OPENAI_API_KEY=<>
ns eval \
    --cluster=<CLUSTER_NAME> \
    --model=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
    --server_type=vllm \
    --server_gpus=8 \
    --benchmarks=arena-hard \
    --output_dir=<OUTPUT_DIR> \
    ++parse_reasoning=True \
    ++inference.temperature=0.6 \
    ++inference.top_p=0.95 \
    ++inference.tokens_to_generate=32768

Verifying Results

After all jobs are complete, you can check the results in <OUTPUT_DIR>/eval-results/arena-hard/metrics.json.

------------------------------------------- arena-hard -------------------------------------------
evaluation_mode | num_entries | score  | 95_CI         | invalid_scores | avg_tokens | gen_seconds
pass@1          | 500         | 94.82% | (-0.67, 0.69) | 0              | 3878       | 230

arena-hard-v2

  • Benchmark is defined in nemo_skills/dataset/arena-hard-v2/__init__.py
  • Original benchmark source is here.
  • Uses gpt-4.1 as the default judge model for evaluation.
  • Uses o3-mini-2025-01-31 model's responses as reference answers (baseline answers) for comparison.

Data Preparation

First, prepare the dataset by running the ns prepare_data command.

ns prepare_data arena-hard-v2

Running the Evaluation

Once the data is prepared, you can run the evaluation. Here is an example command to reproduce Nemotron-3-Nano reported scores.

export OPENAI_API_KEY=<>
ns eval \
    --cluster=<CLUSTER_NAME> \
    --model=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
    --server_type=vllm \
    --server_gpus=8 \
    --benchmarks=arena-hard-v2 \
    --output_dir=<OUTPUT_DIR> \
    ++parse_reasoning=True \
    ++inference.temperature=1.0 \
    ++inference.top_p=1.0 \
    ++inference.tokens_to_generate=65536 \
    --extra_judge_args=" \
        ++inference.tokens_to_generate=16000 \
        ++inference.temperature=0 \
        ++max_concurrent_requests=16"

Verifying Results

After all jobs are complete, you can check the results in <OUTPUT_DIR>/eval-results/arena-hard-v2/metrics.json.

{
  "arena-hard-v2": {
    "pass@1": {
      "num_entries": 750,
      "score": 70.25,
      "95_CI": [
        -1.58,
        1.15
      ],
      "invalid_scores": 1,
      "avg_tokens": 4197,
      "gen_seconds": 1371,
      "category_hard_prompt": {
        "num_entries": 500,
        "score": 74.02,
        "95_CI": [
          -1.88,
          1.66
        ],
        "invalid_scores": 0
      },
      "category_creative_writing": {
        "num_entries": 250,
        "score": 63.56,
        "95_CI": [
          -2.72,
          2.8
        ],
        "invalid_scores": 1
      }
    }
  }
}

HotpotQA

HotpotQA is a multi-hop question-answering benchmark that requires reasoning over multiple Wikipedia paragraphs. Two variants are supported:

Variant Slug Description
Distractor hotpotqa Model receives the question plus 10 context paragraphs (2 gold + 8 distractors) and must return the answer and identify supporting-fact sentences.
Closed-book hotpotqa_closedbook Same questions, no context provided — tests the model's parametric knowledge.

Data Preparation

Prepare the distractor validation set (single source of truth), then the closed-book variant (copies from it):

ns prepare_data hotpotqa
ns prepare_data hotpotqa_closedbook

You can also run ns prepare_data hotpotqa_closedbook alone; it will run the shared preparation for hotpotqa first if that data is not yet present, then copy it.

Running the Evaluation

Distractor evaluation (with context and supporting-fact scoring). Use hotpotqa:4 for 4 seeds (produces the example results below):

ns eval \
    --cluster=<CLUSTER_NAME> \
    --model=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
    --server_type=vllm \
    --server_gpus=8 \
    --benchmarks=hotpotqa:4 \
    --output_dir=<OUTPUT_DIR> \
    --server_args="--max-model-len 32768" \
    ++inference.temperature=1.0 \
    ++inference.top_p=1.0 \
    ++inference.tokens_to_generate=16384

Closed-book evaluation (no context). Use hotpotqa_closedbook:4 for 4 seeds (produces the example results below):

ns eval \
    --cluster=<CLUSTER_NAME> \
    --model=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
    --server_type=vllm \
    --server_gpus=8 \
    --benchmarks=hotpotqa_closedbook:4 \
    --output_dir=<OUTPUT_DIR> \
    --server_args="--max-model-len 32768" \
    ++inference.temperature=1.0 \
    ++inference.top_p=1.0 \
    ++inference.tokens_to_generate=16384

Verifying Results

After all jobs are complete, check the results in <OUTPUT_DIR>/eval-results/hotpotqa/metrics.json. The results table is printed to stdout and captured in the summarize-results srun log.

Example distractor results (Nemotron-3-Nano, hotpotqa:4):

----------------------------------------------------------------------------- hotpotqa -----------------------------------------------------------------------------
evaluation_mode           | num_entries | answer_em    | answer_f1    | sp_em        | sp_f1        | joint_em     | joint_f1     | is_correct   | is_correct_strict
pass@1[avg-of-4]          | 7405        | 62.92 ± 0.25 | 78.15 ± 0.16 | 21.52 ± 0.12 | 60.75 ± 0.21 | 15.45 ± 0.14 | 49.52 ± 0.15 | 73.35 ± 0.22 | 71.68 ± 0.26
pass@4                    | 7405        | 70.28        | 83.86        | 35.29        | 74.41        | 25.75        | 62.69        | 79.23        | 77.92
filtered_pass@1[avg-of-4] | 6057        | 67.71        | 79.30        | 22.09        | 60.95        | 17.01        | 50.56        | 78.79        | 77.12
filtered_pass@4           | 6057        | 74.95        | 85.10        | 35.86        | 74.55        | 27.92        | 63.88        | 84.27        | 83.11

Example closed-book results (Nemotron-3-Nano, hotpotqa_closedbook:4):

----------------------------------------- hotpotqa_closedbook ------------------------------------------
evaluation_mode           | num_entries | answer_em    | answer_f1    | is_correct   | is_correct_strict
pass@1[avg-of-4]          | 7405        | 29.05 ± 0.15 | 39.35 ± 0.18 | 33.14 ± 0.32 | 32.36 ± 0.28
pass@4                    | 7405        | 37.91        | 50.40        | 42.50        | 41.30
filtered_pass@1[avg-of-4] | 6057        | 31.85        | 39.57        | 36.48        | 35.60
filtered_pass@4           | 6057        | 41.59        | 51.01        | 46.77        | 45.44

The closed-book variant reports answer-level metrics only (no supporting-fact or joint metrics).

AA-Omniscience

This is a benchmark developed by AA to measure hallucinations in LLMs and penalize confidently-false answers.

Eval Results:

Model Accuracy Omni-Index Hallucination Rate
Qwen3-8B (Reported) 12.73% -66 90.36%
Qwen3-8B (Measured) 15.17% -64.83 94.30%

Notes:

  • Note that this benchmark can be quite sensitive to temperature and other sampling parameters, so make sure your settings align well with downstream conditions.
  • Also note that there still may be some variance between the public set and the full dataset; however, this set may be used as a way to compare hallucination rates between different checkpoints/models.

Configuration: Qwen3-8B with default judge (gemini-2.5-flash-preview-09-2025)

  • Make sure to set DEFAULT_REASONING_EFFORT_HIGH_THINKING_BUDGET=24576 in your environment variables and nemo-skills config to max judge reasoning when using reasoning_effort='high'.
from nemo_skills.pipeline.cli import wrap_arguments, eval

eval(
    ctx=wrap_arguments(
        f"++inference.temperature=0.6 "
        f"++inference.top_p=1.0 "
        f"++inference.top_k=-1 "
        f"++inference.tokens_to_generate=131072 "
        f"++inference.reasoning_effort='high' "
    ),
    cluster="slurm",
    expname="aa-omniscience-eval",
    model="Qwen/Qwen3-8B",
    server_gpus=8,
    server_nodes=1,
    server_type="vllm",
    server_args="--async-scheduling",
    benchmarks="omniscience",
    output_dir="/workspace/experiments/aa-omniscience-eval",
    data_dir="/workspace/data_dir",
    extra_judge_args="++inference.reasoning_effort='high' ++inference.temperature=1.0 ++inference.top_p=0.95 ++inference.top_k=64 " # set max reasoning effort and default temp for judge
)

Configuration: Qwen3-8B with custom judge (gpt-oss-120b)

from nemo_skills.pipeline.cli import wrap_arguments, eval

eval(
    ctx=wrap_arguments(
        f"++inference.temperature=0.6 "
        f"++inference.top_p=1.0 "
        f"++inference.top_k=-1 "
        f"++inference.tokens_to_generate=131072 "
        f"++inference.reasoning_effort='high' "
    ),
    cluster="slurm",
    expname="aa-omniscience-eval",
    model="Qwen/Qwen3-8B",
    server_gpus=8,
    server_nodes=1,
    server_type="vllm",
    server_args="--async-scheduling",
    judge_model="openai/gpt-oss-120b",
    judge_server_type="vllm",
    judge_server_gpus=8,
    judge_server_args="--async-scheduling  --reasoning-parser GptOss",
    benchmarks="omniscience",
    output_dir="/workspace/experiments/aa-omniscience-eval",
    data_dir="/workspace/data_dir"
)