
# Scientific Knowledge

NeMo-Skills can be used to evaluate an LLM on a variety of STEM datasets.

## Dataset Overview

| Dataset | Questions | Types | Domain | Images? | NS default |
|---|---|---|---|---|---|
| HLE | 2,500 | Open-ended, MCQ | Engineering, Physics, Chemistry, Biology, etc. | Yes | text only |
| HLE-Verified | 2,500 | Open-ended, MCQ | Engineering, Physics, Chemistry, Biology, etc. | Yes | gold+revision text only |
| GPQA | 448 (main), 198 (diamond), 546 (extended) | MCQ (4) | Physics, Chemistry, Biology | No | diamond |
| SuperGPQA | 26,529 | MCQ (≤ 10) | Science, Engineering, Humanities, etc. | No | test |
| MMLU-Pro | 12,032 | MCQ (≤ 10) | Multiple subjects | No | test |
| SciCode | 80 (338 subtasks) | Code generation | Scientific computing | No | test+val |
| FrontierScience | 100 | Short-answer | Physics, Chemistry, Biology | No | all |
| Physics | 1,000 (EN), 1,000 (ZH) | Open-ended | Physics | No | EN |
| UGPhysics | 5,520 (EN), 5,520 (ZH) | Open-ended, MCQ | Physics | No | EN |
| MMLU | 14,042 | MCQ (4) | Multiple subjects | No | test |
| MMLU-Redux | 5,385 | MCQ (4) | Multiple subjects | No | test |
| SimpleQA | 4,326 (test), 1,000 (verified) | Open-ended | Factuality, parametric knowledge | No | verified |
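In the examples below, benchmarks are selected with a `"name:repeats"` spec (e.g. `"gpqa:4"` runs GPQA with 4 sampled generations per question). A minimal sketch of this convention, using a hypothetical helper (not part of NeMo-Skills), under the assumption that multiple benchmarks are comma-separated:

```python
def parse_benchmark_spec(spec: str) -> tuple[str, int]:
    """Split a 'name:repeats' spec; repeats defaults to 1 when omitted."""
    name, sep, repeats = spec.partition(":")
    return name, int(repeats) if sep else 1

# e.g. "gpqa:4,mmlu" -> [("gpqa", 4), ("mmlu", 1)]
specs = [parse_benchmark_spec(s) for s in "gpqa:4,mmlu".split(",")]
```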

## Evaluate NVIDIA-Nemotron-3-Nano on an MCQ dataset

```python
from nemo_skills.pipeline.cli import wrap_arguments, eval

cluster = "slurm"
eval(
    ctx=wrap_arguments(
        "++inference.temperature=1.0 ++inference.top_p=1.0 "
        "++inference.tokens_to_generate=131072 "
        "++chat_template_kwargs.enable_thinking=true ++parse_reasoning=True "
    ),
    cluster=cluster,
    server_type="vllm",
    server_gpus=1,
    server_args="--no-enable-prefix-caching --mamba_ssm_cache_dtype float32",
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    benchmarks="gpqa:4",
    output_dir="/workspace/Nano_V3_evals",
)
```
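With `"gpqa:4"`, each question is answered four times at temperature 1.0, and reported metrics typically aggregate over the runs (e.g. average accuracy or majority voting). A minimal sketch of majority voting over repeated MCQ answers, using a hypothetical helper that is not part of NeMo-Skills:

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer across repeated generations;
    ties are broken by first occurrence in generation order."""
    counts = Counter(answers)
    best = max(counts.values())
    for a in answers:  # preserve generation order on ties
        if counts[a] == best:
            return a

print(majority_vote(["A", "C", "A", "B"]))  # prints A
```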

## Evaluate NVIDIA-Nemotron-3-Nano using LLM-as-a-judge

```python
from nemo_skills.pipeline.cli import wrap_arguments, eval

cluster = "slurm"
eval(
    ctx=wrap_arguments(
        "++inference.temperature=1.0 ++inference.top_p=1.0 "
        "++inference.tokens_to_generate=131072 "
        "++chat_template_kwargs.enable_thinking=true ++parse_reasoning=True "
    ),
    cluster=cluster,
    server_type="vllm",
    server_gpus=1,
    server_args="--no-enable-prefix-caching --mamba_ssm_cache_dtype float32",
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    benchmarks="hle:4",
    output_dir="/workspace/Nano_V3_evals",
    judge_model="openai/gpt-oss-120b",
    judge_server_type="vllm",
    judge_server_gpus=8,
    judge_server_args="--async-scheduling",
    extra_judge_args="++chat_template_kwargs.reasoning_effort=high ++inference.temperature=1.0 ++inference.top_p=1.0 ++inference.tokens_to_generate=120000",
)
```
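Because HLE answers are largely open-ended, exact-match scoring is unreliable, so the judge model compares each answer against the reference and emits a verdict that is parsed into a score. The sketch below illustrates the idea with a hypothetical `Judgement: yes/no` format; the actual judge prompt and output format are defined by NeMo-Skills:

```python
import re

def extract_verdict(judge_output: str) -> bool:
    """Parse a final 'Judgement: yes/no' line from judge output
    (hypothetical format, for illustration only)."""
    matches = re.findall(r"judge?ment:\s*(yes|no)", judge_output, flags=re.IGNORECASE)
    # Use the last verdict so earlier reasoning text cannot override it.
    return bool(matches) and matches[-1].lower() == "yes"
```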

## Evaluate NVIDIA-Nemotron-3-Nano on an MCQ dataset using tools

```python
from nemo_skills.pipeline.cli import wrap_arguments, eval

cluster = "slurm"
eval(
    ctx=wrap_arguments(
        "++inference.temperature=0.6 ++inference.top_p=0.95 "
        "++inference.tokens_to_generate=131072 "
        "++chat_template_kwargs.enable_thinking=true ++parse_reasoning=True "
        "++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool] "
    ),
    cluster=cluster,
    server_type="vllm",
    server_gpus=1,
    server_args="--no-enable-prefix-caching --mamba_ssm_cache_dtype float32 --enable-auto-tool-choice --tool-call-parser qwen3_coder",
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    benchmarks="gpqa:4",
    output_dir="/workspace/Nano_V3_evals",
    with_sandbox=True,
)
```
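With `with_sandbox=True`, the model can call the Python tool to run code it writes while reasoning, and the tool returns the execution output to the model. The conceptual loop can be sketched as below; this is an illustration only (the real `PythonTool` executes code inside an isolated sandbox, not in the host process):

```python
import contextlib
import io

def run_python_tool(code: str) -> str:
    """Execute a model-emitted code snippet and return its captured stdout.
    Illustration only: NeMo-Skills runs such code in an isolated sandbox."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})  # fresh globals for each call
    return buf.getvalue()

print(run_python_tool("print(6 * 7)"))  # prints 42
```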