# Scientific Knowledge

NeMo-Skills can be used to evaluate an LLM on various STEM datasets.
## Dataset Overview
| Dataset | Questions | Types | Domain | Images? | NS default |
|---|---|---|---|---|---|
| HLE | 2,500 | Open-ended, MCQ | Engineering, Physics, Chemistry, Bio, etc. | Yes | text only |
| HLE-Verified | 2,500 | Open-ended, MCQ | Engineering, Physics, Chemistry, Bio, etc. | Yes | gold+revision text only |
| GPQA | 448 (main), 198 (diamond), 546 (ext.) | MCQ (4) | Physics, Chemistry, Biology | No | diamond |
| SuperGPQA | 26,529 | MCQ (≤ 10) | Science, Eng, Humanities, etc. | No | test |
| MMLU-Pro | 12,032 | MCQ (≤ 10) | Multiple subjects | No | test |
| SciCode | 80 (338 subtasks) | Code gen | Scientific computing | No | test+val |
| FrontierScience | 100 | Short-answer | Physics, Chemistry, Biology | No | all |
| Physics | 1,000 (EN), 1,000 (ZH) | Open-ended | Physics | No | EN |
| UGPhysics | 5,520 (EN), 5,520 (ZH) | Open-ended, MCQ | Physics | No | EN |
| MMLU | 14,042 | MCQ (4) | Multiple subjects | No | test |
| MMLU-Redux | 5,385 | MCQ (4) | Multiple subjects | No | test |
| SimpleQA | 4,326 (test), 1,000 (verified) | Open-ended | Factuality, parametric knowledge | No | verified |
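In the examples below, benchmarks are passed as `name` or `name:k` strings (e.g. `gpqa:4`), where `k` requests that many generations per question. A minimal sketch of that spec convention; the helper name is ours for illustration, not part of the NeMo-Skills API:

```python
def parse_benchmark_spec(spec: str) -> tuple[str, int]:
    """Split a 'name[:k]' benchmark spec into (name, num_repeats).

    Illustrative helper only -- not part of the NeMo-Skills API.
    """
    name, sep, repeats = spec.partition(":")
    return name, int(repeats) if sep else 1
```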
## Evaluate NVIDIA-Nemotron-3-Nano on an MCQ dataset
```python
from nemo_skills.pipeline.cli import wrap_arguments, eval

cluster = "slurm"
eval(
    ctx=wrap_arguments(
        "++inference.temperature=1.0 ++inference.top_p=1.0 "
        "++inference.tokens_to_generate=131072 "
        "++chat_template_kwargs.enable_thinking=true ++parse_reasoning=True "
    ),
    cluster=cluster,
    server_type="vllm",
    server_gpus=1,
    server_args="--no-enable-prefix-caching --mamba_ssm_cache_dtype float32",
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    benchmarks="gpqa:4",
    output_dir="/workspace/Nano_V3_evals",
)
```
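With `benchmarks="gpqa:4"`, each question is answered four times, and the repeated MCQ answers can then be aggregated, for example by majority vote. An illustrative sketch of that aggregation (not the library's actual metric code):

```python
from collections import Counter

def majority_answer(answers: list[str]) -> str:
    """Return the most common extracted answer letter across repeats."""
    return Counter(answers).most_common(1)[0][0]
```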
## Evaluate NVIDIA-Nemotron-3-Nano using LLM-as-a-judge
```python
from nemo_skills.pipeline.cli import wrap_arguments, eval

cluster = "slurm"
eval(
    ctx=wrap_arguments(
        "++inference.temperature=1.0 ++inference.top_p=1.0 "
        "++inference.tokens_to_generate=131072 "
        "++chat_template_kwargs.enable_thinking=true ++parse_reasoning=True "
    ),
    cluster=cluster,
    server_type="vllm",
    server_gpus=1,
    server_args="--no-enable-prefix-caching --mamba_ssm_cache_dtype float32",
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    benchmarks="hle:4",
    output_dir="/workspace/Nano_V3_evals",
    judge_model="openai/gpt-oss-120b",
    judge_server_type="vllm",
    judge_server_gpus=8,
    judge_server_args="--async-scheduling",
    extra_judge_args="++chat_template_kwargs.reasoning_effort=high ++inference.temperature=1.0 ++inference.top_p=1.0 ++inference.tokens_to_generate=120000 ",
)
```
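The judge model grades each open-ended HLE answer in free text, which downstream code must reduce to a correct/incorrect label. A toy verdict extractor, under the assumption that the judge states its verdict with the word "correct" or "incorrect" (the actual judging prompt and parsing live inside NeMo-Skills and are more robust):

```python
import re

def extract_verdict(judge_output: str) -> bool:
    """Toy parser: True if the judge text declares the answer correct.

    Assumes a judge that states 'correct' or 'incorrect' explicitly;
    illustrative only, not the NeMo-Skills judge-parsing code.
    """
    text = judge_output.lower()
    # Reject an explicit 'incorrect' before accepting 'correct'.
    if re.search(r"\bincorrect\b", text):
        return False
    return re.search(r"\bcorrect\b", text) is not None
```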
## Evaluate NVIDIA-Nemotron-3-Nano on an MCQ dataset using tools
```python
from nemo_skills.pipeline.cli import wrap_arguments, eval

cluster = "slurm"
eval(
    ctx=wrap_arguments(
        "++inference.temperature=0.6 ++inference.top_p=0.95 "
        "++inference.tokens_to_generate=131072 "
        "++chat_template_kwargs.enable_thinking=true ++parse_reasoning=True "
        "++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool] "
    ),
    cluster=cluster,
    server_type="vllm",
    server_gpus=1,
    server_args="--no-enable-prefix-caching --mamba_ssm_cache_dtype float32 --enable-auto-tool-choice --tool-call-parser qwen3_coder",
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    benchmarks="gpqa:4",
    output_dir="/workspace/Nano_V3_evals",
    with_sandbox=True,
)
```
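`with_sandbox=True` launches a code-execution sandbox for the `PythonTool`: conceptually, the tool runs model-emitted Python and returns its stdout to the model. A bare-bones, unsandboxed sketch of that round trip (the real tool executes inside the sandbox container, which this illustration does not attempt):

```python
import contextlib
import io

def run_python(code: str) -> str:
    """Execute a code string and capture its stdout.

    Illustrative only: the real PythonTool runs in an isolated sandbox,
    while this sketch executes directly in the current process.
    """
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue()
```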