Scientific knowledge

More details are coming soon!

Supported benchmarks

hle

  • Benchmark is defined in nemo_skills/dataset/hle/__init__.py
  • Original benchmark source is here.
  • The text split includes all non-image examples and is further divided by subject into eng, chem, bio, cs, phy, math, human, and other; currently, all of these splits contain only text data. To evaluate a single subject split, pass it via split=, as in the example below.
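
A minimal sketch of evaluating a single hle subject split (chem here), using the same eval() pipeline API as the configurations further down this page. The model, judge, and output path are placeholders, not a recommended setup.

from nemo_skills.pipeline.cli import wrap_arguments, eval

eval(
    ctx=wrap_arguments("++inference.temperature=1.0 ++inference.tokens_to_generate=65536"),
    cluster="slurm",
    expname="hle-chem-example",
    model="openai/gpt-oss-120b",        # placeholder model
    server_type="vllm",
    server_gpus=8,
    benchmarks="hle",
    split="chem",                       # any of: text, eng, chem, bio, cs, phy, math, human, other
    output_dir="/workspace/hle-chem-example",
    judge_model="openai/gpt-oss-120b",  # placeholder judge
    judge_server_type="vllm",
    judge_server_gpus=8,
)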

SimpleQA

  • Benchmark is defined in nemo_skills/dataset/simpleqa/__init__.py
  • Original benchmark source code for SimpleQA (OpenAI) is here and the leaderboard is here. An improved version from Google with 1,000 examples, SimpleQA-verified, is here.
  • To use SimpleQA-verified, set split=verified. To use the original version of SimpleQA, set split=test.

In the configurations below, gpt-oss-120b also serves as the judge model.

Configuration: gpt-oss-120b with builtin tool (python)

from nemo_skills.pipeline.cli import wrap_arguments, eval
cluster = 'slurm'

eval(
    ctx=wrap_arguments(
                "++inference.temperature=1.0 ++inference.tokens_to_generate=65536 "
                "++code_tags=gpt-oss ++server.code_execution.max_code_executions=100 "
                "++inference.endpoint_type=text ++chat_template_kwargs.builtin_tools=[python] "
                "++chat_template_kwargs.reasoning_effort=high ++code_execution=true "
                "++parse_reasoning=True "
                '\'++end_reasoning_string="<|start|>assistant<|channel|>final<|message|>"\''
    ),
    cluster=cluster,
    expname="simpleqa-gpt-oss-120b-tool-output-only",
    model="openai/gpt-oss-120b",
    server_type="vllm",
    server_gpus=8,
    server_args="--async-scheduling",
    benchmarks="simpleqa:2",
    split="verified",
    output_dir="/workspace/simpleqa-gpt-oss-120b-tool-output-only",
    with_sandbox=True,
    judge_model="openai/gpt-oss-120b",
    judge_server_type="vllm",
    judge_server_gpus=8,
    judge_server_args="--async-scheduling  --reasoning-parser GptOss",
)
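
The ++parse_reasoning=True and ++end_reasoning_string=... options strip the reasoning channel from each generation before it is passed to the judge. Conceptually, this amounts to keeping only the text after the final-channel marker, as in the hypothetical helper below (an illustration of the idea, not the actual nemo_skills implementation):

# Hypothetical helper illustrating what parse_reasoning does conceptually:
# everything up to and including the final-channel marker is treated as
# reasoning and dropped, so the judge only sees the final answer text.
END_REASONING = "<|start|>assistant<|channel|>final<|message|>"

def strip_reasoning(generation: str, end_reasoning_string: str = END_REASONING) -> str:
    """Return only the text after the last end-of-reasoning marker."""
    marker_pos = generation.rfind(end_reasoning_string)
    if marker_pos == -1:
        # No marker found: keep the full generation unchanged.
        return generation
    return generation[marker_pos + len(end_reasoning_string):]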

Configuration: gpt-oss-120b without tool

from nemo_skills.pipeline.cli import wrap_arguments, eval
cluster = 'slurm'
eval(
    ctx=wrap_arguments(
                "++inference.temperature=1.0 ++inference.tokens_to_generate=100000 "
                "++inference.extra_body.reasoning_effort=high "
    ),
    cluster="ord",
    expname="simpleqa-gpt-oss-120b-notool",
    model="openai/gpt-oss-120b",
    server_type="vllm",
    server_gpus=8,
    server_args="--async-scheduling --reasoning-parser GptOss",
    benchmarks="simpleqa:2",
    split="verified",
    output_dir="/workspace/simpleqa-gpt-oss-120b-notool",
    judge_model="openai/gpt-oss-120b",
    judge_server_type="vllm",
    judge_server_gpus=8,
    judge_server_args="--async-scheduling  --reasoning-parser GptOss",
)

Note

The name of the reasoning parser differs across vLLM versions: depending on your version, it is registered as either openai_gptoss or GptOss (in the latest main branch it is openai_gptoss). Check gptoss_reasoning_parser.py in your vLLM installation to confirm which name your environment uses, for example with the sketch below.
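
A small sketch for checking the vLLM install on the server node. The module path vllm.reasoning.gptoss_reasoning_parser is an assumption based on the file name mentioned above and may differ between vLLM versions; the script degrades gracefully if the path does not exist.

import importlib.metadata
import importlib.util

# Print the installed vLLM version.
print("vllm version:", importlib.metadata.version("vllm"))

# Locate the gpt-oss reasoning parser source file (assumed module path) so you
# can open it and see which parser name it registers.
try:
    spec = importlib.util.find_spec("vllm.reasoning.gptoss_reasoning_parser")
    print("parser module:", spec.origin if spec else "not found under this path")
except ModuleNotFoundError:
    print("vllm.reasoning package not found in this vLLM version")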

Result

We also tested a variant in which the full generation, including the reasoning portion, was provided to the judge (parse_reasoning disabled). This configuration, labeled simpleqa-gpt-oss-120b-tool-full-generation, produced results nearly identical to the standard setup, where the reasoning portion is excluded from the judge's input.

Run Name                                    pass@1  majority@2  pass@2
simpleqa-gpt-oss-120b-notool                 12.93       12.93   17.22
simpleqa-gpt-oss-120b-tool-full-generation   80.30       80.30   84.78
simpleqa-gpt-oss-120b-tool-output-only       79.51       79.51   83.74

For comparison, the reported number for the no-tool setting (simpleqa-gpt-oss-120b-notool) is 13.1%, according to this Kaggle page.

FrontierScience-Olympiad

  • Benchmark is defined in nemo_skills/dataset/frontierscience-olympiad/__init__.py
  • Original benchmark source is here.
  • Contains 100 short-answer questions crafted by international science olympiad medalists across physics, chemistry, and biology.
  • Available splits: physics, chemistry, biology, and all (all subjects combined, default).

Configuration: gpt-oss-20b with builtin tool (python)

from nemo_skills.pipeline.cli import wrap_arguments, eval

eval(
    ctx=wrap_arguments(
        "++inference.temperature=1.0 ++inference.tokens_to_generate=65536 "
        "++code_tags=gpt-oss ++server.code_execution.max_code_executions=100 "
        "++inference.endpoint_type=text ++chat_template_kwargs.builtin_tools=[python] "
        "++chat_template_kwargs.reasoning_effort=high ++code_execution=true"
    ),
    cluster="slurm",
    expname="ghb-model_gpt_oss_20b",
    model="openai/gpt-oss-20b",
    server_type="vllm",
    server_gpus=4,
    server_args="--async-scheduling",
    benchmarks="frontierscience-olympiad:20",
    split="all",
    num_chunks=1,
    output_dir="/workspace/frontierscience-ghb-model_gpt_oss_20b",
    with_sandbox=True,
    wandb_project="frontier",
    wandb_name="frontierscience-ghb-model_gpt_oss_20b",
    judge_model="openai/gpt-oss-120b",
    judge_server_type="vllm",
    judge_server_gpus=8,
    judge_server_args="--async-scheduling",
)

Configuration: gpt-oss-120b without tool

from nemo_skills.pipeline.cli import wrap_arguments, eval

eval(
    ctx=wrap_arguments(
        "++inference.temperature=1.0 ++inference.tokens_to_generate=65536 "
        "++inference.extra_body.reasoning_effort=high"
    ),
    cluster="slurm",
    expname="ghn-model_gpt_oss_120b",
    model="openai/gpt-oss-120b",
    server_type="vllm",
    server_gpus=8,
    server_args="--async-scheduling",
    benchmarks="frontierscience-olympiad:20",
    split="all",
    num_chunks=1,
    output_dir="/workspace/frontierscience-ghn-model_gpt_oss_120b",
    wandb_project="frontier",
    wandb_name="frontierscience-ghn-model_gpt_oss_120b",
    judge_model="openai/gpt-oss-120b",
    judge_server_type="vllm",
    judge_server_gpus=8,
    judge_server_args="--async-scheduling",
)

Result

Run Name                         pass@1  majority@8  pass@8
gpt-oss-20b (no tool)             49.74       47.00   71.98
gpt-oss-20b (with python tool)    36.94       37.38   73.61
gpt-oss-120b (no tool)            60.53       61.13   79.25
gpt-oss-120b (with python tool)   54.05       53.00   80.07

SuperGPQA

scicode

Note

For scicode, we evaluate by default on the combined dev + test split (containing 80 problems and 338 subtasks) for consistency with the AAI evaluation methodology. To evaluate only on the test set, use --split=test; see the sketch below for the equivalent Python call.
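
A minimal sketch of restricting scicode to the test split with the eval() pipeline API used elsewhere on this page. The model, server, and output settings are placeholders; with_sandbox=True is included on the assumption that scicode executes generated code during grading.

from nemo_skills.pipeline.cli import wrap_arguments, eval

eval(
    ctx=wrap_arguments("++inference.temperature=1.0"),
    cluster="slurm",
    expname="scicode-test-only",
    model="openai/gpt-oss-120b",   # placeholder model
    server_type="vllm",
    server_gpus=8,
    benchmarks="scicode",
    split="test",                  # default is the combined dev + test split
    output_dir="/workspace/scicode-test-only",
    with_sandbox=True,             # assumption: generated code is executed for grading
)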

gpqa

mmlu-pro

mmlu

mmlu-redux