Robustness Evaluation

robust_eval is built on top of ns eval to evaluate a model on multiple benchmarks using different prompt variations.
Its purpose is to measure and analyze model robustness against changes in the prompt.

How to Use robust_eval

The usage is nearly identical to the standard ns eval, with one additional required argument, prompt_set_config, which specifies the set of prompts to evaluate for each benchmark. All other ns eval arguments can be passed here as usual, and any benchmark supported by Nemo-Skills is also supported.

Each key in the prompt_set_config.yaml file should be a benchmark name, and the corresponding value should be a list of entries containing:
- prompt_config - the path to the prompt config (required).
- extract_regex - a regex pattern for answer extraction (optional; if no value is provided, the default answer extraction is used).

Note - for MCQ datasets, the script will try to extract the answer using both regex and \boxed{}.

Expected format of prompt_set_config.yaml
(full example in nemo_skills/prompt/config/robustness/prompt_set_config.yaml)

gpqa:
  - prompt_config: robustness/mcq_v1/boxed_2
  - prompt_config: robustness/mcq_v1/angle_brackets_1
    extract_regex: '<<([A-Za-z])>>'
  ...
comp-math-24-25:
  - prompt_config: robustness/math_v1/boxed_1
  - prompt_config: robustness/math_v1/boxed_2
  ...
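
To illustrate how an extract_regex entry interacts with the default \boxed{} extraction mentioned in the MCQ note above, here is a minimal sketch (the helper function and the fallback order are illustrative assumptions, not the actual Nemo-Skills implementation):

import re

def extract_answer(generation: str, extract_regex: str | None = None) -> str | None:
    """Hypothetical helper: try the custom regex first, then fall back to \\boxed{}."""
    if extract_regex:
        match = re.search(extract_regex, generation)
        if match:
            return match.group(1)
    # fall back to the last \boxed{...} occurrence (nested braces not handled in this sketch)
    boxed = re.findall(r"\\boxed\{([^}]*)\}", generation)
    return boxed[-1] if boxed else None

# e.g. with the angle-bracket prompt from the config above
print(extract_answer("The final answer is <<B>>", r"<<([A-Za-z])>>"))  # -> B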

Run Command Example

The following command launches an ns eval on GPQA and comp-math-24-25 for every prompt specified in prompt_set_config.yaml, across 8 random seeds (the :8 in benchmark:8; the number of seeds can be set independently for each benchmark). For these datasets, we recommend 8 seeds for compute-constrained evaluations and 16 seeds for more reliable and stable results.
nemo_skills/prompt/config/robustness/prompt_set_config.yaml already contains 10 prompts for GPQA and Comp-Math-24-25.
Note that every prompt is a separate job, and all parameters are shared across jobs. For example, if num_jobs is specified, num_jobs jobs are launched per prompt, not in total.

from nemo_skills.pipeline.cli import wrap_arguments, robust_eval

robust_eval(
    ctx=wrap_arguments(
        "++inference.temperature=0.6 "
        "++inference.top_p=0.95 "
        "++parse_reasoning=True "
    ),
    prompt_set_config='robustness/prompt_set_config',  # or an absolute path to a .yaml file
    cluster=cluster_config,  # your cluster config (e.g. "local"), defined elsewhere
    model="Qwen/Qwen3-8B",
    server_type='vllm',
    output_dir="/workspace/robustness_eval/Qwen3-8B/",
    benchmarks="gpqa:8,comp-math-24-25:8",
    server_gpus=2,
    server_nodes=1,
    expname='test',
)

An example of the expected output_dir structure:

output_dir/
├── gpqa/
│   ├── prompt1/
│   │   ├── output-rs0.jsonl
│   │   └── output-rs1.jsonl
│   └── prompt2/
│       ├── output-rs0.jsonl
│       └── output-rs1.jsonl
└── comp-math-24-25/
    └── ...
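
If you want to post-process the raw generations yourself, this layout can be traversed with standard globbing. A minimal sketch (only the directory layout above is assumed; the fields inside each .jsonl record depend on the benchmark):

import json
from pathlib import Path

output_dir = Path("/workspace/robustness_eval/Qwen3-8B")

# walk benchmark / prompt / seed; the glob only matches generation files,
# so folders such as summarize_robustness produce no output here
for benchmark_dir in sorted(p for p in output_dir.iterdir() if p.is_dir()):
    for prompt_dir in sorted(p for p in benchmark_dir.iterdir() if p.is_dir()):
        for seed_file in sorted(prompt_dir.glob("output-rs*.jsonl")):
            with open(seed_file) as f:
                records = [json.loads(line) for line in f]
            print(benchmark_dir.name, prompt_dir.name, seed_file.name, len(records))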

Summarize Robustness

When all evaluations are done, summarize_robustness is automatically launched to process the generated files and produce aggregated metrics. The following metrics are calculated:

  • Aggregated Benchmark Statistics: For each benchmark across all prompts and seeds, the script calculates:

    • min, max, avg, std: Statistical metrics across all runs per benchmark.
    • prompt_sensitivity: The standard deviation of the average scores across different prompts, which measures how sensitive the model's accuracy is to prompt variations (see the sketch after this list).
  • Per-Prompt Statistics: For each prompt across all random seeds, the script calculates:

    • min, max, avg, std: Statistical metrics for a single prompt across seeds.
    • no_answer: The proportion of questions for which no answer could be extracted from the generation, either because the answer format was wrong or because no answer was given at all (useful for finding prompts that break the model's predictions).
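
To make these definitions concrete, here is a minimal sketch of how the aggregated statistics and prompt_sensitivity could be reproduced from per-run accuracies (the nested-dict input and the use of sample standard deviation are illustrative assumptions, not the exact conventions of summarize_robustness):

import statistics

# accuracy per prompt and per random seed for one benchmark (illustrative numbers)
scores = {
    "prompt_1": [54.5, 56.1, 53.0, 55.2],
    "prompt_2": [50.5, 52.3, 51.8, 53.0],
}

all_runs = [s for seeds in scores.values() for s in seeds]
per_prompt_avg = {p: statistics.mean(seeds) for p, seeds in scores.items()}

print("min/max/avg/std over all runs:",
      min(all_runs), max(all_runs),
      round(statistics.mean(all_runs), 2), round(statistics.stdev(all_runs), 2))
# prompt_sensitivity: standard deviation of the per-prompt average scores
print("prompt_sensitivity:", round(statistics.stdev(per_prompt_avg.values()), 2))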

Below is an example of the output generated by summarize_robustness, written to output_dir/summarize_robustness/main*.log. First, for each benchmark, metrics are aggregated across all prompts and seeds; then there is a per-benchmark breakdown for every prompt across seeds.
All calculated metrics are also saved to output_dir/metrics.json.

dataset              |   min   |   max   |   avg   |   std   | prompt_sensitivity
----------------------------------------------------------------------------------
comp-math-24-25@80   |  48.05  |  53.91  |  51.10  |   1.60  |  0.34
gpqa@80              |  50.51  |  60.61  |  55.51  |   2.44  |  0.77


------------------------------------- comp-math-24-25 ----------------------------
prompt@8             |   min   |   max   |   avg   |   std   | no_answer
----------------------------------------------------------------------------------
prompt_1             |  48.05  |  53.91  |  50.76  |   1.61  | 1.56
...
prompt_10            |  48.44  |  53.91  |  51.44  |   1.52  | 1.66


-------------------------------------- gpqa --------------------------------------
prompt@8             |   min   |   max   |   avg   |   std   |  no_answer
----------------------------------------------------------------------------------
prompt_1             |  50.51  |  60.61  |  54.73  |   2.68  |  3.03
...
prompt_10            |  53.54  |  60.10  |  56.28  |   1.88  |  2.78
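
Since the same metrics are saved to output_dir/metrics.json, they can also be inspected programmatically. A minimal sketch (the key layout inside the file is not documented here, so print it and adapt to your run):

import json
from pathlib import Path

metrics_path = Path("/workspace/robustness_eval/Qwen3-8B/metrics.json")
with open(metrics_path) as f:
    metrics = json.load(f)

# dump the structure to see how benchmarks, prompts and seeds are organized
print(json.dumps(metrics, indent=2)[:2000])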

Notes on Usage

  • The prompt/config/robustness folder contains 10 Math, 10 MCQ, and 7 LiveCodeBench prompts, along with prompt_set_config.yaml. The prompts vary in wording and in where the problem is placed; MCQ prompts additionally vary in the answer formatting instruction, while Math prompts all use the \boxed{} format. The prompts in prompt/config/robustness/math_prompts can be used for any Math benchmark (AIME, comp-math-24-25, etc.), and those in prompt/config/robustness/mcq_prompts for any MCQ benchmark (GPQA, MMLU-Pro, etc.); see the example config entry after this list.
  • robust_eval can be used with any dataset that Nemo-Skills supports, but summarize_robustness currently works only for Math, MCQ, and LiveCodeBench datasets and for any dataset with judge evaluation. For other datasets you can still use robust_eval to run evaluations with multiple prompts, but the summarize_robustness step won't produce aggregated metrics.
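
For example, to reuse the Math prompts for another Math benchmark, a prompt_set_config entry could look like the following (the benchmark name aime24 and the prompt paths mirror the earlier example and should be adjusted to the configs present in your checkout):

aime24:
  - prompt_config: robustness/math_v1/boxed_1
  - prompt_config: robustness/math_v1/boxed_2
  ...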