
Long-context

More details are coming soon!

Supported benchmarks

ruler

Data preparation

See an example of the data preparation command in the main evaluation docs. By default, we run the evaluation in the setup closest to the original paper, which requires starting the assistant response with an answer prefix. This is only possible through the text-completion API and might not be applicable to reasoning models, or to chat models in general. If you want to avoid pre-filling the assistant answer, use

ns prepare_data ruler --data_format chat <other arguments>

Other supported options

  • default: evaluate a non-reasoning model only, with the answer prefix.
  • base: evaluate a base model, with the answer prefix (see the example after this list).
  • chat: evaluate a chat model, whether non-reasoning or reasoning, without the answer prefix.
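
For instance, keeping the answer prefix for a base model only changes the data_format value (a sketch; <other arguments> stands for the same arguments as in the main evaluation docs):

ns prepare_data ruler --data_format base <other arguments>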

ruler2

It's recommended to use the data_dir parameter when running the evaluation. Ruler2 also requires setup, tokenizer_path, and max_seq_length to be specified. Example command to prepare the data:

ns prepare_data ruler2 \
    --cluster=<cluster config> \
    --data_dir=<mounted location to store data into> \
    --setup=<typically MODEL_NAME-LENGTH but can be any string> \
    --tokenizer_path=<model name, e.g. Qwen/Qwen3-1.7B> \
    --max_seq_length=<length you want to evaluate, e.g. 131072>
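
As a concrete illustration, preparing 131072-token data for Qwen/Qwen3-1.7B could look like this (a sketch: the local cluster config and the /workspace/ns-data mount are assumptions, substitute your own values):

ns prepare_data ruler2 \
    --cluster=local \
    --data_dir=/workspace/ns-data \
    --setup=Qwen3-1.7B-131072 \
    --tokenizer_path=Qwen/Qwen3-1.7B \
    --max_seq_length=131072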

Example evaluation command:

ns eval \
    --cluster=<cluster config> \
    --data_dir=<must match prepare_data parameter> \
    --output_dir=<any mounted output location> \
    --benchmarks=ruler2.<what you used for prepare_data setup argument> \
    --model=<model name, e.g. Qwen/Qwen3-1.7B> \
    --server_nodes=1 \
    --server_gpus=8 \
    --server_type=vllm
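
Continuing the same illustration, the matching evaluation command could be (the suffix after ruler2. must match the setup argument used during data preparation; the output_dir here is an arbitrary mounted location):

ns eval \
    --cluster=local \
    --data_dir=/workspace/ns-data \
    --output_dir=/workspace/ruler2-results/Qwen3-1.7B \
    --benchmarks=ruler2.Qwen3-1.7B-131072 \
    --model=Qwen/Qwen3-1.7B \
    --server_nodes=1 \
    --server_gpus=8 \
    --server_type=vllm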

Example scores

Scores by context length (tokens):

Model                           Avg    8192   16384  32768  65536  131072  262144  524288  1000000
Gemini 2.5 Flash Think On       91.4   94.3   93.7   91.4   88.4   89.0    -       -       -
Gemini 2.5 Flash Think Off      88.0   91.3   89.0   88.8   85.5   85.5    82.5    79.1    77.0
GPT 4.1                         89.2   91.2   90.8   89.8   87.7   86.5    80.6    74.5    75.2
Qwen3-235B-A22B-Thinking-2507   85.2   92.9   91.3   85.3   80.6   75.7    -       -       -
Qwen3-235B-A22B-Instruct-2507   83.7   87.3   85.8   84.5   82.5   78.2    65.3    53.0    36.1

For more details, see https://github.com/NVIDIA/RULER/blob/rulerv2-ns

mrcr

aalcr

Data preparation

ns prepare_data \
    --data_dir=/workspace/ns-data \
    --cluster=<cluster_config> \
    aalcr

You can also prepare a subset of the data with a limited context window by adding, for example:

    --max_context_window 100000 --setup test_100k
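
Putting it together (a sketch; test_100k is an arbitrary setup name chosen for illustration):

ns prepare_data \
    --data_dir=/workspace/ns-data \
    --cluster=<cluster_config> \
    --max_context_window 100000 \
    --setup test_100k \
    aalcr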

Running evaluation

This setup follows the official AA-LCR implementation. The judge model is Qwen3-235B-A22B-Instruct-2507, and the evaluation is repeated four times (hence the :4 suffix on the benchmark name below).

model=Qwen2.5-7B-Instruct-1M
ns eval \
    --cluster=<cluster_config> \
    --data_dir=/workspace/ns-data \
    --server_gpus=8 \
    --server_type=sglang \
    --model=/hf_models/$model \
    --benchmarks=aalcr:4 \
    --output_dir=/workspace/aalcr/$model \
    --judge_model='/hf_models/Qwen3-235B-A22B-Instruct-2507' \
    --judge_server_type='sglang' \
    --judge_server_gpus=8 \
    --server_args='--disable-cuda-graph'

The results, including per-category scores, are stored in metrics.json. Detailed breakdowns by category and sequence length are also available via:
ns summarize_results --cluster=<cluster_config> <folder_of_output_json>
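
For the AA-LCR run above, for example (pointing at the output_dir used in the eval command, or at the subdirectory where the output .jsonl files are written):

ns summarize_results --cluster=<cluster_config> /workspace/aalcr/Qwen2.5-7B-Instruct-1M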