# Dataset construction
The OpenMathReasoning dataset consists of mathematical problems collected from the AoPS community forums. Below we describe the pipeline used to create this dataset. All relevant scripts are available in the `recipes/openmathreasoning` folder.
If you don't have a Slurm cluster with a large number of GPUs, you can still try out all the steps of our pipeline by using NVIDIA NIM models. We include a 10-sample subset of the raw data in `configs/example-data.txt`, and you can switch to that data and NIM models by adding `--mode demo` to all the pipeline commands. We also use different models in this "demo" mode to make it faster, but you can change `configs/demo.yaml` to pick any other models supported at https://build.nvidia.com. Make sure to define the `NVIDIA_API_KEY` environment variable for this to work (and skip the scraping and model preparation steps, as they are not needed when using NIM models).
Finally, please go through the getting started documentation first, so that you understand how the commands below work and avoid running into errors.
## Data scraping
There is a great open-source AoPS-Instruct repository where you can find scripts to scrape the data. There is also a DeepStudentLlama/AoPS-Instruct HF dataset where the raw forum data can be found. While we didn't use that repository/dataset directly in our work, it should produce output similar to that of our internal scripts.
To download and preprocess the raw data, you can run:
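A minimal sketch of the invocation (the exact script name and path are assumptions; check the `recipes/openmathreasoning` folder for the actual file):

```bash
# Assumed script location; verify against the recipe folder in your checkout.
python recipes/openmathreasoning/scripts/prepare_raw_data.py
```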
This script renames certain columns in the original dataset to align with our scripts, combines each forum discussion into a single string, removes quotes, and truncates discussions longer than 24000 tokens. The prepared data will be saved as `raw_aops_data.jsonl`.
The output file should have ~550k rows, so all of the following commands will take a very long time and require a large number of GPUs if you run them on the full data. If you just want to try out the full pipeline, we recommend subsampling the dataset, e.g. by running
```bash
mv raw_aops_data.jsonl raw_aops_data_full.jsonl
head -n 1000 raw_aops_data_full.jsonl > raw_aops_data.jsonl
```
## Model conversion
Here are the steps to download/convert all models that we used to create this dataset.
Download the models by running the commands below on the cluster, from the path that is mounted as `/hf_models` in your cluster config.
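As a rough sketch (the exact model list comes from the pipeline configs; `Qwen/QwQ-32B` is shown only as an example), the standard Hugging Face CLI can be used for the downloads:

```bash
# Install the Hugging Face CLI and download one of the models used by the pipeline.
# Run from the directory mounted as /hf_models in your cluster config.
pip install -U "huggingface_hub[cli]"
huggingface-cli download Qwen/QwQ-32B --local-dir QwQ-32B
```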
## Problem generation pipeline
The problem generation pipeline consists of the following stages:

- Extract all problems from the first forum post (`extract_problems` stage).
- Classify whether each problem belongs to one of the following categories: proof question, binary question, multiple-choice question, invalid question (`classify_problems` stage).
- Extract answers from the forum discussions (`extract_answers` stage).
- Convert proof questions to answer questions (`convert_proofs` stage).
- Remove all binary/multiple-choice/invalid problems and merge the remaining problems with the converted proofs (`merge_data` stage).
- Decontaminate the resulting questions with popular math benchmarks (`decontaminate` stage).
You can run the full pipeline with:
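A sketch of the command (the entrypoint name is an assumption; check `recipes/openmathreasoning/pipeline` for the exact script):

```bash
# Assumed entrypoint; add --mode demo to run on the 10-sample subset with NIM models.
python recipes/openmathreasoning/pipeline/problem_generation.py
```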
You can specify a subset of stages using the `--stages` argument, e.g. `--stages extract_problems` or `--stages classify_problems,extract_answers`.
If you want to run using NVIDIA NIM models on the 10 example questions, add `--mode demo`.
## CoT solution generation pipeline
The solution generation pipeline consists of the following stages:

- Generate solutions for each of the prepared problems (`generate_solutions` stage).
- Fill the majority answer for all problems where the ground-truth answer is not known (`fill_majority_answer` stage).
- Judge answers using an LLM. Only the final answer is compared to the ground-truth (or majority) answer, not the full solution (`judge_answers` stage).
- [Optional] Generate new summaries for reasoning solutions, as candidates for replacing the original summary (`generate_new_summaries` stage).
- [Optional] Judge the new summaries. This is required to make sure we're only replacing the original summaries with valid new ones (`judge_new_summaries` stage).
- [Optional] Merge new summaries with the original reasoning solutions (`merge_new_summaries` stage).
- Filter out all incorrect solutions and prepare the data for SFT (`prepare_for_sft` stage).
You can run the full pipeline, using QwQ-32B as the solution generation model, with:
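A sketch of the command (the entrypoint is an assumption; the mode names are taken from the flags mentioned below):

```bash
# Assumed entrypoint; --mode qwq selects QwQ-32B as the solution generation model.
python recipes/openmathreasoning/pipeline/solution_generation.py --mode qwq
```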
You can specify a subset of stages using the `--stages` argument and can switch between the QwQ and R1 models using `--mode qwq` or `--mode r1`.
If you want to run using NVIDIA NIM models on the 10 example questions, add `--mode demo`.
## TIR solution generation pipeline
The Tool-Integrated Reasoning (TIR) solution generation pipeline focuses on generating solutions that leverage external tools, specifically a Python interpreter. The pipeline consists of several stages, some of which are optional:
- Generate solutions using a TIR-capable model (`generate_solutions` stage). These solutions interleave reasoning steps with executable code blocks.
- Fill the majority answer for problems without ground-truth answers (`fill_majority_answer` stage).
- Judge answers using an LLM, comparing the final answer to the ground-truth or majority answer (`judge_answers` stage).
- Postprocess generations, including filtering and potentially standardizing code block formats (`postprocess_tir_generations` stage).
- [Optional] Extract Python code fragments from solutions (`extract_python_fragments` stage).
- [Optional] Judge the novelty and significance of these fragments using an LLM (`judge_novelty` and `judge_significance` stages).
- [Optional] Filter fragments based on novelty/significance scores (`filter_fragments` stage).
- [Optional] Generate new summaries for reasoning solutions, as candidates for replacing the original summary (`generate_new_summaries` stage).
- [Optional] Judge the new summaries to make sure we're only replacing the original summaries with valid new ones (`judge_new_summaries` stage).
- [Optional] Merge new summaries with the original reasoning solutions (`merge_new_summaries` stage).
- Prepare the final dataset for SFT (`prepare_for_sft` stage).
We provide configurations for two TIR variants:

- **Using LIMO**: This variant (`tir-limo.yaml`) uses the LIMO model and includes strict filtering steps based on code fragment novelty and significance. These steps are marked with [Optional] in the list above and should typically be run together or skipped together.
- **Using OpenMath-Nemotron**: This variant (`tir-openmath.yaml`) uses our OpenMath-Nemotron-14B model. It produces solutions with higher-quality Python code, requiring less strict filtering.

You can run either variant with the commands sketched below.
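A sketch of the commands (the entrypoint and mode names are assumptions derived from the config file names; verify against the recipe folder):

```bash
# Assumed entrypoint; mode names mirror the tir-limo.yaml / tir-openmath.yaml configs.
python recipes/openmathreasoning/pipeline/solution_generation.py --mode tir-limo
# or
python recipes/openmathreasoning/pipeline/solution_generation.py --mode tir-openmath
```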
You can specify a subset of stages using the `--stages` argument for either mode.
## GenSelect generation pipeline
The GenSelect generation pipeline creates the GenSelect input-output instances. It relies on the following stages:

- Prepare instances comparing different solutions (summaries of these solutions) for a given problem (`prepare_labeling_data` stage).
- Generate solutions for the comparison instances, where we use a reasoning model to output a judgment of which solution is the top-ranking one according to the model (`label_data` stage).
- Extract judgments from the reasoning trace and filter out judgments that pick the wrong solutions (`extract_judgment` stage).
- Generate new summaries for these judgment reasoning traces (we generate 4 summaries per reasoning trace). These summaries can replace the costly reasoning traces as GenSelect targets (`generate_new_summaries` stage).
- Select the best valid summary (where the judgment matches the reasoning trace's judgment) as the target for GenSelect (`merge_new_summaries` stage).
- Prepare data for SFT using the GenSelect template (`prepare_for_sft` stage).
We provide a `qwq` configuration (`qwq.yaml`) which uses the Qwen/QwQ-32B model for labeling the comparison instances. You can run this configuration as:
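A sketch of the command (the entrypoint name is an assumption; check the recipe folder for the exact script):

```bash
# Assumed entrypoint; --mode qwq selects the qwq.yaml configuration.
python recipes/openmathreasoning/pipeline/genselect_generation.py --mode qwq
```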
As with the other pipelines, you can specify a subset of stages using the `--stages` argument.