# Dataset construction
The OpenMathReasoning dataset consists of mathematical problems collected from the AoPS community forums. Below we describe the pipeline used to create this dataset. All relevant scripts are available in the `recipes/openmathreasoning` folder.
If you don't have a Slurm cluster with a large number of GPUs, you can still try out all the steps of our pipeline by using NVIDIA NIM models. We include a 10-sample subset of the raw data in `configs/example-data.txt`, and you can switch to that data and NIM models by adding `--mode demo` to all the pipeline commands. We also use different models in this "demo" mode to make it faster, but you can change `configs/demo.yaml` to pick any other models supported at https://build.nvidia.com. Make sure to define the `NVIDIA_API_KEY` environment variable for this to work (and skip the scraping and model preparation steps, as they are not needed when using NIM models).
Finally, please go through the getting started documentation first, so that you understand how the commands below work and avoid running into errors.
## Data scraping
There is a great open-source AoPS-Instruct repository where you can find scripts to scrape the data. There is also a DeepStudentLlama/AoPS-Instruct HF dataset where the raw forum data can be found. While we didn't use that repository/dataset directly in our work, it should produce output similar to that of our internal scripts.
To download and preprocess the raw data, you can run:
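A minimal sketch of the invocation (the exact script name and path are assumptions; check the `recipes/openmathreasoning` folder for the actual file):

```bash
# Assumed script location; verify against the recipe folder in your checkout.
python recipes/openmathreasoning/scripts/prepare_raw_data.py
```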
This script renames certain columns in the original dataset to align with our scripts, combines each forum discussion into a single string, removes quotes, and truncates discussions longer than 24000 tokens. The prepared data will be saved as `raw_aops_data.jsonl`.
The output file should have ~550k rows, so all of the following commands will take a very long time and require a large number of GPUs if you run them on the full data. If you just want to try out the full pipeline, we recommend subsampling the dataset, e.g. by running
```bash
mv raw_aops_data.jsonl raw_aops_data_full.jsonl
head -n 1000 raw_aops_data_full.jsonl > raw_aops_data.jsonl
```
## Model conversion
Here are the steps to download/convert all models that we used to create this dataset.
Download the models by running the commands below on the cluster, from the path that is mounted as `/hf_models` in your cluster config.
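As a rough sketch (the exact model list comes from the pipeline configs; `Qwen/QwQ-32B` is shown only as an example), the standard Hugging Face CLI can be used for the downloads:

```bash
# Install the Hugging Face CLI and download one of the models used by the pipeline.
# Run from the directory mounted as /hf_models in your cluster config.
pip install -U "huggingface_hub[cli]"
huggingface-cli download Qwen/QwQ-32B --local-dir QwQ-32B
```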
## Problem generation pipeline
The problem generation pipeline consists of the following stages:

- Extract all problems from the first forum post (`extract_problems` stage).
- Classify whether each problem belongs to one of the following categories: proof question, binary question, multiple-choice question, invalid question (`classify_problems` stage).
- Extract answers from the forum discussions (`extract_answers` stage).
- Convert proof questions to answer questions (`convert_proofs` stage).
- Remove all binary/multiple-choice/invalid problems and merge the remaining problems with the converted proofs (`merge_data` stage).
- Decontaminate the resulting questions with popular math benchmarks (`decontaminate` stage).
You can run the full pipeline with:
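A sketch of the command (the entrypoint name is an assumption; check `recipes/openmathreasoning/pipeline` for the exact script):

```bash
# Assumed entrypoint; add --mode demo to run on the 10-sample subset with NIM models.
python recipes/openmathreasoning/pipeline/problem_generation.py
```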
You can specify a subset of stages using the `--stages` argument, e.g. `--stages extract_problems` or `--stages classify_problems,extract_answers`.
If you want to run using NVIDIA NIM models on the 10 example questions, add `--mode demo`.
## CoT solution generation pipeline
The solution generation pipeline consists of the following stages:

- Generate solutions for each of the prepared problems (`generate_solutions` stage).
- Fill the majority answer for all problems where the ground-truth answer is not known (`fill_majority_answer` stage).
- Judge answers using an LLM. Only the final answer is compared to the ground-truth (or majority) answer, not the full solution (`judge_answers` stage).
- [Optional] Generate new summaries for reasoning solutions, as candidates for replacing the original summary (`generate_new_summaries` stage).
- [Optional] Judge the new summaries. This is required to make sure we're only replacing the original summaries with valid new ones (`judge_new_summaries` stage).
- [Optional] Merge new summaries with the original reasoning solutions (`merge_new_summaries` stage).
- Filter out all incorrect solutions and prepare the data for SFT (`prepare_for_sft` stage).
You can run the full pipeline, using QwQ-32B as the solution generation model, with:
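A sketch of the command (the entrypoint is an assumption; the mode names are taken from the flags mentioned below):

```bash
# Assumed entrypoint; --mode qwq selects QwQ-32B as the solution generation model.
python recipes/openmathreasoning/pipeline/solution_generation.py --mode qwq
```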
You can specify a subset of stages using the `--stages` argument and can switch between the QwQ and R1 models using `--mode qwq` or `--mode r1`.
If you want to run using NVIDIA NIM models on the 10 example questions, add `--mode demo`.
## TIR solution generation pipeline
The Tool-Integrated Reasoning (TIR) solution generation pipeline focuses on generating solutions that leverage external tools, specifically a Python interpreter. The pipeline consists of several stages, some of which are optional:
- Generate solutions using a TIR-capable model (`generate_solutions` stage). These solutions interleave reasoning steps with executable code blocks.
- Fill the majority answer for problems without ground-truth answers (`fill_majority_answer` stage).
- Judge answers using an LLM, comparing the final answer to the ground-truth or majority answer (`judge_answers` stage).
- Postprocess generations, including filtering and potentially standardizing code block formats (`postprocess_tir_generations` stage).
- [Optional] Extract Python code fragments from solutions (`extract_python_fragments` stage).
- [Optional] Judge the novelty and significance of these fragments using an LLM (`judge_novelty` and `judge_significance` stages).
- [Optional] Filter fragments based on novelty/significance scores (`filter_fragments` stage).
- [Optional] Generate new summaries for reasoning solutions, as candidates for replacing the original summary (`generate_new_summaries` stage).
- [Optional] Judge the new summaries to make sure we're only replacing the original summaries with valid new ones (`judge_new_summaries` stage).
- [Optional] Merge new summaries with the original reasoning solutions (`merge_new_summaries` stage).
- Prepare the final dataset for SFT (`prepare_for_sft` stage).
We provide configurations for two TIR variants:

- **Using LIMO**: This variant (`tir-limo.yaml`) uses the LIMO model and includes strict filtering steps based on code fragment novelty and significance. These steps are marked with [Optional] in the list above and should typically be run together or skipped together.
- **Using OpenMath-Nemotron**: This variant (`tir-openmath.yaml`) uses our OpenMath-Nemotron-14B model. It produces solutions with higher-quality Python code, requiring less strict filtering.

You can run either variant with the commands sketched below.
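A sketch of the commands (the entrypoint and mode names are assumptions derived from the config file names; verify against the recipe folder):

```bash
# Assumed entrypoint; mode names mirror the tir-limo.yaml / tir-openmath.yaml configs.
python recipes/openmathreasoning/pipeline/solution_generation.py --mode tir-limo
# or
python recipes/openmathreasoning/pipeline/solution_generation.py --mode tir-openmath
```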
You can specify a subset of stages using the `--stages` argument for either mode.
## GenSelect generation pipeline
The GenSelect generation pipeline creates the GenSelect input-output instances. It relies on the following stages:

- Prepare instances comparing different solutions (summaries of these solutions) for a given problem (`prepare_labeling_data` stage).
- Generate solutions for the comparison instances, where we use a reasoning model to output a judgment of which solution is the top-ranking one according to the model (`label_data` stage).
- Extract judgments from the reasoning trace and filter out judgments that pick the wrong solutions (`extract_judgment` stage).
- Generate new summaries for these judgment reasoning traces (we generate 4 summaries per reasoning trace). These summaries can replace the costly reasoning traces as GenSelect targets (`generate_new_summaries` stage).
- Select the best valid summary (where the judgment matches the reasoning trace's judgment) as the target for GenSelect (`merge_new_summaries` stage).
- Prepare data for SFT using the GenSelect template (`prepare_for_sft` stage).
We provide a `qwq` configuration (`qwq.yaml`) which uses the Qwen/QwQ-32B model for labeling the comparison instances. You can run this configuration as:
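A sketch of the command (the entrypoint name is an assumption; check the recipe folder for the exact script):

```bash
# Assumed entrypoint; --mode qwq selects the qwq.yaml configuration.
python recipes/openmathreasoning/pipeline/genselect_generation.py --mode qwq
```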
As with the other pipelines, you can specify a subset of stages using the `--stages` argument.