Nemotron-Math-Proofs¶
Dataset Overview¶
Using our pipelines, we created Nemotron-Math-Proofs-v1, a large-scale mathematical reasoning dataset containing approximately 580k natural language proof problems, 550k formalizations into Lean 4 theorem statements, and 900k model-generated reasoning trajectories culminating in Lean 4 proofs.
The dataset integrates human-authored problems with systematically generated formalizations and solution traces:
- Natural Language Problems: ~580k proof problems sourced from AoPS forums, Math StackExchange, and MathOverflow, filtered for proof-based problems that can be formulated as theorems, semantically deduplicated, and decontaminated against popular benchmarks (removing ~40% of the original dataset). See our paper for details on problem sources and the extraction pipeline.
- Formal Statements: ~550k Lean 4 theorem statements generated via autoformalization
- Proof Trajectories: ~900k verified reasoning traces and proofs
Each natural language problem is formalized by gpt-oss-120b into Lean 4 theorem statements. Reasoning traces and proofs are generated by Goedel-Prover-V2-32B and verified by the Lean 4 compiler.
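The records can be inspected directly with the Hugging Face datasets library. The snippet below is a minimal sketch; the repository id and split name are assumptions, so substitute the actual path of the released dataset.
# Minimal sketch for inspecting dataset records; the repository id and split
# below are placeholders, not confirmed paths.
from datasets import load_dataset

ds = load_dataset("nvidia/Nemotron-Math-Proofs-v1", split="train")  # hypothetical repo id

example = ds[0]
print(example["problem"])           # natural language proof problem
print(example["formal_statement"])  # autoformalized Lean 4 theorem statement
print(example["lean_header"])       # header needed to compile the statement
print(example["messages"][0])       # verified proof conversation used for SFT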
This dataset was used as part of the SFT data for Nemotron-Nano-v3.
Training Results¶
We compare a Qwen3-8B model fine-tuned on this dataset against Goedel-Prover-V2-8B on miniF2F:
| Model | pass@32 (no self-correction) | pass@32 (with self-correction) |
|---|---|---|
| Goedel-Prover-V2-8B | 84.6% | 86.7%* |
| Qwen3-8B SFT on Nemotron-Math-Proofs-v1 | 85.3% | 90.2%** |
* Goedel-Prover-V2 uses 3 self-correction attempts (limited by context length). ** Our modified pipeline supports unbounded attempts; we use 8.
Nemotron-Nano-v3 (which includes this dataset in its training) achieves the following on miniF2F (all evaluations use 8 self-correction attempts):
| Model | pass@32 (no self-correction) | pass@32 (with self-correction) |
|---|---|---|
| Nemotron-Nano-v3 | 79.92% | 86.89% |
| gpt-oss-20b | 43.03% | 59.42% |
| Qwen3-30B-A3B-Thinking | 16.80% | 32.37% |
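In both tables, pass@32 counts a problem as solved if at least one of its 32 sampled attempts ends in a proof accepted by the Lean 4 compiler; with self-correction, an attempt may use several refinement turns before verification. A minimal sketch of this computation:
# pass@k as reported above: fraction of problems with at least one
# compiler-verified proof among k sampled attempts (k = 32 here).
def pass_at_k(per_problem_results: list[list[bool]]) -> float:
    """per_problem_results[i][j] is True if attempt j for problem i was verified."""
    solved = sum(any(attempts) for attempts in per_problem_results)
    return solved / len(per_problem_results)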
How to Reproduce¶
Browse the sections below to see commands for autoformalization, theorem proving, training, and evaluation.
Note
These commands assume you have /workspace defined in your cluster config.
Adjust paths and cluster settings according to your environment.
Autoformalization¶
The autoformalization pipeline translates natural language theorems into Lean 4 formal statements using an iterative
refinement process with backtranslation verification. The input is natural language math problems (the problem field
from the dataset); see
OpenMathReasoning dataset construction for how to prepare these.
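The input file is JSONL with one record per line. A minimal sketch of preparing it, under the assumption that only the problem field is required and any extra fields are carried through unchanged:
# Minimal sketch: write problems.jsonl with one {"problem": ...} record per line.
import json

problems = [
    "Prove that the sum of two even integers is even.",
    "Show that the square root of 2 is irrational.",
]

with open("problems.jsonl", "w") as f:
    for p in problems:
        f.write(json.dumps({"problem": p}) + "\n")
With the input file in place, launch the autoformalization job: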
ns generate \
--cluster=slurm \
--generation_module=nemo_skills.inference.autoformalize \
--model=openai/gpt-oss-120b \
--server_type=vllm \
--server_gpus=<NUM_GPUS> \
--input_file=/workspace/data/problems.jsonl \
--output_dir=/workspace/data/autoformalize_output \
--with_sandbox \
--num_random_seeds=1 \
++prompt_config=lean4/autoformalization \
++inference.tokens_to_generate=120000 \
++inference.temperature=1.0 \
++inference.top_p=1.0 \
++adaptive_reasoning=True \
++refinement=True \
++judge_enabled=True \
++refinement_max_turns=8 \
++backtranslation_prompt_config=lean4/backtranslation \
++judge_prompt_config=lean4/judge-backtranslation \
++refine_consistent_error_prompt_config=lean4/refinement_consistent_error \
++refine_parsing_error_prompt_config=lean4/refinement_parsing_error \
++refine_code_error_prompt_config=lean4/refinement_code_error
from nemo_skills.pipeline.cli import generate, wrap_arguments
generate(
ctx=wrap_arguments(
"++inference.tokens_to_generate=120000 "
"++prompt_config=lean4/autoformalization "
"++inference.temperature=1.0 "
"++inference.top_p=1.0 "
"++adaptive_reasoning=True "
"++refinement=True "
"++judge_enabled=True "
"++refinement_max_turns=8 "
"++backtranslation_prompt_config=lean4/backtranslation "
"++judge_prompt_config=lean4/judge-backtranslation "
"++refine_consistent_error_prompt_config=lean4/refinement_consistent_error "
"++refine_parsing_error_prompt_config=lean4/refinement_parsing_error "
"++refine_code_error_prompt_config=lean4/refinement_code_error "
),
generation_module="nemo_skills.inference.autoformalize",
cluster="slurm",
server_gpus="<NUM_GPUS>",
input_file="/workspace/data/problems.jsonl",
output_dir="/workspace/data/autoformalize_output",
server_type="vllm",
model="openai/gpt-oss-120b",
with_sandbox=True,
num_random_seeds=1,
)
The pipeline includes the following stages (a conceptual code sketch follows this list):
- Initial Formalization: LLM translates natural language to Lean 4 code
- Compilation Check: Lean 4 sandbox verifies syntactic correctness
- Backtranslation & Verification: Formal code is backtranslated and compared to the original
- Iterative Refinement: Up to 8 iterations to fix parsing, compilation, or semantic errors
- Adaptive Reasoning: Automatically reduces reasoning effort when hitting context limits
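As a conceptual illustration only, the sketch below mirrors the stages listed above (adaptive reasoning is omitted). Every helper passed in (formalize, compile_lean, backtranslate, judge_equivalent, refine) is a placeholder callable and does not correspond to an actual nemo_skills function.
# Conceptual sketch of the autoformalization loop; all callables are placeholders.
from typing import Callable, Optional

def autoformalize_loop(
    problem: str,
    formalize: Callable[[str], str],
    compile_lean: Callable[[str], str],        # returns "" when compilation succeeds
    backtranslate: Callable[[str], str],
    judge_equivalent: Callable[[str, str], bool],
    refine: Callable[[str, str, str], str],
    max_turns: int = 8,
) -> Optional[str]:
    statement = formalize(problem)                        # initial Lean 4 statement
    for _ in range(max_turns):
        errors = compile_lean(statement)                  # sandbox compilation check
        if not errors:
            restated = backtranslate(statement)           # back to natural language
            if judge_equivalent(problem, restated):       # judged against the original
                return statement                          # accepted formalization
            feedback = "backtranslation does not match the original problem"
        else:
            feedback = errors                             # parsing/compilation errors
        statement = refine(problem, statement, feedback)  # targeted refinement turn
    return None                                           # unresolved after max_turns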
Theorem Proving¶
The prover pipeline generates proofs for formalized statements with iterative error correction.
Input: formal statements from the autoformalization step (the formal_statement and lean_header fields
from the dataset).
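Before launching the prover, the autoformalization results need to be collected into a JSONL file containing those two fields. A minimal sketch, assuming each autoformalization output record carries formal_statement and lean_header keys; the file paths are illustrative placeholders.
# Minimal sketch: gather autoformalization outputs into the prover input file.
# The input path is a placeholder; point it at the actual generation output.
import json

with open("autoformalize_output/output.jsonl") as fin, \
        open("formal_statements.jsonl", "w") as fout:
    for line in fin:
        record = json.loads(line)
        if record.get("formal_statement"):  # skip problems that failed to formalize
            fout.write(json.dumps({
                "formal_statement": record["formal_statement"],
                "lean_header": record["lean_header"],
            }) + "\n")
Then launch the prover: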
ns generate \
--cluster=slurm \
--generation_module=nemo_skills.inference.prover \
--model=Goedel-LM/Goedel-Prover-V2-32B \
--server_type=vllm \
--server_gpus=<NUM_GPUS> \
--server_args="--max-model-len 40960" \
--input_file=/workspace/data/formal_statements.jsonl \
--output_dir=/workspace/data/proofs_output \
--with_sandbox \
--num_random_seeds=1 \
++prompt_config=lean4/goedel-prover-v2 \
++inference.tokens_to_generate=38912 \
++inference.temperature=1.0 \
++inference.top_p=0.95 \
++refinement=True \
++refinement_max_turns=8 \
++remove_cot=True \
++n_pass=4 \
++refinement_prompt_config=lean4/goedel-prover-v2-refinement \
++delete_wrong_turns=True \
++max_concurrent_requests=512
from nemo_skills.pipeline.cli import generate, wrap_arguments
generate(
ctx=wrap_arguments(
"++inference.tokens_to_generate=38912 "
"++inference.temperature=1.0 "
"++inference.top_p=0.95 "
"++prompt_config=lean4/goedel-prover-v2 "
"++refinement=True "
"++refinement_max_turns=8 "
"++remove_cot=True "
"++n_pass=4 "
"++refinement_prompt_config=lean4/goedel-prover-v2-refinement "
"++delete_wrong_turns=True "
"++max_concurrent_requests=512 "
),
generation_module="nemo_skills.inference.prover",
cluster="slurm",
input_file="/workspace/data/formal_statements.jsonl",
output_dir="/workspace/data/proofs_output",
model="Goedel-LM/Goedel-Prover-V2-32B",
server_type="vllm",
server_gpus="<NUM_GPUS>",
server_args="--max-model-len 40960",
num_random_seeds=1,
with_sandbox=True,
)
The proving strategy includes:
- Chain-of-thought removal: Strips reasoning, keeping only formal proof code
- Wrong turn deletion: Discards failed attempts to prevent context pollution
- Structured error feedback: Compiler errors are annotated with <error></error> tags (the tagging convention is sketched after this list)
- Pass@N with refinement: Multiple independent attempts, each with iterative refinement
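The error-feedback tagging can be pictured as follows. This is a sketch only: the actual prompt wording lives in the lean4/goedel-prover-v2-refinement config, and only the <error></error> tagging reflects the description above.
# Illustrative sketch of the structured error feedback; the surrounding prompt
# wording is a placeholder, only the <error></error> tags match the description.
def format_error_feedback(compiler_errors: list[str]) -> str:
    tagged = "\n".join(f"<error>{err}</error>" for err in compiler_errors)
    return (
        "The previous proof failed to compile with the following errors:\n"
        + tagged
        + "\nPlease fix the proof."
    )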
Model Training¶
To fine-tune a model on the Nemotron-Math-Proofs dataset:
Input: processed SFT data from the theorem proving step (the messages field from the
dataset contains verified proof conversations).
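For reference, each training record is a chat-style conversation stored under messages. The shape below is an assumption based on the standard chat format rather than a guaranteed schema; inspect the dataset for the authoritative layout.
# Illustrative shape of one SFT record; the exact schema is an assumption.
sft_record = {
    "messages": [
        {"role": "user", "content": "Complete the following Lean 4 proof: ..."},
        {"role": "assistant", "content": "Reasoning followed by a compiler-verified Lean 4 proof."},
    ]
}
Then launch SFT: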
ns sft_nemo_rl \
--cluster=slurm \
--expname=qwen3-8b-lean-sft \
--output_dir=/workspace/training/qwen3-8b-lean-sft \
--hf_model=Qwen/Qwen3-8B \
--training_data=/workspace/data/sft_data.jsonl \
--num_nodes=<NUM_NODES> \
--num_gpus=<NUM_GPUS> \
--backend=megatron \
++checkpointing.save_period=250 \
++sft.max_num_epochs=2000 \
++sft.max_num_steps=1000 \
++policy.megatron_cfg.tensor_model_parallel_size=4 \
++policy.megatron_cfg.pipeline_model_parallel_size=1 \
++policy.megatron_cfg.context_parallel_size=4 \
++policy.train_global_batch_size=2048 \
++policy.max_total_sequence_length=49152 \
++policy.megatron_cfg.optimizer.lr=1e-4 \
++policy.megatron_cfg.optimizer.min_lr=1e-4 \
++policy.megatron_cfg.scheduler.lr_warmup_iters=0
from nemo_skills.pipeline.cli import sft_nemo_rl, wrap_arguments
sft_nemo_rl(
ctx=wrap_arguments(
"++checkpointing.save_period=250 "
"++sft.max_num_epochs=2000 "
"++sft.max_num_steps=1000 "
"++policy.megatron_cfg.tensor_model_parallel_size=4 "
"++policy.megatron_cfg.pipeline_model_parallel_size=1 "
"++policy.megatron_cfg.context_parallel_size=4 "
"++policy.train_global_batch_size=2048 "
"++policy.max_total_sequence_length=49152 "
"++policy.megatron_cfg.optimizer.lr=1e-4 "
"++policy.megatron_cfg.optimizer.min_lr=1e-4 "
"++policy.megatron_cfg.scheduler.lr_warmup_iters=0 "
),
cluster="slurm",
expname="qwen3-8b-lean-sft",
backend="megatron",
output_dir="/workspace/training/qwen3-8b-lean-sft",
hf_model="Qwen/Qwen3-8B",
training_data="/workspace/data/sft_data.jsonl",
num_gpus="<NUM_GPUS>",
num_nodes="<NUM_NODES>",
)
Model Evaluation on miniF2F¶
To evaluate a model on the miniF2F benchmark with 32 samples per problem (without self-correction):
ns eval \
--cluster=slurm \
--model=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
--server_type=vllm \
--server_gpus=<NUM_GPUS> \
--output_dir=/workspace/evals/nemotron-nano-3-minif2f \
--benchmarks=minif2f:32 \
--with_sandbox \
++prompt_config=lean4/goedel-prover-v2-nemotron \
++inference.tokens_to_generate=120000 \
++inference.temperature=1.0 \
++inference.top_p=1.0 \
--extra_eval_args="++eval_config.timeout=600"
from nemo_skills.pipeline.cli import eval, wrap_arguments
eval(
ctx=wrap_arguments(
"++prompt_config=lean4/goedel-prover-v2-nemotron "
"++inference.tokens_to_generate=120000 "
"++inference.temperature=1.0 "
"++inference.top_p=1.0 "
),
cluster="slurm",
model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
server_type="vllm",
server_gpus="<NUM_GPUS>",
benchmarks="minif2f:32",
output_dir="/workspace/evals/nemotron-nano-3-minif2f",
extra_eval_args="++eval_config.timeout=600",
with_sandbox=True,
)
To evaluate with self-correction (iterative refinement based on compiler feedback):
ns generate \
--cluster=slurm \
--generation_module=nemo_skills.inference.prover \
--model=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
--server_type=vllm \
--server_gpus=<NUM_GPUS> \
--input_file=/nemo_run/code/nemo_skills/dataset/minif2f/test.jsonl \
--output_dir=/workspace/evals/nemotron-nano-3-minif2f-self-correction \
--with_sandbox \
--num_random_seeds=32 \
++prompt_config=lean4/goedel-prover-v2-nemotron \
++inference.tokens_to_generate=120000 \
++inference.temperature=1.0 \
++inference.top_p=1.0 \
++refinement=True \
++refinement_max_turns=8 \
++remove_cot=True \
++n_pass=1 \
++refinement_prompt_config=lean4/goedel-prover-v2-refinement \
++delete_wrong_turns=True \
++max_concurrent_requests=512
from nemo_skills.pipeline.cli import generate, wrap_arguments
generate(
ctx=wrap_arguments(
"++inference.tokens_to_generate=120000 "
"++inference.temperature=1.0 "
"++inference.top_p=1.0 "
"++prompt_config=lean4/goedel-prover-v2-nemotron "
"++refinement=True "
"++refinement_max_turns=8 "
"++remove_cot=True "
"++n_pass=1 "
"++refinement_prompt_config=lean4/goedel-prover-v2-refinement "
"++delete_wrong_turns=True "
"++max_concurrent_requests=512 "
),
generation_module="nemo_skills.inference.prover",
cluster="slurm",
input_file="/nemo_run/code/nemo_skills/dataset/minif2f/test.jsonl",
output_dir="/workspace/evals/nemotron-nano-3-minif2f-self-correction",
model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
server_type="vllm",
server_gpus="<NUM_GPUS>",
num_random_seeds=32,
with_sandbox=True,
)
To summarize evaluation results:
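A hedged sketch using the standard NeMo-Skills results summarization command; the exact flags may differ across versions, and the results path should point at your evaluation output directory.
ns summarize_results \
/workspace/evals/nemotron-nano-3-minif2f \
--cluster=slurm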
Known Limitations¶
- Difficulty balance: No explicit normalization of problem difficulty; implicit selection biases from pipeline stages
- Token/length normalization: Solution length is not normalized, and some solutions contain warnings (e.g., unused hypotheses) since verification only checks for compilation errors
- Placeholder definitions: Some formalizations use trivial placeholder definitions instead of Mathlib (e.g., def MyContinuous ... Prop := True)