Skip to content

Nemotron-Math-Proofs

Dataset Overview

Using our pipelines we created Nemotron-Math-Proofs-v1, a large-scale mathematical reasoning dataset containing approximately 580k natural language proof problems, 550k formalizations into theorem statements in Lean 4, and 900k model-generated reasoning trajectories culminating in Lean 4 proofs.

The dataset integrates human-authored problems with systematically generated formalizations and solution traces:

  • Natural Language Problems: ~580k proof problems sourced from AoPS forums, Math StackExchange, and MathOverflow, filtered for proof-based problems that can be formulated as theorems, semantically deduplicated, and decontaminated against popular benchmarks (removing ~40% of the original dataset). See our paper for details on problem sources and the extraction pipeline.
  • Formal Statements: ~550k Lean 4 theorem statements generated via autoformalization
  • Proof Trajectories: ~900k verified reasoning traces and proofs

Each natural language problem is formalized by gpt-oss-120b into Lean 4 theorem statements. Reasoning traces and proofs are generated by Goedel-Prover-V2-32B and verified by the Lean 4 compiler.

This dataset was used as part of the SFT data for Nemotron-Nano-v3.

Training Results

We compare a Qwen3-8B model fine-tuned on this dataset against Goedel-Prover-V2-8B on miniF2F:

Model pass@32 (no self-correction) pass@32 (with self-correction)
Goedel-Prover-V2-8B 84.6% 86.7%*
Qwen3-8B SFT on Nemotron-Math-Proofs-v1 85.3% 90.2%**

* Goedel-Prover-V2 uses 3 self-correction attempts (limited by context length). ** Our modified pipeline supports unbounded attempts; we use 8.

Nemotron-Nano-v3 (which includes this dataset in its training) achieves the following on miniF2F (all evaluations use 8 self-correction attempts):

Model pass@32 (no self-correction) pass@32 (with self-correction)
Nemotron-Nano-v3 79.92% 86.89%
gpt-oss-20b 43.03% 59.42%
Qwen3-30B-A3B-Thinking 16.80% 32.37%

How to Reproduce

Browse the sections below to see commands for autoformalization, theorem proving, training, and evaluation.

Note

These commands assume you have /workspace defined in your cluster config. Adjust paths and cluster settings according to your environment.

Autoformalization

The autoformalization pipeline translates natural language theorems into Lean 4 formal statements using an iterative refinement process with backtranslation verification. The input is natural language math problems (the problem field from the dataset)—see OpenMathReasoning dataset construction for how to prepare these.

ns generate \
    --cluster=slurm \
    --generation_module=nemo_skills.inference.autoformalize \
    --model=openai/gpt-oss-120b \
    --server_type=vllm \
    --server_gpus=<NUM_GPUS> \
    --input_file=/workspace/data/problems.jsonl \
    --output_dir=/workspace/data/autoformalize_output \
    --with_sandbox \
    --num_random_seeds=1 \
    ++prompt_config=lean4/autoformalization \
    ++inference.tokens_to_generate=120000 \
    ++inference.temperature=1.0 \
    ++inference.top_p=1.0 \
    ++adaptive_reasoning=True \
    ++refinement=True \
    ++judge_enabled=True \
    ++refinement_max_turns=8 \
    ++backtranslation_prompt_config=lean4/backtranslation \
    ++judge_prompt_config=lean4/judge-backtranslation \
    ++refine_consistent_error_prompt_config=lean4/refinement_consistent_error \
    ++refine_parsing_error_prompt_config=lean4/refinement_parsing_error \
    ++refine_code_error_prompt_config=lean4/refinement_code_error
from nemo_skills.pipeline.cli import generate, wrap_arguments

generate(
    ctx=wrap_arguments(
        "++inference.tokens_to_generate=120000 "
        "++prompt_config=lean4/autoformalization "
        "++inference.temperature=1.0 "
        "++inference.top_p=1.0 "
        "++adaptive_reasoning=True "
        "++refinement=True "
        "++judge_enabled=True "
        "++refinement_max_turns=8 "
        "++backtranslation_prompt_config=lean4/backtranslation "
        "++judge_prompt_config=lean4/judge-backtranslation "
        "++refine_consistent_error_prompt_config=lean4/refinement_consistent_error "
        "++refine_parsing_error_prompt_config=lean4/refinement_parsing_error "
        "++refine_code_error_prompt_config=lean4/refinement_code_error "
    ),
    generation_module="nemo_skills.inference.autoformalize",
    cluster="slurm",
    server_gpus="<NUM_GPUS>",
    input_file="/workspace/data/problems.jsonl",
    output_dir="/workspace/data/autoformalize_output",
    server_type="vllm",
    model="openai/gpt-oss-120b",
    with_sandbox=True,
    num_random_seeds=1,
)

The pipeline includes:

  • Initial Formalization: LLM translates natural language to Lean 4 code
  • Compilation Check: Lean 4 sandbox verifies syntactic correctness
  • Backtranslation & Verification: Formal code is backtranslated and compared to the original
  • Iterative Refinement: Up to 8 iterations to fix parsing, compilation, or semantic errors
  • Adaptive Reasoning: Automatically reduces reasoning effort when hitting context limits

Theorem Proving

The prover pipeline generates proofs for formalized statements with iterative error correction. Input: formal statements from the autoformalization step (the formal_statement and lean_header fields from the dataset).

ns generate \
    --cluster=slurm \
    --generation_module=nemo_skills.inference.prover \
    --model=Goedel-LM/Goedel-Prover-V2-32B \
    --server_type=vllm \
    --server_gpus=<NUM_GPUS> \
    --server_args="--max-model-len 40960" \
    --input_file=/workspace/data/formal_statements.jsonl \
    --output_dir=/workspace/data/proofs_output \
    --with_sandbox \
    --num_random_seeds=1 \
    ++prompt_config=lean4/goedel-prover-v2 \
    ++inference.tokens_to_generate=38912 \
    ++inference.temperature=1.0 \
    ++inference.top_p=0.95 \
    ++refinement=True \
    ++refinement_max_turns=8 \
    ++remove_cot=True \
    ++n_pass=4 \
    ++refinement_prompt_config=lean4/goedel-prover-v2-refinement \
    ++delete_wrong_turns=True \
    ++max_concurrent_requests=512
from nemo_skills.pipeline.cli import generate, wrap_arguments

generate(
    ctx=wrap_arguments(
        "++inference.tokens_to_generate=38912 "
        "++inference.temperature=1.0 "
        "++inference.top_p=0.95 "
        "++prompt_config=lean4/goedel-prover-v2 "
        "++refinement=True "
        "++refinement_max_turns=8 "
        "++remove_cot=True "
        "++n_pass=4 "
        "++refinement_prompt_config=lean4/goedel-prover-v2-refinement "
        "++delete_wrong_turns=True "
        "++max_concurrent_requests=512 "
    ),
    generation_module="nemo_skills.inference.prover",
    cluster="slurm",
    input_file="/workspace/data/formal_statements.jsonl",
    output_dir="/workspace/data/proofs_output",
    model="Goedel-LM/Goedel-Prover-V2-32B",
    server_type="vllm",
    server_gpus="<NUM_GPUS>",
    server_args="--max-model-len 40960",
    num_random_seeds=1,
    with_sandbox=True,
)

The proving strategy includes:

  • Chain-of-thought removal: Strips reasoning, keeping only formal proof code
  • Wrong turn deletion: Discards failed attempts to prevent context pollution
  • Structured error feedback: Compiler errors annotated with <error></error> tags
  • Pass@N with refinement: Multiple independent attempts, each with iterative refinement

Model Training

To fine-tune a model on the Nemotron-Math-Proofs dataset. Input: processed SFT data from the theorem proving step (the messages field from the dataset contains verified proof conversations).

ns sft_nemo_rl \
    --cluster=slurm \
    --expname=qwen3-8b-lean-sft \
    --output_dir=/workspace/training/qwen3-8b-lean-sft \
    --hf_model=Qwen/Qwen3-8B \
    --training_data=/workspace/data/sft_data.jsonl \
    --num_nodes=<NUM_NODES> \
    --num_gpus=<NUM_GPUS> \
    --backend=megatron \
    ++checkpointing.save_period=250 \
    ++sft.max_num_epochs=2000 \
    ++sft.max_num_steps=1000 \
    ++policy.megatron_cfg.tensor_model_parallel_size=4 \
    ++policy.megatron_cfg.pipeline_model_parallel_size=1 \
    ++policy.megatron_cfg.context_parallel_size=4 \
    ++policy.train_global_batch_size=2048 \
    ++policy.max_total_sequence_length=49152 \
    ++policy.megatron_cfg.optimizer.lr=1e-4 \
    ++policy.megatron_cfg.optimizer.min_lr=1e-4 \
    ++policy.megatron_cfg.scheduler.lr_warmup_iters=0
from nemo_skills.pipeline.cli import sft_nemo_rl, wrap_arguments

sft_nemo_rl(
    ctx=wrap_arguments(
        "++checkpointing.save_period=250 "
        "++sft.max_num_epochs=2000 "
        "++sft.max_num_steps=1000 "
        "++policy.megatron_cfg.tensor_model_parallel_size=4 "
        "++policy.megatron_cfg.pipeline_model_parallel_size=1 "
        "++policy.megatron_cfg.context_parallel_size=4 "
        "++policy.train_global_batch_size=2048 "
        "++policy.max_total_sequence_length=49152 "
        "++policy.megatron_cfg.optimizer.lr=1e-4 "
        "++policy.megatron_cfg.optimizer.min_lr=1e-4 "
        "++policy.megatron_cfg.scheduler.lr_warmup_iters=0 "
    ),
    cluster="slurm",
    expname="qwen3-8b-lean-sft",
    backend="megatron",
    output_dir="/workspace/training/qwen3-8b-lean-sft",
    hf_model="Qwen/Qwen3-8B",
    training_data="/workspace/data/sft_data.jsonl",
    num_gpus="<NUM_GPUS>",
    num_nodes="<NUM_NODES>",
)

Model Evaluation on miniF2F

To evaluate a model on the miniF2F benchmark with 32 samples per problem (without self-correction):

ns eval \
    --cluster=slurm \
    --model=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
    --server_type=vllm \
    --server_gpus=<NUM_GPUS> \
    --output_dir=/workspace/evals/nemotron-nano-3-minif2f \
    --benchmarks=minif2f:32 \
    --with_sandbox \
    ++prompt_config=lean4/goedel-prover-v2-nemotron \
    ++inference.tokens_to_generate=120000 \
    ++inference.temperature=1.0 \
    ++inference.top_p=1.0 \
    --extra_eval_args="++eval_config.timeout=600"
from nemo_skills.pipeline.cli import eval, wrap_arguments

eval(
    ctx=wrap_arguments(
        "++prompt_config=lean4/goedel-prover-v2-nemotron "
        "++inference.tokens_to_generate=120000 "
        "++inference.temperature=1.0 "
        "++inference.top_p=1.0 "
    ),
    cluster="slurm",
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    server_type="vllm",
    server_gpus="<NUM_GPUS>",
    benchmarks="minif2f:32",
    output_dir="/workspace/evals/nemotron-nano-3-minif2f",
    extra_eval_args="++eval_config.timeout=600",
    with_sandbox=True,
)

To evaluate with self-correction (iterative refinement based on compiler feedback):

ns generate \
    --cluster=slurm \
    --generation_module=nemo_skills.inference.prover \
    --model=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
    --server_type=vllm \
    --server_gpus=<NUM_GPUS> \
    --input_file=/nemo_run/code/nemo_skills/dataset/minif2f/test.jsonl \
    --output_dir=/workspace/evals/nemotron-nano-3-minif2f-self-correction \
    --with_sandbox \
    --num_random_seeds=32 \
    ++prompt_config=lean4/goedel-prover-v2-nemotron \
    ++inference.tokens_to_generate=120000 \
    ++inference.temperature=1.0 \
    ++inference.top_p=1.0 \
    ++refinement=True \
    ++refinement_max_turns=8 \
    ++remove_cot=True \
    ++n_pass=1 \
    ++refinement_prompt_config=lean4/goedel-prover-v2-refinement \
    ++delete_wrong_turns=True \
    ++max_concurrent_requests=512
from nemo_skills.pipeline.cli import generate, wrap_arguments

generate(
    ctx=wrap_arguments(
        "++inference.tokens_to_generate=120000 "
        "++inference.temperature=1.0 "
        "++inference.top_p=1.0 "
        "++prompt_config=lean4/goedel-prover-v2-nemotron "
        "++refinement=True "
        "++refinement_max_turns=8 "
        "++remove_cot=True "
        "++n_pass=1 "
        "++refinement_prompt_config=lean4/goedel-prover-v2-refinement "
        "++delete_wrong_turns=True "
        "++max_concurrent_requests=512 "
    ),
    generation_module="nemo_skills.inference.prover",
    cluster="slurm",
    input_file="/nemo_run/code/nemo_skills/dataset/minif2f/test.jsonl",
    output_dir="/workspace/evals/nemotron-nano-3-minif2f-self-correction",
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    server_type="vllm",
    server_gpus="<NUM_GPUS>",
    num_random_seeds=32,
    with_sandbox=True,
)

To summarize evaluation results:

ns summarize_results /workspace/evals/nemotron-nano-3-minif2f/eval-results/minif2f --cluster slurm

Known Limitations

  • Difficulty balance: No explicit normalization of problem difficulty; implicit selection biases from pipeline stages
  • Token/length normalization: Some solutions contain warnings (e.g., unused hypotheses) since verification only checks for compilation errors
  • Placeholder definitions: Some formalizations use trivial placeholder definitions instead of mathlib (e.g., def MyContinuous... Prop := True)