Nemotron-Math-Proofs¶
Dataset Overview¶
Using our pipelines, we created Nemotron-Math-Proofs-v1, a large-scale mathematical reasoning dataset containing approximately 580k natural language proof problems, 550k formalizations into Lean 4 theorem statements, and 900k model-generated reasoning trajectories culminating in Lean 4 proofs.
The dataset integrates human-authored problems with systematically generated formalizations and solution traces:
- Natural Language Problems: ~580k proof problems sourced from AoPS forums, Math StackExchange, and MathOverflow, filtered for proof-based problems that can be formulated as theorems, semantically deduplicated, and decontaminated against popular benchmarks (removing ~40% of the original dataset). See our paper for details on problem sources and the extraction pipeline.
- Formal Statements: ~550k Lean 4 theorem statements generated via autoformalization
- Proof Trajectories: ~900k verified reasoning traces and proofs
Each natural language problem is formalized by gpt-oss-120b into Lean 4 theorem statements. Reasoning traces and proofs are generated by Goedel-Prover-V2-32B and verified by the Lean 4 compiler.
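The records can be inspected directly with the Hugging Face datasets library. The snippet below is a minimal sketch; the repository id and split name are assumptions, so substitute the actual path of the released dataset.
# Minimal sketch for inspecting dataset records; the repository id and split
# below are placeholders, not confirmed paths.
from datasets import load_dataset

ds = load_dataset("nvidia/Nemotron-Math-Proofs-v1", split="train")  # hypothetical repo id

example = ds[0]
print(example["problem"])           # natural language proof problem
print(example["formal_statement"])  # autoformalized Lean 4 theorem statement
print(example["lean_header"])       # header needed to compile the statement
print(example["messages"][0])       # verified proof conversation used for SFT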
This dataset was used as part of the SFT data for Nemotron-Nano-v3.
Training Results¶
We compare a Qwen3-8B model fine-tuned on this dataset against Goedel-Prover-V2-8B on miniF2F:
| Model | pass@32 (no self-correction) | pass@32 (with self-correction) |
|---|---|---|
| Goedel-Prover-V2-8B | 84.6% | 86.7%* |
| Qwen3-8B SFT on Nemotron-Math-Proofs-v1 | 85.3% | 90.2%** |
* Goedel-Prover-V2 uses 3 self-correction attempts (limited by context length). ** Our modified pipeline supports unbounded attempts; we use 8.
Nemotron-Nano-v3 (which includes this dataset in its training) achieves the following on miniF2F (all evaluations use 8 self-correction attempts):
| Model | pass@32 (no self-correction) | pass@32 (with self-correction) |
|---|---|---|
| Nemotron-Nano-v3 | 79.92% | 86.89% |
| gpt-oss-20b | 43.03% | 59.42% |
| Qwen3-30B-A3B-Thinking | 16.80% | 32.37% |
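In both tables, pass@32 counts a problem as solved if at least one of its 32 sampled attempts ends in a proof accepted by the Lean 4 compiler; with self-correction, an attempt may use several refinement turns before verification. A minimal sketch of this computation:
# pass@k as reported above: fraction of problems with at least one
# compiler-verified proof among k sampled attempts (k = 32 here).
def pass_at_k(per_problem_results: list[list[bool]]) -> float:
    """per_problem_results[i][j] is True if attempt j for problem i was verified."""
    solved = sum(any(attempts) for attempts in per_problem_results)
    return solved / len(per_problem_results)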
How to Reproduce¶
Browse the sections below to see commands for autoformalization, theorem proving, training, and evaluation.
Note
These commands assume you have /workspace defined in your cluster config.
Adjust paths and cluster settings according to your environment.
Autoformalization¶
The autoformalization pipeline translates natural language theorems into Lean 4 formal statements using an iterative
refinement process with backtranslation verification. The input is natural language math problems (the problem field
from the dataset); see
OpenMathReasoning dataset construction for how to prepare these.
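The input file is JSONL with one record per line. A minimal sketch of preparing it, under the assumption that only the problem field is required and any extra fields are carried through unchanged:
# Minimal sketch: write problems.jsonl with one {"problem": ...} record per line.
import json

problems = [
    "Prove that the sum of two even integers is even.",
    "Show that the square root of 2 is irrational.",
]

with open("problems.jsonl", "w") as f:
    for p in problems:
        f.write(json.dumps({"problem": p}) + "\n")
With the input file in place, launch the autoformalization job: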
ns generate \
--cluster=slurm \
--generation_module=nemo_skills.inference.autoformalize \
--model=openai/gpt-oss-120b \
--server_type=vllm \
--server_gpus=<NUM_GPUS> \
--input_file=/workspace/data/problems.jsonl \
--output_dir=/workspace/data/autoformalize_output \
--with_sandbox \
--num_random_seeds=1 \
++prompt_config=lean4/autoformalization \
++inference.tokens_to_generate=120000 \
++inference.temperature=1.0 \
++inference.top_p=1.0 \
++adaptive_reasoning=True \
++refinement=True \
++judge_enabled=True \
++refinement_max_turns=8 \
++backtranslation_prompt_config=lean4/backtranslation \
++judge_prompt_config=lean4/judge-backtranslation \
++refine_consistent_error_prompt_config=lean4/refinement_consistent_error \
++refine_parsing_error_prompt_config=lean4/refinement_parsing_error \
++refine_code_error_prompt_config=lean4/refinement_code_error
from nemo_skills.pipeline.cli import generate, wrap_arguments
generate(
ctx=wrap_arguments(
"++inference.tokens_to_generate=120000 "
"++prompt_config=lean4/autoformalization "
"++inference.temperature=1.0 "
"++inference.top_p=1.0 "
"++adaptive_reasoning=True "
"++refinement=True "
"++judge_enabled=True "
"++refinement_max_turns=8 "
"++backtranslation_prompt_config=lean4/backtranslation "
"++judge_prompt_config=lean4/judge-backtranslation "
"++refine_consistent_error_prompt_config=lean4/refinement_consistent_error "
"++refine_parsing_error_prompt_config=lean4/refinement_parsing_error "
"++refine_code_error_prompt_config=lean4/refinement_code_error "
),
generation_module="nemo_skills.inference.autoformalize",
cluster="slurm",
server_gpus="<NUM_GPUS>",
input_file="/workspace/data/problems.jsonl",
output_dir="/workspace/data/autoformalize_output",
server_type="vllm",
model="openai/gpt-oss-120b",
with_sandbox=True,
num_random_seeds=1,
)
The pipeline includes the following stages (a conceptual code sketch follows this list):
- Initial Formalization: LLM translates natural language to Lean 4 code
- Compilation Check: Lean 4 sandbox verifies syntactic correctness
- Backtranslation & Verification: Formal code is backtranslated and compared to the original
- Iterative Refinement: Up to 8 iterations to fix parsing, compilation, or semantic errors
- Adaptive Reasoning: Automatically reduces reasoning effort when hitting context limits
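As a conceptual illustration only, the sketch below mirrors the stages listed above (adaptive reasoning is omitted). Every helper passed in (formalize, compile_lean, backtranslate, judge_equivalent, refine) is a placeholder callable and does not correspond to an actual nemo_skills function.
# Conceptual sketch of the autoformalization loop; all callables are placeholders.
from typing import Callable, Optional

def autoformalize_loop(
    problem: str,
    formalize: Callable[[str], str],
    compile_lean: Callable[[str], str],        # returns "" when compilation succeeds
    backtranslate: Callable[[str], str],
    judge_equivalent: Callable[[str, str], bool],
    refine: Callable[[str, str, str], str],
    max_turns: int = 8,
) -> Optional[str]:
    statement = formalize(problem)                        # initial Lean 4 statement
    for _ in range(max_turns):
        errors = compile_lean(statement)                  # sandbox compilation check
        if not errors:
            restated = backtranslate(statement)           # back to natural language
            if judge_equivalent(problem, restated):       # judged against the original
                return statement                          # accepted formalization
            feedback = "backtranslation does not match the original problem"
        else:
            feedback = errors                             # parsing/compilation errors
        statement = refine(problem, statement, feedback)  # targeted refinement turn
    return None                                           # unresolved after max_turns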
Theorem Proving¶
The prover pipeline generates proofs for formalized statements with iterative error correction.
Input: formal statements from the autoformalization step (the formal_statement and lean_header fields
from the dataset).
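Before launching the prover, the autoformalization results need to be collected into a JSONL file containing those two fields. A minimal sketch, assuming each autoformalization output record carries formal_statement and lean_header keys; the file paths are illustrative placeholders.
# Minimal sketch: gather autoformalization outputs into the prover input file.
# The input path is a placeholder; point it at the actual generation output.
import json

with open("autoformalize_output/output.jsonl") as fin, \
        open("formal_statements.jsonl", "w") as fout:
    for line in fin:
        record = json.loads(line)
        if record.get("formal_statement"):  # skip problems that failed to formalize
            fout.write(json.dumps({
                "formal_statement": record["formal_statement"],
                "lean_header": record["lean_header"],
            }) + "\n")
Then launch the prover: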
ns generate \
--cluster=slurm \
--generation_module=nemo_skills.inference.prover \
--model=Goedel-LM/Goedel-Prover-V2-32B \
--server_type=vllm \
--server_gpus=<NUM_GPUS> \
--server_args="--max-model-len 40960" \
--input_file=/workspace/data/formal_statements.jsonl \
--output_dir=/workspace/data/proofs_output \
--with_sandbox \
--num_random_seeds=1 \
++prompt_config=lean4/goedel-prover-v2 \
++inference.tokens_to_generate=38912 \
++inference.temperature=1.0 \
++inference.top_p=0.95 \
++refinement=True \
++refinement_max_turns=8 \
++remove_cot=True \
++n_pass=4 \
++refinement_prompt_config=lean4/goedel-prover-v2-refinement \
++delete_wrong_turns=True \
++max_concurrent_requests=512
from nemo_skills.pipeline.cli import generate, wrap_arguments
generate(
ctx=wrap_arguments(
"++inference.tokens_to_generate=38912 "
"++inference.temperature=1.0 "
"++inference.top_p=0.95 "
"++prompt_config=lean4/goedel-prover-v2 "
"++refinement=True "
"++refinement_max_turns=8 "
"++remove_cot=True "
"++n_pass=4 "
"++refinement_prompt_config=lean4/goedel-prover-v2-refinement "
"++delete_wrong_turns=True "
"++max_concurrent_requests=512 "
),
generation_module="nemo_skills.inference.prover",
cluster="slurm",
input_file="/workspace/data/formal_statements.jsonl",
output_dir="/workspace/data/proofs_output",
model="Goedel-LM/Goedel-Prover-V2-32B",
server_type="vllm",
server_gpus="<NUM_GPUS>",
server_args="--max-model-len 40960",
num_random_seeds=1,
with_sandbox=True,
)
The proving strategy includes:
- Chain-of-thought removal: Strips reasoning, keeping only formal proof code
- Wrong turn deletion: Discards failed attempts to prevent context pollution
- Structured error feedback: Compiler errors are annotated with <error></error> tags (the tagging convention is sketched after this list)
- Pass@N with refinement: Multiple independent attempts, each with iterative refinement
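The error-feedback tagging can be pictured as follows. This is a sketch only: the actual prompt wording lives in the lean4/goedel-prover-v2-refinement config, and only the <error></error> tagging reflects the description above.
# Illustrative sketch of the structured error feedback; the surrounding prompt
# wording is a placeholder, only the <error></error> tags match the description.
def format_error_feedback(compiler_errors: list[str]) -> str:
    tagged = "\n".join(f"<error>{err}</error>" for err in compiler_errors)
    return (
        "The previous proof failed to compile with the following errors:\n"
        + tagged
        + "\nPlease fix the proof."
    )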
Model Training¶
To fine-tune a model on the Nemotron-Math-Proofs dataset:
Input: processed SFT data from the theorem proving step (the messages field from the
dataset contains verified proof conversations).
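For reference, each training record is a chat-style conversation stored under messages. The shape below is an assumption based on the standard chat format rather than a guaranteed schema; inspect the dataset for the authoritative layout.
# Illustrative shape of one SFT record; the exact schema is an assumption.
sft_record = {
    "messages": [
        {"role": "user", "content": "Complete the following Lean 4 proof: ..."},
        {"role": "assistant", "content": "Reasoning followed by a compiler-verified Lean 4 proof."},
    ]
}
Then launch SFT: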
ns sft_nemo_rl \
--cluster=slurm \
--expname=qwen3-8b-lean-sft \
--output_dir=/workspace/training/qwen3-8b-lean-sft \
--hf_model=Qwen/Qwen3-8B \
--training_data=/workspace/data/sft_data.jsonl \
--num_nodes=<NUM_NODES> \
--num_gpus=<NUM_GPUS> \
--backend=megatron \
++checkpointing.save_period=250 \
++sft.max_num_epochs=2000 \
++sft.max_num_steps=1000 \
++policy.megatron_cfg.tensor_model_parallel_size=4 \
++policy.megatron_cfg.pipeline_model_parallel_size=1 \
++policy.megatron_cfg.context_parallel_size=4 \
++policy.train_global_batch_size=2048 \
++policy.max_total_sequence_length=49152 \
++policy.megatron_cfg.optimizer.lr=1e-4 \
++policy.megatron_cfg.optimizer.min_lr=1e-4 \
++policy.megatron_cfg.scheduler.lr_warmup_iters=0
from nemo_skills.pipeline.cli import sft_nemo_rl, wrap_arguments
sft_nemo_rl(
ctx=wrap_arguments(
"++checkpointing.save_period=250 "
"++sft.max_num_epochs=2000 "
"++sft.max_num_steps=1000 "
"++policy.megatron_cfg.tensor_model_parallel_size=4 "
"++policy.megatron_cfg.pipeline_model_parallel_size=1 "
"++policy.megatron_cfg.context_parallel_size=4 "
"++policy.train_global_batch_size=2048 "
"++policy.max_total_sequence_length=49152 "
"++policy.megatron_cfg.optimizer.lr=1e-4 "
"++policy.megatron_cfg.optimizer.min_lr=1e-4 "
"++policy.megatron_cfg.scheduler.lr_warmup_iters=0 "
),
cluster="slurm",
expname="qwen3-8b-lean-sft",
backend="megatron",
output_dir="/workspace/training/qwen3-8b-lean-sft",
hf_model="Qwen/Qwen3-8B",
training_data="/workspace/data/sft_data.jsonl",
num_gpus="<NUM_GPUS>",
num_nodes="<NUM_NODES>",
)
Model Evaluation on miniF2F¶
To evaluate a model on the miniF2F benchmark with 32 samples per problem (without self-correction):
ns eval \
--cluster=slurm \
--model=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
--server_type=vllm \
--server_gpus=<NUM_GPUS> \
--output_dir=/workspace/evals/nemotron-nano-3-minif2f \
--benchmarks=minif2f:32 \
--with_sandbox \
++prompt_config=lean4/goedel-prover-v2-nemotron \
++inference.tokens_to_generate=120000 \
++inference.temperature=1.0 \
++inference.top_p=1.0 \
--extra_eval_args="++eval_config.timeout=600"
from nemo_skills.pipeline.cli import eval, wrap_arguments
eval(
ctx=wrap_arguments(
"++prompt_config=lean4/goedel-prover-v2-nemotron "
"++inference.tokens_to_generate=120000 "
"++inference.temperature=1.0 "
"++inference.top_p=1.0 "
),
cluster="slurm",
model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
server_type="vllm",
server_gpus="<NUM_GPUS>",
benchmarks="minif2f:32",
output_dir="/workspace/evals/nemotron-nano-3-minif2f",
extra_eval_args="++eval_config.timeout=600",
with_sandbox=True,
)
To evaluate with self-correction (iterative refinement based on compiler feedback):
ns generate \
--cluster=slurm \
--generation_module=nemo_skills.inference.prover \
--model=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
--server_type=vllm \
--server_gpus=<NUM_GPUS> \
--input_file=/nemo_run/code/nemo_skills/dataset/minif2f/test.jsonl \
--output_dir=/workspace/evals/nemotron-nano-3-minif2f-self-correction \
--with_sandbox \
--num_random_seeds=32 \
++prompt_config=lean4/goedel-prover-v2-nemotron \
++inference.tokens_to_generate=120000 \
++inference.temperature=1.0 \
++inference.top_p=1.0 \
++refinement=True \
++refinement_max_turns=8 \
++remove_cot=True \
++n_pass=1 \
++refinement_prompt_config=lean4/goedel-prover-v2-refinement \
++delete_wrong_turns=True \
++max_concurrent_requests=512
from nemo_skills.pipeline.cli import generate, wrap_arguments
generate(
ctx=wrap_arguments(
"++inference.tokens_to_generate=120000 "
"++inference.temperature=1.0 "
"++inference.top_p=1.0 "
"++prompt_config=lean4/goedel-prover-v2-nemotron "
"++refinement=True "
"++refinement_max_turns=8 "
"++remove_cot=True "
"++n_pass=1 "
"++refinement_prompt_config=lean4/goedel-prover-v2-refinement "
"++delete_wrong_turns=True "
"++max_concurrent_requests=512 "
),
generation_module="nemo_skills.inference.prover",
cluster="slurm",
input_file="/nemo_run/code/nemo_skills/dataset/minif2f/test.jsonl",
output_dir="/workspace/evals/nemotron-nano-3-minif2f-self-correction",
model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
server_type="vllm",
server_gpus="<NUM_GPUS>",
num_random_seeds=32,
with_sandbox=True,
)
To summarize evaluation results:
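A hedged sketch using the standard NeMo-Skills results summarization command; the exact flags may differ across versions, and the results path should point at your evaluation output directory.
ns summarize_results \
/workspace/evals/nemotron-nano-3-minif2f \
--cluster=slurm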
Known Limitations¶
- Difficulty balance: No explicit normalization of problem difficulty; implicit selection biases from pipeline stages
- Token/length normalization: Solution length is not normalized, and some solutions contain warnings (e.g., unused hypotheses) since verification only checks for compilation errors
- Placeholder definitions: Some formalizations use trivial placeholder definitions instead of Mathlib (e.g., def MyContinuous ... Prop := True)