Neural nets that do symbolic mathematics, logic and other reasoning tasks

2019-12-08 — 2026-06-27

Wherein Solver and Prover Models Are Distinguished, the Lean Compiler Is Observed to Function as an Objective Verifier, and the Gradual Consolidation of Both Families Into Agent Harnesses Is Noted.

compsci
language
machine learning
meta learning
networks
neural nets
NLP
stringology

Much of this is slop rn, expanded from my dot points and research notes publishing because it is useful, but expect a lot of edits as I clean it up and add the applied notes.

Somewhere between computational symbolic mathematics, automated proof assistants and the large language models may be found the models that solve mathematical problems. General-purpose LLMs do OK at this; we have spent the last few years working, with variable success, on specializations which do better.

This is the theory page. There is an applied-math companion.

Figure 1

1 Axes of variation

Mathematical AI systems vary along, as far as I can tell, five axes.

The Target is what counts as an answer: a boxed number, a natural-language proof, a formal Lean proof, or even a program bundled with a specification and a machine-checked proof that the two agree. The choice of the type of target conditions much of the rest of the stack.

Granularity is the grain at which we generate and check. A whole answer in one shot; a vote over whole answers; confidence attached to each claim in a trajectory; a proof grown one tactic at a time; or a theorem split into lemmas that are proved separately and reassembled. Does the system decompose a big question into smaller ones iteratively?

Verification is what tells us an answer is right: It could be “looking credible” or a more sophisticated check: a majority vote, a separately-trained reward model that scores candidates, the model grading its own work, or an external oracle like the Lean compiler.

The model can be specialized on maths to a greater or lesser extent: a plain generalist that happens to be good at maths, a solver fine-tuned for natural-language maths, an autoformalizer tuned to turn English into Lean, or a prover tuned to emit Lean proofs.

The harness can wrap the model with varying degrees of sophistication: a bare chat completion, a native tool-call, a tool-integrated sandbox that runs the model’s code, a Lean REPL that compiles its proofs, or a full orchestrator that decomposes problems and dispatches sub-goals to other models.

Not all combinations make sense; the Lean REPL harness only makes sense with a lean-proving model.

2 Generalists, solvers and provers

Most frontier language models are generalists that happen to be decent at maths. They do alright on AIME-style problems as a side effect of being good at language and following instructions. This is the baseline family, and not a negligible one: it is the family we can actually rent per-token. A generalist carries nothing maths-specific in its harness: it runs in pure chain-of-thought — sampled and \boxed-voted — or drives tools through the native tool-call protocol.

There is also a narrower class: models hard-specialized on mathematical reasoning, which trade away chat fluency and/or agentic competence to get it. The specialist family has two branches, and they split on how the answer gets checked — the verification axis, though they differ on target too.

Firstly, there are Solvers. These are more recognizable to the layperson. They are models that answer mathematical problems in natural language. Some of them have additionally been tweaked to run symbolic maths tools to check their working, the so-called TIR sub-family. Qwen2.5-Math (A. Yang et al. 2024), OpenMath-Nemotron (Moshkov et al. 2025), Skywork-OR1, AceMath (Liu et al. 2025) are all examples of these. If we’re used to classic AI chat, we’re used to these, since they are just classic chat programs but with a fixation on calculus. Their output is a chain of reasoning ending in a boxed answer, like a proof in a textbook.

Secondly, there are Provers. The prover model emits primarily Lean 4 (Lean 4 being the ascendant proof assistant language). Examples include Goedel-Prover (Lin et al. 2025), OProver (Ma et al. 2026), DeepSeek-Prover (Ren et al. 2025). Their output is a candidate formal proof; the compiler decides whether it is one. The systems at the top of the formal leaderboards are agents that wrap such a model — or, increasingly, a plain generalist — in a search loop, which I unpick further down.

  • generalist and solver answers have no magical special verifier (although they can use tricks like voting and self-critique to improve their reliability).
  • a prover’s outputs are checked against the Lean compiler, a reliable indicator of formal proof validity.

For each, we might need specialty reasoning harnesses to benefit from the full powers of the algorithm.1.

Hold those three as fixed points. The rest of this page walks the axes between them, and the last section follows the frontier as it drifts from one corner to another.

3 Why are reasoning models even possible?

WarningTODO — stub

Why a next-token predictor should be able to reason at all — and how much of the apparent reasoning is search, memorization, or “true” generalization — deserves a proper treatment I have not written yet.

CoT, computational complexity, program synthesis etc.

Figure 2

In the absence of a proper write up, I am curious about indications that reasoning may be separable from knowledge, because we kind of gave up on that for a while there. VibeThinker-3B (Xu et al. 2026) (weights) reports frontier-level verifiable maths and code from a 3B base (tiny, runs-on-a-phone), and its authors claim this as a compression-coverage split: verifiable reasoning compresses into a small core, while open-domain knowledge and general competence are what actually consume all the parameters. I’m treating this as evidence that a maths specialist might actually be attainable/usable.

Further reading on how and why a sequence model reasons, not yet folded into the above:

4 Reasoning modes in solvers — CoT and TIR

The Solver models break down into two classes — the ones that augment their reasoning with tools, and the ones that just think reeeeealllly hard.

We refer to the latter as the CoT (chain-of-thought) models. In practice, everything uses CoT. But CoT solvers do nothing else; they are bareback, and as such, just totally normal LLMs but nerdier.

The former we call TIR (tool-integrated reasoning). The model writes mathematical code (e.g. in Python), runs it, reads the result back, and carries on, self-correcting any wordcel errors it made with the help of its shape-rotating software stack.

TIR is the harness axis showing up inside the solver family: it needs a special layer, because the model only emits tokens while an orchestrator runs the code in a sandbox and splices the results back, and the model has to be trained to emit that code in a template the harness recognizes — generic harnesses do not work.

5 Test-time scaling

Solvers spend extra inference compute to do better than a single shot, and the interesting variable is the grain at which they spend it. This section is the coarse end of that granularity axis — whole answers and the claims inside them; the fine end, single tactics and lemmas, is the provers, further down.

Notoriously, getting models to self-critique is unreasonably effective (Shao et al. 2025). This is AFAIK still the major trick driving inference-time scaling. Essentially, we can spend more inference compute per question — sampling many candidate solutions and combining them — and beat a much larger single-shot model.2

5.1 Self-consistency — maj@k

The simplest test-time-scaling trick (X. Wang et al. 2023): draw \(k\) independent samples at non-zero temperature3, return the modal value. This is called maj@k.

5.2 Verifier search and best-of-n

When a model ships a reward model alongside the policy (AceMath does (Liu et al. 2025); some specialists do not), we can do better than plain voting: sample \(n\) candidates, score each with the reward model, return the highest-scoring. Best-of-\(n\) against a calibrated verifier beats maj@\(n\) for the same compute budget — the verifier carries information about which sample is right, not just which answer is most common.

Easy to say, but I have no idea how we would get such a reward model in general, nor why I should trust it.

The most famous instance of this trick at scale is OpenAI’s o1 and successors, which reputedly use a learned verifier (Lightman et al. 2023) to do beam-search-like reasoning trace selection inside the model’s chain-of-thought; see s1 (Muennighoff et al. 2025) for an open replication, and the probabilistic inference scaling work (Puri et al. 2025) for the formal framing.

5.3 The model as its own verifier

Where do we get that reward model? One answer: the policy grades itself via self-critique. DeepSeekMath-V2 (Shao et al. 2025) is the large-model that does so; VibeThinker-3B’s Claim-Level Reliability Assessment (Xu et al. 2026) is the small-model one, which shows the move working at 3 billion parameters. Regardless of the model size, the process is similar: the model draws \(k\) trajectories, splits each into a few decision-relevant claims, has the model verify each claim, and weights a trajectory’s vote by how many of its claims hold — so an answer reached through a broken step counts for less than the same answer reached cleanly. That lifts AIME26 from 94.3 to 97.1, which doesn’t sound like a big improvement but corresponds to a nominally much larger model’s performance.

The granularity of these systems is interesting: confidence attaches to each claim, not just the final boxed answer, so the loop can say which steps to trust rather than only which answer won the vote.

The published scores on AIME / GPQA / MATH are usually made from some kind of diversified voting scheme, e.g.

  • maj@\(k\) with \(k \geq 8\),
  • temperature ≠ 0,
  • often with the vendor’s own reward model in the loop (NeMo-Skills’ GenSelect (Moshkov et al. 2025), AceMath’s reward),

A single sample at \(T = 0\) (i.e. classic chat style) will do much worse.

6 Proving theorems with a compiler in the loop

Now, back to the other family, the formal provers, where the verification axis reaches its hard end and the target is a Lean proof rather than a number. Each iteration a model emits a candidate Lean 4 proof, the Lean compiler either accepts it or returns an error trace, and we either keep the proof or feed the error back to the model and retry.

This system has a lot of unusual properties:

  • We have a trustworthy verifier even on problems outside the training data. Unlike a maj@k vote, the compiler can’t be fooled by a self-consistent wrong answer. Either the proof typechecks (we’re done) or it doesn’t (the candidate is wrong). In practice this means we might sample many more candidate options, because even unlikely ones are easy to check. This is called Pass@j where \(j>k\) typically.
  • The self-correction loop is part of the model. The prover doesn’t one-shot a proof. Rather, the prover loop samples many candidates.
  • We are very sensitive to the “software” stack. Lean incorporates its mathematical knowledge in Mathlib; that comprises the bank of mathematics the LLM can call on. Mathlib version-pinning is the canonical cause of mysterious failed proofs.

Harnesses accordingly incorporate lots of Lean-specialist plumbing — Mathlib pinning and a Lean REPL server (e.g. LeanDojo (K. Yang et al. 2023)) — with the tool choices laid out in practice.

Now, most mathematical knowledge is not in fact written up as Lean programs, so these systems tend to need additional support. Specifically, we depend upon Autoformalization — turning a natural-language statement into Lean so a prover can attack it. Specialist autoformalization models exist (e.g. Herald (Gao et al. 2025)) but are not well commoditized. Herald ships safetensors only and won’t magically load up in a local stack. Without good autoformalization, the Lean-prover pipeline starts from a hand-written statement and only the proof is automated — though the newest decomposition systems pull autoformalization inside the loop.

7 Prover architectures

The compiler-in-the-loop story above is a story about one model and one retry loop, and that was a fair picture into early 2025. It is not the picture now. This is the fine end of the granularity axis: the proof gets built a tactic or a lemma at a time, not voted on whole. The single fine-tuned model that emits a whole proof — DeepSeek-Prover, Goedel-Prover — has slid down the PutnamBench leaderboard (Tsoukalas et al. 2024) until it is the baseline everyone else beats. What tops the scores is now an agent.

Axiomatic AI’s “minimal agent” paper (Requena et al. 2026) is the cleanest map I have found, so I will borrow its taxonomy. Provers split first into tree-search methods, which grow a proof tactic-by-tactic and consult Lean after each step (AlphaProof was the flagship), and whole-proof methods, which emit a candidate proof and compile it in one go. Whole-proof has been winning, and it has its own internal ladder:

  • One-shot, sampled wide. Generate \(k\) independent proofs, accept any that compiles. This is the bare Pass@j idea, and on its own it never cracks the hard problems.
  • Iterative refinement. Generate, compile, feed the errors back, retry — the loop from the section above. The minimal-agent ablations rank this as the single largest driver of success, ahead of model choice, tools, or memory. Goedel-Prover-V2’s two-round self-correction is exactly this.
  • Recursive decomposition. Break the theorem into sub-lemmas, prove each, reassemble — where the frontier sits now.

The decomposition systems stopped being a single model some time ago. Apple’s Hilbert (Varambally et al. 2026) runs four parts — a reasoner (a general LLM that does the informal maths and writes a proof sketch full of sorry holes), a prover (a cheap Lean specialist that closes the atomic goals), a retriever (semantic search over Mathlib for relevant lemmas), and the compiler as verifier — and it recurses on any sub-goal it cannot close directly. The choice of reasoner turns out to matter more than the choice of prover. The maths-specialist is the commodity part; the generalist doing the decomposition is where the wins are.

ByteDance’s Seed-Prover (L. Chen et al. 2025; J. Chen et al. 2025) and Harmonic’s Aristotle (Achim et al. 2025) are the same shape with more machinery bolted on — Seed-Prover adds a natural-language prover that drafts the argument in prose, a sketch model that turns the prose into a Lean skeleton, and a dedicated geometry engine; Aristotle interleaves a Monte-Carlo proof search with an informal-reasoning pipeline and retrains itself on its own search traces mid-run. Pull one detail back to the solver/prover taxonomy: autoformalization has moved inside the loop. Above I treated turning English into Lean as a separate, flakier front-end. In these systems the natural-language proof is the plan: the informal argument is drafted first and formalized lemma-by-lemma, with formal failures feeding back to revise the informal sketch. Draft-sketch-prove is the spine now, not an add-on.

Which raises the obvious question — if refinement and decomposition are doing the work, how specialist does the model underneath really need to be? I take that up next.

8 The frontier is shifting

The action seems to be moving away away from specialist models and toward a generalist model checked finely inside a heavy harness.

On the open maths leaderboards the generalists increasingly out-score the narrow solvers. The minimal-agent paper (Requena et al. 2026) makes the point that an off-the-shelf frontier model — they use Claude Opus — wrapped in a feedback-plus-memory-plus-search loop, with no fine-tuning at all, lands competitively against the specialist provers, and gets cheaper and better for free every time the base model improves. “Prover” is turning into the name of a harness, and the thing inside it is drifting from a narrow Lean-emitter towards a generalist.

Answer checking seems to be tending to finer granularity and more formal verification. A model can now bring its own verifier — DeepSeekMath-V2 (Shao et al. 2025) writes natural-language proofs and grades the individual claims therein itself. On the formal side, autoformalization has moved inside the loop, so the brittle English-to-Lean front-end is no longer a separate stage but a step the system revises as it goes.

The weirdest systems close the loop entirely and train mid-run: Harmonic’s Aristotle (Achim et al. 2025) retrains itself on its own search traces while it works.

All of which means the specifics here have a short half-life. Nonetheless I attempted to sketch out an escalation path that seems feasible to implement at home.

9 Benchmarks and data

What is the MNIST/Imagenet of mathematical reasoning? These!

  • GSM8K (grade-school word problems with integer answers) is totally MNIST, which is to say, a smoke test.
  • MATH — people usually use the MATH-500 subset
  • AIME (Mathematics Olympiad recruitment problems with integer answers 0–999) is more like Imagenet, i.e. it has some actually hard problems in.

Grading these problems can be annoyingly fiddly. A MATH answer is LaTeX, where \frac{1}{2} and 0.5 are both correct, so it wants symbolic-equivalence checking rather than string comparison — NeMo-Skills, for example, ships both the datasets and that grader.

These data sets are widely regarded as being mined out. The stringent moving benchmark in MathArena (Dekoninck et al. 2026) mitigates this problem by managing data contamination somehow?

The benchmarks split along the target axis, the same way the systems do. The ones above are all answer benchmarks — they grade a boxed number, which is the solver’s and the generalist’s game. The provers have their own, and these grade a proof, which the compiler either accepts or does not. MiniF2F (Zheng, Han, and Polu 2022) is the old smoke test — a few hundred Olympiad statements in Lean — and the strong provers have all but saturated it. The live target is PutnamBench (Tsoukalas et al. 2024), 600-odd Putnam problems formalized in Lean, where the leaders now solve most of the set and the very top is approaching all of it; FormalMATH (Z. Yu et al. 2025) is the bigger, broader follow-up. A machine-checked leaderboard handles contamination from the opposite end to MathArena: it does not much matter if the statement leaked, because a self-consistent wrong proof still fails to compile, so the score measures proving rather than recall. The same machinery is starting to grade code — VERINA (Z. Ye et al. 2026) and VeriSoftBench (Xin et al. 2026) ask a model to emit a program, a Lean specification, and a machine-checked proof that the two agree, carrying the compiler-as-verifier idea on from theorems to provably-correct software.

10 But is it OK to automate mathematics?

I have complicated feelings about this. On one hand mathematics is upstream from much human progress, so I guess we should at least consider it. On the other hand, training in the punishing and mind-bending discipline of mathematics is IMO one of the highest achievements of the human race, not to mention one of the best ways we know of infusing our squishy meat brains, so losing the incentive to do it will do bad things to us.

I don’t know what to do about that.

11 Incoming

12 References

Achim, Best, Bietti, et al. 2025. Aristotle: IMO-Level Automated Theorem Proving.”
Akyürek, Damani, Qiu, et al. 2024. The Surprising Effectiveness of Test-Time Training for Abstract Reasoning.”
Bansal, Hosseini, Agarwal, et al. 2024. Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling.”
Biddle. 2023. Views of Australians Towards Science and AI.”
Bubeck, Chandrasekaran, Eldan, et al. 2023. Sparks of Artificial General Intelligence: Early Experiments with GPT-4.”
Chen, Jiangjie, Chen, Du, et al. 2025. Seed-Prover 1.5: Mastering Undergraduate-Level Theorem Proving via Learning from Experience.”
Chen, Luoxin, Gu, Huang, et al. 2025. Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving.”
Clark, Tafjord, and Richardson. 2020. Transformers as Soft Reasoners over Language.” In IJCAI 2020.
Dehghani, Gouws, Vinyals, et al. 2019. Universal Transformers.” In.
Dekoninck, Jovanović, Gehrunger, et al. 2026. Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs.”
Dziri, Lu, Sclar, et al. 2023. Faith and Fate: Limits of Transformers on Compositionality.”
Fu, Ou, Chen, et al. 2023. Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models’ Reasoning Performance.”
Gao, Wang, Jiang, et al. 2025. Herald: A Natural Language Annotated Lean 4 Dataset.”
Garcez, and Lamb. 2020. Neurosymbolic AI: The 3rd Wave.”
Hao, Sukhbaatar, Su, et al. 2024. Training Large Language Models to Reason in a Continuous Latent Space.”
Kiciman, Ness, Sharma, et al. 2024. Causal Reasoning and Large Language Models: Opening a New Frontier for Causality.” Transactions on Machine Learning Research.
Kwa, West, Becker, et al. 2025. Measuring AI Ability to Complete Long Tasks.”
Lample, and Charton. 2019. Deep Learning for Symbolic Mathematics.” arXiv:1912.01412 [Cs].
Lightman, Kosaraju, Burda, et al. 2023. Let’s Verify Step by Step.”
Lin, Tang, Lyu, et al. 2025. Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction.”
Liu, Chen, Shoeybi, et al. 2025. AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling.”
Lombrozo. 2024. Learning by Thinking in Natural and Artificial Minds.” Trends in Cognitive Sciences.
Mahowald, Ivanova, Blank, et al. 2024. Dissociating Language and Thought in Large Language Models.” Trends in Cognitive Sciences.
Ma, Ma, Guo, et al. 2026. OProver: A Unified Framework for Agentic Formal Theorem Proving.”
Mirzadeh, Alizadeh, Shahrokhi, et al. 2024. GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models.”
Moshkov, Hanley, Sorokin, et al. 2025. AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning Dataset.”
Muennighoff, Rush, Barak, et al. 2023. Scaling Data-Constrained Language Models.” Advances in Neural Information Processing Systems.
Muennighoff, Yang, Shi, et al. 2025. S1: Simple Test-Time Scaling.”
Puri, Sudalairaj, Xu, et al. 2025. Rollout Roulette: A Probabilistic Inference Approach to Inference-Time Scaling of LLMs Using Particle-Based Monte Carlo Methods.”
Radford, Wu, Child, et al. 2019. “Language Models Are Unsupervised Multitask Learners.”
Ren, Shao, Song, et al. 2025. DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition.”
Requena, Letson, Nowakowski, et al. 2026. A Minimal Agent for Automated Theorem Proving.”
Schuurmans, Dai, and Zanini. 2024. Autoregressive Large Language Models Are Computationally Universal.”
Shao, Luo, Lu, et al. 2025. DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning.”
Shao, Wang, Zhu, et al. 2024. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.”
Shojaee, Mirzadeh, Alizadeh, et al. 2025. The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity.”
Tsoukalas, Lee, Jennings, et al. 2024. PutnamBench: Evaluating Neural Theorem-Provers on the Putnam Mathematical Competition.”
Varambally, Voice, Sun, et al. 2026. Hilbert: Recursively Building Formal Proofs with Informal Reasoning.”
Venhoff, Arcuschin, Torr, et al. 2025. Base Models Know How to Reason, Thinking Models Learn When.”
Wang, Haiming, Unsal, Lin, et al. 2025. Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning.”
Wang, Xuezhi, Wei, Schuurmans, et al. 2023. Self-Consistency Improves Chain of Thought Reasoning in Language Models.”
Wei, Wang, Schuurmans, et al. 2023. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.”
Wu, Tan, Wang, et al. 2024. Beyond Language Models: Byte Models Are Digital World Simulators.”
Xin, Chen, Durrett, et al. 2026. VeriSoftBench: Repository-Scale Formal Verification Benchmarks for Lean.”
Xu, Liu, Wang, et al. 2026. VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models.”
Yang, Kaiyu, Swope, Gu, et al. 2023. LeanDojo: Theorem Proving with Retrieval-Augmented Language Models.”
Yang, An, Zhang, Hui, et al. 2024. Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement.”
Ye, Jiacheng, Gong, Chen, et al. 2024. Diffusion of Thoughts: Chain-of-Thought Reasoning in Diffusion Language Models.” In.
Ye, Zhe, Yan, He, et al. 2026. VERINA: Benchmarking Verifiable Code Generation.”
Yue, Chen, Lu, et al. 2025. Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Yu, Zhouliang, Peng, Ding, et al. 2025. FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models.”
Yu, Ping, Xu, Weston, et al. 2024. Distilling System 2 into System 1.”
Zhang, Backurs, Bubeck, et al. 2022. Unveiling Transformers with LEGO: A Synthetic Reasoning Task.”
Zheng, Han, and Polu. 2022. MiniF2F: A Cross-System Benchmark for Formal Olympiad-Level Mathematics.”

Footnotes

  1. Mac picks here↩︎

  2. How much compute, and where the GPUs come from, is a lever sized in the applied notes; here I only care about the grain.↩︎

  3. e.g. set \(T \approx 0.6\), \(k = 8\) to \(16\) and reseed↩︎