Neural nets that do symbolic mathematics, logic and other reasoning tasks
2019-12-08 — 2026-06-27
Wherein Solver and Prover Models Are Distinguished, the Lean Compiler Is Observed to Function as an Objective Verifier, and the Gradual Consolidation of Both Families Into Agent Harnesses Is Noted.
Somewhere between computational symbolic mathematics, automated proof assistants and the large language models may be found the models that solve mathematical problems. General-purpose LLMs do OK at this; we have spent the last few years working, with variable success, on specializations which do better.
This is the theory page. There is an applied-math companion.
1 Axes of variation
Mathematical AI systems vary along, as far as I can tell, five axes.
The Target is what counts as an answer: a boxed number, a natural-language proof, a formal Lean proof, or even a program bundled with a specification and a machine-checked proof that the two agree. The choice of the type of target conditions much of the rest of the stack.
Granularity is the grain at which we generate and check. A whole answer in one shot; a vote over whole answers; confidence attached to each claim in a trajectory; a proof grown one tactic at a time; or a theorem split into lemmas that are proved separately and reassembled. Does the system decompose a big question into smaller ones iteratively?
Verification is what tells us an answer is right: It could be “looking credible” or a more sophisticated check: a majority vote, a separately-trained reward model that scores candidates, the model grading its own work, or an external oracle like the Lean compiler.
The model can be specialized on maths to a greater or lesser extent: a plain generalist that happens to be good at maths, a solver fine-tuned for natural-language maths, an autoformalizer tuned to turn English into Lean, or a prover tuned to emit Lean proofs.
The harness can wrap the model with varying degrees of sophistication: a bare chat completion, a native tool-call, a tool-integrated sandbox that runs the model’s code, a Lean REPL that compiles its proofs, or a full orchestrator that decomposes problems and dispatches sub-goals to other models.
Not all combinations make sense; the Lean REPL harness only makes sense with a lean-proving model.
2 Generalists, solvers and provers
Most frontier language models are generalists that happen to be decent at maths. They do alright on AIME-style problems as a side effect of being good at language and following instructions. This is the baseline family, and not a negligible one: it is the family we can actually rent per-token. A generalist carries nothing maths-specific in its harness: it runs in pure chain-of-thought — sampled and \boxed-voted — or drives tools through the native tool-call protocol.
There is also a narrower class: models hard-specialized on mathematical reasoning, which trade away chat fluency and/or agentic competence to get it. The specialist family has two branches, and they split on how the answer gets checked — the verification axis, though they differ on target too.
Firstly, there are Solvers. These are more recognizable to the layperson. They are models that answer mathematical problems in natural language. Some of them have additionally been tweaked to run symbolic maths tools to check their working, the so-called TIR sub-family. Qwen2.5-Math (A. Yang et al. 2024), OpenMath-Nemotron (Moshkov et al. 2025), Skywork-OR1, AceMath (Liu et al. 2025) are all examples of these. If we’re used to classic AI chat, we’re used to these, since they are just classic chat programs but with a fixation on calculus. Their output is a chain of reasoning ending in a boxed answer, like a proof in a textbook.
Secondly, there are Provers. The prover model emits primarily Lean 4 (Lean 4 being the ascendant proof assistant language). Examples include Goedel-Prover (Lin et al. 2025), OProver (Ma et al. 2026), DeepSeek-Prover (Ren et al. 2025). Their output is a candidate formal proof; the compiler decides whether it is one. The systems at the top of the formal leaderboards are agents that wrap such a model — or, increasingly, a plain generalist — in a search loop, which I unpick further down.
- generalist and solver answers have no magical special verifier (although they can use tricks like voting and self-critique to improve their reliability).
- a prover’s outputs are checked against the Lean compiler, a reliable indicator of formal proof validity.
For each, we might need specialty reasoning harnesses to benefit from the full powers of the algorithm.1.
Hold those three as fixed points. The rest of this page walks the axes between them, and the last section follows the frontier as it drifts from one corner to another.
3 Why are reasoning models even possible?
Why a next-token predictor should be able to reason at all — and how much of the apparent reasoning is search, memorization, or “true” generalization — deserves a proper treatment I have not written yet.
CoT, computational complexity, program synthesis etc.
In the absence of a proper write up, I am curious about indications that reasoning may be separable from knowledge, because we kind of gave up on that for a while there. VibeThinker-3B (Xu et al. 2026) (weights) reports frontier-level verifiable maths and code from a 3B base (tiny, runs-on-a-phone), and its authors claim this as a compression-coverage split: verifiable reasoning compresses into a small core, while open-domain knowledge and general competence are what actually consume all the parameters. I’m treating this as evidence that a maths specialist might actually be attainable/usable.
Further reading on how and why a sequence model reasons, not yet folded into the above:
4 Reasoning modes in solvers — CoT and TIR
The Solver models break down into two classes — the ones that augment their reasoning with tools, and the ones that just think reeeeealllly hard.
We refer to the latter as the CoT (chain-of-thought) models. In practice, everything uses CoT. But CoT solvers do nothing else; they are bareback, and as such, just totally normal LLMs but nerdier.
The former we call TIR (tool-integrated reasoning). The model writes mathematical code (e.g. in Python), runs it, reads the result back, and carries on, self-correcting any wordcel errors it made with the help of its shape-rotating software stack.
TIR is the harness axis showing up inside the solver family: it needs a special layer, because the model only emits tokens while an orchestrator runs the code in a sandbox and splices the results back, and the model has to be trained to emit that code in a template the harness recognizes — generic harnesses do not work.
5 Test-time scaling
Solvers spend extra inference compute to do better than a single shot, and the interesting variable is the grain at which they spend it. This section is the coarse end of that granularity axis — whole answers and the claims inside them; the fine end, single tactics and lemmas, is the provers, further down.
Notoriously, getting models to self-critique is unreasonably effective (Shao et al. 2025). This is AFAIK still the major trick driving inference-time scaling. Essentially, we can spend more inference compute per question — sampling many candidate solutions and combining them — and beat a much larger single-shot model.2
5.1 Self-consistency — maj@k
The simplest test-time-scaling trick (X. Wang et al. 2023): draw \(k\) independent samples at non-zero temperature3, return the modal value. This is called maj@k.
5.2 Verifier search and best-of-n
When a model ships a reward model alongside the policy (AceMath does (Liu et al. 2025); some specialists do not), we can do better than plain voting: sample \(n\) candidates, score each with the reward model, return the highest-scoring. Best-of-\(n\) against a calibrated verifier beats maj@\(n\) for the same compute budget — the verifier carries information about which sample is right, not just which answer is most common.
Easy to say, but I have no idea how we would get such a reward model in general, nor why I should trust it.
The most famous instance of this trick at scale is OpenAI’s o1 and successors, which reputedly use a learned verifier (Lightman et al. 2023) to do beam-search-like reasoning trace selection inside the model’s chain-of-thought; see s1 (Muennighoff et al. 2025) for an open replication, and the probabilistic inference scaling work (Puri et al. 2025) for the formal framing.
5.3 The model as its own verifier
Where do we get that reward model? One answer: the policy grades itself via self-critique. DeepSeekMath-V2 (Shao et al. 2025) is the large-model that does so; VibeThinker-3B’s Claim-Level Reliability Assessment (Xu et al. 2026) is the small-model one, which shows the move working at 3 billion parameters. Regardless of the model size, the process is similar: the model draws \(k\) trajectories, splits each into a few decision-relevant claims, has the model verify each claim, and weights a trajectory’s vote by how many of its claims hold — so an answer reached through a broken step counts for less than the same answer reached cleanly. That lifts AIME26 from 94.3 to 97.1, which doesn’t sound like a big improvement but corresponds to a nominally much larger model’s performance.
The granularity of these systems is interesting: confidence attaches to each claim, not just the final boxed answer, so the loop can say which steps to trust rather than only which answer won the vote.
The published scores on AIME / GPQA / MATH are usually made from some kind of diversified voting scheme, e.g.
- maj@\(k\) with \(k \geq 8\),
- temperature ≠ 0,
- often with the vendor’s own reward model in the loop (NeMo-Skills’ GenSelect (Moshkov et al. 2025), AceMath’s reward),
A single sample at \(T = 0\) (i.e. classic chat style) will do much worse.
6 Proving theorems with a compiler in the loop
Now, back to the other family, the formal provers, where the verification axis reaches its hard end and the target is a Lean proof rather than a number. Each iteration a model emits a candidate Lean 4 proof, the Lean compiler either accepts it or returns an error trace, and we either keep the proof or feed the error back to the model and retry.
This system has a lot of unusual properties:
- We have a trustworthy verifier even on problems outside the training data. Unlike a maj@k vote, the compiler can’t be fooled by a self-consistent wrong answer. Either the proof typechecks (we’re done) or it doesn’t (the candidate is wrong). In practice this means we might sample many more candidate options, because even unlikely ones are easy to check. This is called Pass@j where \(j>k\) typically.
- The self-correction loop is part of the model. The prover doesn’t one-shot a proof. Rather, the prover loop samples many candidates.
- We are very sensitive to the “software” stack. Lean incorporates its mathematical knowledge in Mathlib; that comprises the bank of mathematics the LLM can call on. Mathlib version-pinning is the canonical cause of mysterious failed proofs.
Harnesses accordingly incorporate lots of Lean-specialist plumbing — Mathlib pinning and a Lean REPL server (e.g. LeanDojo (K. Yang et al. 2023)) — with the tool choices laid out in practice.
Now, most mathematical knowledge is not in fact written up as Lean programs, so these systems tend to need additional support. Specifically, we depend upon Autoformalization — turning a natural-language statement into Lean so a prover can attack it. Specialist autoformalization models exist (e.g. Herald (Gao et al. 2025)) but are not well commoditized. Herald ships safetensors only and won’t magically load up in a local stack. Without good autoformalization, the Lean-prover pipeline starts from a hand-written statement and only the proof is automated — though the newest decomposition systems pull autoformalization inside the loop.
7 Prover architectures
The compiler-in-the-loop story above is a story about one model and one retry loop, and that was a fair picture into early 2025. It is not the picture now. This is the fine end of the granularity axis: the proof gets built a tactic or a lemma at a time, not voted on whole. The single fine-tuned model that emits a whole proof — DeepSeek-Prover, Goedel-Prover — has slid down the PutnamBench leaderboard (Tsoukalas et al. 2024) until it is the baseline everyone else beats. What tops the scores is now an agent.
Axiomatic AI’s “minimal agent” paper (Requena et al. 2026) is the cleanest map I have found, so I will borrow its taxonomy. Provers split first into tree-search methods, which grow a proof tactic-by-tactic and consult Lean after each step (AlphaProof was the flagship), and whole-proof methods, which emit a candidate proof and compile it in one go. Whole-proof has been winning, and it has its own internal ladder:
- One-shot, sampled wide. Generate \(k\) independent proofs, accept any that compiles. This is the bare Pass@j idea, and on its own it never cracks the hard problems.
- Iterative refinement. Generate, compile, feed the errors back, retry — the loop from the section above. The minimal-agent ablations rank this as the single largest driver of success, ahead of model choice, tools, or memory. Goedel-Prover-V2’s two-round self-correction is exactly this.
- Recursive decomposition. Break the theorem into sub-lemmas, prove each, reassemble — where the frontier sits now.
The decomposition systems stopped being a single model some time ago. Apple’s Hilbert (Varambally et al. 2026) runs four parts — a reasoner (a general LLM that does the informal maths and writes a proof sketch full of sorry holes), a prover (a cheap Lean specialist that closes the atomic goals), a retriever (semantic search over Mathlib for relevant lemmas), and the compiler as verifier — and it recurses on any sub-goal it cannot close directly. The choice of reasoner turns out to matter more than the choice of prover. The maths-specialist is the commodity part; the generalist doing the decomposition is where the wins are.
ByteDance’s Seed-Prover (L. Chen et al. 2025; J. Chen et al. 2025) and Harmonic’s Aristotle (Achim et al. 2025) are the same shape with more machinery bolted on — Seed-Prover adds a natural-language prover that drafts the argument in prose, a sketch model that turns the prose into a Lean skeleton, and a dedicated geometry engine; Aristotle interleaves a Monte-Carlo proof search with an informal-reasoning pipeline and retrains itself on its own search traces mid-run. Pull one detail back to the solver/prover taxonomy: autoformalization has moved inside the loop. Above I treated turning English into Lean as a separate, flakier front-end. In these systems the natural-language proof is the plan: the informal argument is drafted first and formalized lemma-by-lemma, with formal failures feeding back to revise the informal sketch. Draft-sketch-prove is the spine now, not an add-on.
Which raises the obvious question — if refinement and decomposition are doing the work, how specialist does the model underneath really need to be? I take that up next.
8 The frontier is shifting
The action seems to be moving away away from specialist models and toward a generalist model checked finely inside a heavy harness.
On the open maths leaderboards the generalists increasingly out-score the narrow solvers. The minimal-agent paper (Requena et al. 2026) makes the point that an off-the-shelf frontier model — they use Claude Opus — wrapped in a feedback-plus-memory-plus-search loop, with no fine-tuning at all, lands competitively against the specialist provers, and gets cheaper and better for free every time the base model improves. “Prover” is turning into the name of a harness, and the thing inside it is drifting from a narrow Lean-emitter towards a generalist.
Answer checking seems to be tending to finer granularity and more formal verification. A model can now bring its own verifier — DeepSeekMath-V2 (Shao et al. 2025) writes natural-language proofs and grades the individual claims therein itself. On the formal side, autoformalization has moved inside the loop, so the brittle English-to-Lean front-end is no longer a separate stage but a step the system revises as it goes.
The weirdest systems close the loop entirely and train mid-run: Harmonic’s Aristotle (Achim et al. 2025) retrains itself on its own search traces while it works.
All of which means the specifics here have a short half-life. Nonetheless I attempted to sketch out an escalation path that seems feasible to implement at home.
9 Benchmarks and data
What is the MNIST/Imagenet of mathematical reasoning? These!
- GSM8K (grade-school word problems with integer answers) is totally MNIST, which is to say, a smoke test.
- MATH — people usually use the MATH-500 subset
- AIME (Mathematics Olympiad recruitment problems with integer answers 0–999) is more like Imagenet, i.e. it has some actually hard problems in.
Grading these problems can be annoyingly fiddly. A MATH answer is LaTeX, where \frac{1}{2} and 0.5 are both correct, so it wants symbolic-equivalence checking rather than string comparison — NeMo-Skills, for example, ships both the datasets and that grader.
These data sets are widely regarded as being mined out. The stringent moving benchmark in MathArena (Dekoninck et al. 2026) mitigates this problem by managing data contamination somehow?
The benchmarks split along the target axis, the same way the systems do. The ones above are all answer benchmarks — they grade a boxed number, which is the solver’s and the generalist’s game. The provers have their own, and these grade a proof, which the compiler either accepts or does not. MiniF2F (Zheng, Han, and Polu 2022) is the old smoke test — a few hundred Olympiad statements in Lean — and the strong provers have all but saturated it. The live target is PutnamBench (Tsoukalas et al. 2024), 600-odd Putnam problems formalized in Lean, where the leaders now solve most of the set and the very top is approaching all of it; FormalMATH (Z. Yu et al. 2025) is the bigger, broader follow-up. A machine-checked leaderboard handles contamination from the opposite end to MathArena: it does not much matter if the statement leaked, because a self-consistent wrong proof still fails to compile, so the score measures proving rather than recall. The same machinery is starting to grade code — VERINA (Z. Ye et al. 2026) and VeriSoftBench (Xin et al. 2026) ask a model to emit a program, a Lean specification, and a machine-checked proof that the two agree, carrying the compiler-as-verifier idea on from theorems to provably-correct software.
10 But is it OK to automate mathematics?
I have complicated feelings about this. On one hand mathematics is upstream from much human progress, so I guess we should at least consider it. On the other hand, training in the punishing and mind-bending discipline of mathematics is IMO one of the highest achievements of the human race, not to mention one of the best ways we know of infusing our squishy meat brains, so losing the incentive to do it will do bad things to us.
I don’t know what to do about that.
11 Incoming
- Improving Mathematical Reasoning with Process Supervision
- sanderwood/bgpt: Beyond Language Models: Byte Models are Digital World Simulators (Wu et al. 2024)
- videlalvaro/llm-arithmetic-internals: Mechanistic experiments on how LLMs represent and compute arithmetic internally, with strict no-parser controls, reproducible audits, and an interactive article.
12 References
Footnotes
How much compute, and where the GPUs come from, is a lever sized in the applied notes; here I only care about the grain.↩︎
e.g. set \(T \approx 0.6\), \(k = 8\) to \(16\) and reseed↩︎

