Neural nets that do symbolic mathematics, logic and other reasoning tasks
2019-12-08 — 2026-06-08
Wherein Solver and Prover Models Are Distinguished, the Lean Compiler Is Observed to Function as an Objective Verifier, and the Gradual Consolidation of Both Families Into Agent Harnesses Is Noted.
Somewhere between computational symbolic mathematics, automated proof assistants and the large language models sit models that solve mathematical problems more effectively than my feeble brain. General-purpose LLMs do OK at this; We have spent the last few years working, with variable success, on specialisations which do better.
This is the theory page: how the families work and what verifies them. For practice — endpoints, loops, sandboxes — see Maths and proof models, applied.
1 Generalists, solvers and provers
Most frontier language models are generalists that happen to be decent at maths. They do alright on AIME-style problems as a side effect of being good at language and following instructions. This is the baseline family, and not a negligible one — it is the family we can actually rent per-token, and on the open maths leaderboards the generalists increasingly out-score the narrow specialists, a moving target I take up in the applied notes. A generalist carries nothing maths-specific in its harness: it runs in pure chain-of-thought — sampled and \boxed-voted — or drives tools through the native tool-call protocol.
There is also a narrower class: models hard-specialised on mathematical reasoning, which trade away chat fluency and/or agentic competence to get it. The specialist family has two branches, and they split on how the answer gets checked.
Firstly, there are Solvers. These are more recognisable. They are models that answer mathematical problems in natural language. Some of them have additionally been tweaked to run symbolic maths tools to check their working, the so-called TIR sub-family. Qwen2.5-Math (A. Yang et al. 2024), OpenMath-Nemotron (Moshkov et al. 2025), Skywork-OR1, AceMath (Liu et al. 2025) are all examples of these. If we’re used to classic AI chat, we’re used to these, since they are just classic chat programs but with a fixation on calculus. Their output is a chain of reasoning ending in a boxed answer, like a proof in a textbook.
Secondly, there are Provers. The prover model emits Lean 4 and very little else useful (Lean 4 being the ascendant proof assistant language). Examples include Goedel-Prover (Lin et al. 2025), OProver (Ma et al. 2026), DeepSeek-Prover (Ren et al. 2025). Their output is a candidate formal proof; the compiler decides whether it is one. One caveat I keep having to flag to myself: “a model that emits Lean and nothing else” describes the 2024-vintage prover, and that model has become a part rather than the whole. The systems at the top of the formal leaderboards are agents that wrap such a model — or, increasingly, a plain generalist — in a search loop, which I unpick further down.
Spelled out, the check differs across the three families:
- a generalist has no verifier of its own — we check it the way we check a solver, against a known correct answer (or a maj@k vote), which is exactly why a model that checks its own work is notable;
- a solver’s answers are AFAICT checked against a known correct answer (so we need either to know or guess one), while
- a prover’s outputs are checked against the Lean compiler (a reliable indicator of formal proof validity).
For each, we might need specialty reasoning harnesses to benefit from the full powers of the algorithm.1. See Maths and proof models, applied for specifics and pragmatics.
These three are families of models, and treating them as a closed taxonomy is already a 2024 move. Two seams are opening. A model can carry its own verifier: DeepSeekMath-V2 (Shao et al. 2025) writes natural-language proofs and grades them itself, sitting between the solver that leans on a known answer and the prover that leans on the compiler, beholden to neither. And prover is sliding from a kind of model to a variant of the classic agentic system — the “search-and-decompose loop”.
2 Reasoning modes in Solvers — CoT and TIR
The Solver models break down into two classes — the ones that augment their reasoning with tools, and the ones that just think reeeeealllly hard.
We refer to the latter as the CoT (chain-of-thought) models. In practice, everything uses CoT. But CoT solvers do nothing else; they are bareback, and as such, just totally normal LLMs but nerdier.
The former we call TIR (tool-integrated reasoning). The model writes mathematical code (e.g. in Python), runs it, reads the result back, and carries on, self-correcting any wordcel errors it made with the help of its shape-rotating software stack.
TIR requires a special harness layer: the model server just emits tokens, while an orchestrator drives a sandbox that runs the code and splices the results back. The model has to be trained to emit code in a specific template the harness recognises. Generic harnesses do not work.
3 Test time scaling
Notoriously, getting models to self-critique is unreasonably effective (Shao et al. 2025). This is AFAIK still the major trick driving inference-time scaling. Essentially, we can spend more inference compute per question — sampling many candidate solutions and combining them — and beat a much larger single-shot model.
3.1 Self-consistency — maj@k
The simplest test-time-scaling trick (X. Wang et al. 2023): draw \(k\) independent samples at non-zero temperature2, return the modal value. This is called maj@k.
3.2 Verifier search and best-of-n
When a model ships a reward model alongside the policy (AceMath does (Liu et al. 2025); some specialists do not), we can do better than plain voting: sample \(n\) candidates, score each with the reward model, return the highest-scoring. Best-of-\(n\) against a calibrated verifier beats maj@\(n\) for the same compute budget — the verifier carries information about which sample is right, not just which answer is most common.
Easy to say, but I have no idea how we would get such a reward model in general.
The most famous instance of this trick at scale is OpenAI’s o1 and successors, which use a learned verifier (Lightman et al. 2023) to do beam-search-like reasoning trace selection inside the model’s chain-of-thought; see s1 (Muennighoff et al. 2025) for an open replication, and the probabilistic inference scaling work (Puri et al. 2025) for the formal framing.
The published scores on AIME / GPQA / MATH are usually:
- maj@\(k\) with \(k \geq 8\),
- temperature ≠ 0,
- often with the vendor’s own reward model in the loop (NeMo-Skills’ GenSelect (Moshkov et al. 2025), AceMath’s reward),
- and reported on a fresh contamination-checked subset, not the raw benchmark.
A single sample at \(T = 0\) (i.e. classic chat style) will do much worse.
4 Proving theorems with a compiler in the loop
Now, back to the other family, the formal provers. Each iteration a model emits a candidate Lean 4 proof, the Lean compiler either accepts it or returns an error trace, and we either keep the proof or feed the error back to the model and retry.
This system has a lot of unusual properties:
- We have a trustworthy verifier even on problems outside the training data. Unlike a maj@k vote, the compiler can’t be fooled by a self-consistent wrong answer. Either the proof typechecks (we’re done) or it doesn’t (the candidate is wrong). In practice this means we might sample many more candidate options, because even unlikely ones are easy to check. This is called Pass@j where \(j>k\) typically.
- The self-correction loop is part of the model. The prover doesn’t one-shot a proof. Rather, the prover loop samples many candidates.
- We are very sensitive to the “software” stack. Lean incorporates its mathematical knowledge in Mathlib; that comprises the bank of mathematics the LLM can call on. Mathlib version-pinning is the canonical cause of mysterious failed proofs.
Harnesses accordingly incorporate lots of Lean-specialist stuff (lean-repl, Pantograph, LeanDojo (K. Yang et al. 2023), Mathlib pinning) in practice.
Now, most mathematical knowledge is not in fact written up as Lean programs, so these systems tend to need additional support. Specifically, we depend upon Autoformalization — turning a natural-language statement into Lean so a prover can attack it. Specialist autoformalization models exist (e.g. Herald (Gao et al. 2025)) but are not well commoditised. Herald ships safetensors only and won’t magically load up in a local stack. Without good autoformalization, the Lean-prover pipeline starts from a hand-written statement and only the proof is automated — though the newest decomposition systems pull autoformalization inside the loop.
5 Prover architectures
The compiler-in-the-loop story above is a story about one model and one retry loop, and that was a fair picture into early 2025. It is not the picture now. The single fine-tuned model that emits a whole proof — DeepSeek-Prover, Goedel-Prover — has slid down the PutnamBench leaderboard (Tsoukalas et al. 2024) until it is the baseline everyone else beats. What sits on top now is an agent.
Axiomatic AI’s “minimal agent” paper (Requena et al. 2026) is the cleanest map I have found, so I will borrow its carving. Provers split first into tree-search methods, which grow a proof tactic-by-tactic and consult Lean after each step (AlphaProof was the flagship), and whole-proof methods, which emit a candidate proof and compile it in one go. Whole-proof has been winning, and it has its own internal ladder:
- One-shot, sampled wide. Generate \(k\) independent proofs, accept any that compiles. This is the bare Pass@j idea, and on its own it never cracks the hard problems.
- Iterative refinement. Generate, compile, feed the errors back, retry — the loop from the section above. The minimal-agent ablations rank this as the single largest driver of success, ahead of model choice, tools, or memory. Goedel-Prover-V2’s two-round self-correction is exactly this.
- Recursive decomposition. Break the theorem into sub-lemmas, prove each, reassemble — where the frontier sits now.
The decomposition systems are the ones worth knowing, because they stopped being a single model some time ago. Apple’s Hilbert (Varambally et al. 2026) is the legible example: it runs four parts — a reasoner (a general LLM that does the informal maths and writes a proof sketch full of sorry holes), a prover (a cheap Lean specialist that closes the atomic goals), a retriever (semantic search over Mathlib for relevant lemmas), and the compiler as verifier — and it recurses on any sub-goal it cannot close directly. The finding that stuck with me: the choice of reasoner matters more than the choice of prover. The maths-specialist is the commodity part; the generalist doing the decomposition is where the wins are.
ByteDance’s Seed-Prover (L. Chen et al. 2025; J. Chen et al. 2025) and Harmonic’s Aristotle (Achim et al. 2025) are the same shape with more machinery bolted on — Seed-Prover adds a natural-language prover that drafts the argument in prose, a sketch model that turns the prose into a Lean skeleton, and a dedicated geometry engine; Aristotle interleaves a Monte-Carlo proof search with an informal-reasoning pipeline and retrains itself on its own search traces mid-run. The detail worth pulling back to the solver/prover taxonomy is that autoformalization has moved inside the loop. Above I treated turning English into Lean as a separate, flakier front-end. In these systems the natural-language proof is the plan: the informal argument is drafted first and formalised lemma-by-lemma, with formal failures feeding back to revise the informal sketch. Draft-sketch-prove is the spine now, not an add-on.
The consequence is the same softening I note in the applied notes, seen from the prover side. If refinement and decomposition are doing the work, the model underneath need not be a fine-tuned specialist. The minimal-agent paper makes the point bluntly: an off-the-shelf frontier model — they use Claude Opus — wrapped in a feedback-plus-memory-plus-search loop, with no fine-tuning at all, lands competitively against the specialist provers and gets cheaper and better for free every time the base model improves. “Prover” is turning into the name of a harness, and the thing inside the harness is drifting from a narrow Lean-emitter towards a generalist.
Two notes on the current top of the table, as of mid-2026. First, it is closed and pricey. The leaders — Aleph from Logical Intelligence, Seed-Prover 1.5, Aristotle — are API-or-pilot-only, and they report dollars per problem (Aleph’s runs sit around $23–$68 each) rather than a single forward pass. The cheap, open, self-hostable provers I can actually run are the ones near the bottom. Both Seed-Prover and Aristotle cleared gold-medal-equivalent on IMO 2025, five of six problems in machine-checked Lean.3
Second, mind the marketing on the language-free claims. Logical Intelligence frames Aleph as a step beyond LLMs toward an energy-based reasoner that “does not think in words” — but by their own telling the benchmarked Aleph is built on top of an LLM, and the language-free model is a separate, unreleased, as-yet-unbenchmarked thing. So the SOTA PutnamBench number is, for now, another LLM agent rather than evidence that the token-free future has landed. I would like it to be true; I am not yet betting on it.
6 Benchmarks, data
What is the MNIST/Imagenet of mathematical reasoning? These!
- GSM8K (grade-school word problems with integer answers) is totally MNIST, which is to say, a smoke test.
- MATH — people usually use the MATH-500 subset
- AIME (Mathematics Olympiad recruitment problems with integer answers 0–999) is more like Imagenet, i.e. it has some actually hard problems in.
Grading these problems can be annoyingly fiddly. A MATH answer is LaTeX, where \frac{1}{2} and 0.5 are both correct, so it wants symbolic-equivalence checking rather than string comparison — NeMo-Skills, for example, ships both the datasets and that grader.
These data sets are widely regarded as being mined out. The stringent moving benchmark in MathArena (Dekoninck et al. 2026) mitigates this problem by managing data contamination somehow?
Those are all answer benchmarks — they grade a boxed number, which is the solver’s and the generalist’s game. The provers have their own, and these grade a proof, which the compiler either accepts or does not. MiniF2F (Zheng, Han, and Polu 2022) is the old smoke test — a few hundred Olympiad statements in Lean — and the strong provers have all but saturated it. The live target is PutnamBench (Tsoukalas et al. 2024), 600-odd Putnam problems formalised in Lean, where the leaders now solve most of the set and the very top is approaching all of it; FormalMATH (Z. Yu et al. 2025) is the bigger, broader follow-up. A machine-checked leaderboard handles contamination from the other end than MathArena does: it does not much matter if the statement leaked, because a self-consistent wrong proof still fails to compile, so the score measures proving rather than recall. The same machinery is starting to grade code — VERINA (Z. Ye et al. 2026) and VeriSoftBench (Xin et al. 2026) ask a model to emit a program, a Lean specification, and a machine-checked proof that the two agree, which is the through-line from proving theorems to provably-correct software.
7 Tiny recursive reasoners
A separate small-model thread keeps surfacing: HRM (G. Wang et al. 2025) and its stripped-down successor TRM (Jolicoeur-Martineau 2025) crack ARC-AGI, Sudoku and mazes with 7–27M-parameter nets that recurse on a fixed-size latent instead of emitting a long chain of thought. HRM’s headline was a brain-inspired two-timescale hierarchy, but ARC Prize’s ablation pinned the gains on the outer refine-and-retry loop plus heavy per-task augmentation rather than the hierarchy — and TRM then beat HRM with a plain two-layer net, which rather makes the case. This makes them siblings to the ARC-AGI Without Pretraining paper (Liao and Gu 2025).
I file these as a cousin of test-time scaling, not part of the maths story: again, more compute per problem beats a bigger single shot, but these are transductive puzzle-solvers trained per task, not language models that do calculus or emit Lean. I’d watch whether recurse-on-a-latent crosses over to language; it is not yet something I would use in a maths workflow.
8 But is it OK to automate mathematics?
I have complicated feelings about this. On one hand mathematics is upstream from much human progress, so I guess we should at least consider it. On the other hand, training in the punishing and mind-bending discipline of mathematics is IMO one of the highest achievements of the human race, not to mention one of the best ways we know of infusing our squishy meat brains, so losing the incentive to do it will do bad things to us.
I don’t know what to do about that.
9 Incoming
simplescaling/s1: s1: Simple test-time scaling (Muennighoff et al. 2025)
Test-time scaling is a promising new approach to language modelling that uses extra test-time compute to improve performance. Recently, OpenAI’s o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model’s thinking process or lengthening it by appending “Wait” multiple times to the model’s generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1-32B exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1-32B with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at this https URL
FranxYao/chain-of-thought-hub: Benchmarking large language models’ complex reasoning ability with chain-of-thought prompting/ Towards Complex Reasoning: the Polaris of Large Language Models (Fu et al. 2023)
sanderwood/bgpt: Beyond Language Models: Byte Models are Digital World Simulators (Wu et al. 2024)
New methods boost reasoning in small and large language models - Microsoft Research
