Neural nets that do symbolic mathematics, logic and other reasoning tasks

2019-12-08 — 2026-06-27

Wherein the Taxonomy of Mathematical AI Systems Is Surveyed, With Special Attention to the Distinction Between Solvers That Return Boxed Answers and Provers Whose Outputs Are Checked by a Lean Compiler.

compsci

language

machine learning

meta learning

networks

neural nets

NLP

stringology

Much of this is slop rn, expanded from my dot points and research notes publishing because it is useful, but expect a lot of edits as I clean it up and add the applied notes.

Somewhere between computational symbolic mathematics, automated proof assistants and large language models may be found the models that solve mathematical problems. General-purpose LLMs do OK at this; we have spent the last few years working, with variable success, on specialisations that do better.

This is the theory page. There is an applied companion.

1 Why are reasoning models even possible?

Why a next-token predictor should be able to reason at all — and how much of the apparent reasoning is search, memorization, or “true” generalization — deserves a proper treatment I have not written yet.

CoT, computational complexity, program synthesis etc. 🚧TODO🚧

In the absence of a proper write-up, I am curious about indications that reasoning may be separable from knowledge, because we kind of gave up on that for a while there. VibeThinker-3B (Xu et al. 2026) (weights) reports frontier-level verifiable maths and code from a 3B base (tiny, runs-on-a-phone), and its authors claim this as a compression-coverage split: verifiable reasoning compresses into a small core, while open-domain knowledge and general competence are what actually consume all the parameters. I’m treating this as evidence that a maths specialist might actually be attainable and usable.

Further reading on how and why a sequence model reasons, not yet folded into the above:

2 Design variables

Mathematical AI systems vary along several axes.

The Target is what counts as an answer: a simple expression/equation, a whole natural-language proof, a formal Lean proof, or even a program bundled with a specification and a machine-checked proof that the two agree. Different target types entail different supporting architecture.

Granularity is the grain at which we generate and check. A whole answer in one shot; a vote over whole answers; confidence attached to each claim in a trajectory; a proof grown one tactic at a time; or a theorem split into lemmas that are proved separately and reassembled. Does the system decompose a big question into smaller ones? Iteratively?

Verification is what tells us an answer is right: it could be “looking credible” to a mathsy model, or a more sophisticated check: a majority vote from many mathsy model, a separately-trained reward model that scores candidates, or an external oracle like the Lean compiler.

The model can be specialised on maths to a greater or lesser extent: a plain generalist that happens to be good at maths, a solver fine-tuned for natural-language maths, an autoformalizer tuned to turn English into Lean, or a prover tuned to emit Lean proofs, and various hybrids of these.

The harness can wrap the model with varying degrees of sophistication and support various types of granularity: a bare chat completion, a native tool-call, a tool-integrated sandbox that runs the model’s code, a Lean REPL that compiles its proofs, or a full orchestrator that decomposes problems and dispatches sub-goals to other models.

Not all combinations make sense; the Lean REPL harness only makes sense with a Lean-proving model, for example.

3 Generalists, solvers and provers

Most frontier language models are generalists that happen to be decent at maths. They do alright on AIME-style problems as a side effect of being good at language and following instructions. We can also use generalist harness with such a model; i.e. use Claude in Claude Code or whatever.

There is also a narrower class: models hard-specialised on mathematical reasoning, which trade away chat fluency and/or agentic competence to get it. The specialist family has two branches, differing in verification, and concomittantly on target.

Firstly, there are Solvers. These are more recognisable to the layperson. They are models that answer mathematical problems in natural language. Some have additionally been tweaked to run symbolic maths tools to check their working, the so-called TIR sub-family. Qwen2.5-Math (A. Yang et al. 2024), OpenMath-Nemotron (Moshkov et al. 2025), Skywork-OR1, AceMath (Liu et al. 2025) are all examples of these. If we’re used to classic AI chat, we’re used to these, since they are just classic chat programs but with a fixation on calculus. Their output is a chain of reasoning ending in a boxed answer, like a proof in a textbook.

Secondly, there are Provers. The prover model emits primarily Lean 4 (Lean 4 being the ascendant proof assistant language). Examples include Goedel-Prover (Lin et al. 2025), OProver (Ma et al. 2026), DeepSeek-Prover (Ren et al. 2025). Their output is a candidate machine-digestible formal proof; the Lean compiler checks it for correctness. These systems have custom harnesses designed to work with the weirdness of formal proofs.

generalist and solver answers have no magical special verifier (although they can use tricks like voting and self-critique to improve their reliability).
a prover’s outputs are checked against the Lean compiler, a reliable indicator of formal proof validity.

For each, we might need specialty reasoning harnesses to benefit from the full powers of the algorithm.¹.

4 Reasoning modes in solvers — CoT and TIR

The solver models break down into two classes — the ones that augment their reasoning with tools, and the ones that just think reeeeealllly hard.

We refer to the latter as the CoT (chain-of-thought) models. In practice, everything uses a chain-of-thought somewhere. But “CoT solvers” do nothing else; they are bareback, and as such, just totally normal LLMs but in nerdy application.

The former we call TIR (tool-integrated reasoning). The model writes mathematical code (e.g. in Python), runs it, reads the result back, and carries on, self-correcting any wordcel errors it made with the help of its shape-rotating software stack. TIR has a coupled harness specialization required also: it needs a special tool execution contract— the model described calculations bytorchestrator runs the code in a sandbox and splices the results back, and the model by convention uses a special syntax — generic harnesses do not work.

5 Test-time scaling

Solvers may spend extra inference compute to do better than a single shot and they may do so at various granularities. For simplicity here we cover the coarsest version, where we do a “whole proof at once”.

Notoriously, getting models to self-critique their answers is surprisingly effective (Shao et al. 2025). This is AFAIK still the major trick driving inference-time scaling. Essentially, we can spend more inference compute per question — sampling many candidate solutions and combining them — and beat a much larger single-shot model.²

5.1 Self-consistency — `maj@k`

The simplest test-time-scaling trick (X. Wang et al. 2023): draw \(k\) independent samples at non-zero temperature³, return the modal value. This is called maj@k.

5.2 Verifier search and best-of-n

When a model ships a reward model alongside the policy (AceMath does (Liu et al. 2025); some specialists do not), we can do better than plain voting: sample \(n\) candidates, score each with the reward model, return the highest-scoring. Best-of-\(n\) against a calibrated verifier beats maj@\(n\) for the same compute budget — the verifier carries information about which sample is right, not just which answer is most common.

Easy to say, but I have no idea how we would get such a reward model in general, nor why I should trust it.

The most famous instance of this trick at scale is OpenAI’s o1 and successors, which reputedly use a learned verifier (Lightman et al. 2023) to do beam-search-like reasoning trace selection inside the model’s chain-of-thought; see s1 (Muennighoff et al. 2025) for an open replication, and the probabilistic inference scaling work (Puri et al. 2025) for the formal framing.

5.3 The model as its own verifier

Where do we get that reward model? One answer: the policy grades itself via self-critique. DeepSeekMath-V2 (Shao et al. 2025) is the large-model version that does so; VibeThinker-3B’s Claim-Level Reliability Assessment (Xu et al. 2026) is the small-model one, which shows the move working at 3 billion parameters. Regardless of model size, the process is similar: the model draws \(k\) trajectories, splits each into a few decision-relevant claims, has the model verify each claim, and weights a trajectory’s vote by how many of its claims hold — so an answer reached through a broken step counts for less than the same answer reached cleanly. That lifts AIME26 from 94.3 to 97.1, which doesn’t sound like a big improvement but corresponds to a nominally much larger model’s performance.

The granularity of these systems is interesting: confidence attaches to each claim, not just the final boxed answer, so the loop can say which steps to trust rather than only which answer won the vote.

The published scores on AIME / GPQA / MATH are usually made from some kind of diversified voting scheme, e.g.

maj@\(k\) with \(k \geq 8\),
temperature ≠ 0,
often with the vendor’s own reward model in the loop (NeMo-Skills’ GenSelect (Moshkov et al. 2025), AceMath’s reward),

A single sample at \(T = 0\) (i.e. classic chat style) will do much worse.

6 Proving theorems with a compiler in the loop

Now, back to the other family, the formal provers, where the verification axis reaches its hard end and the target is a Lean proof rather than a number. Each iteration a model emits a candidate Lean 4 proof, the Lean compiler either accepts it or returns an error trace, and we either keep the proof or feed the error back to the model and retry.

This system has a lot of unusual properties:

We have a trustworthy verifier even on problems outside the training data. Unlike a maj@k vote, the compiler can’t be fooled by a self-consistent wrong answer. Either the proof typechecks (we’re done) or it doesn’t (the candidate is wrong). In practice this means we might sample many more candidate options, because even unlikely ones are easy to check. This is called Pass@j where \(j>k\) typically.
The self-correction loop is part of the model. The prover doesn’t one-shot a proof. Rather, the prover loop samples many candidates.
We are very sensitive to the “software” stack. Lean incorporates its mathematical knowledge in Mathlib; that comprises the bank of mathematics the LLM can call on. Mathlib version-pinning is the canonical cause of mysterious failed proofs.

Harnesses accordingly incorporate lots of Lean-specialist plumbing — Mathlib pinning and a Lean REPL server (e.g. LeanDojo (K. Yang et al. 2023)) — with the tool choices laid out in practice.

Now, most mathematical knowledge is not in fact written up as Lean programs, so these systems tend to need additional support. Specifically, we depend upon Autoformalization — turning a natural-language statement into Lean so a prover can attack it. Specialist autoformalization models exist (e.g. Herald (Gao et al. 2025)) but are not well commoditised. Herald ships safetensors only and won’t magically load up in a local stack. Without good autoformalization, the Lean-prover pipeline starts from a hand-written statement and only the proof is automated — though the newest decomposition systems pull autoformalization inside the loop.

7 Prover architectures

The compiler-in-the-loop story above is a story about one model and one retry loop, and that was a fair picture into early 2025. It is not the picture now. This is the fine end of the granularity axis: the proof gets built a tactic or a lemma at a time, not voted on whole. The single fine-tuned model that emits a whole proof — DeepSeek-Prover, Goedel-Prover — has slid down the PutnamBench leaderboard (Tsoukalas et al. 2024) until it is the baseline everyone else beats. What tops the scores is now an agent.

Axiomatic AI’s “minimal agent” paper (Requena et al. 2026) is the cleanest map I have found, so I will borrow its taxonomy. Provers split first into tree-search methods, which grow a proof tactic-by-tactic and consult Lean after each step (AlphaProof was the flagship), and whole-proof methods, which emit a candidate proof and compile it in one go. Whole-proof has been winning, and it has its own internal ladder:

One-shot, sampled wide. Generate \(k\) independent proofs, accept any that compiles. This is the bare Pass@j idea, and on its own it never cracks the hard problems.
Iterative refinement. Generate, compile, feed the errors back, retry — the loop from the section above. The minimal-agent ablations rank this as the single largest driver of success, ahead of model choice, tools, or memory. Goedel-Prover-V2’s two-round self-correction is exactly this.
Recursive decomposition. Break the theorem into sub-lemmas, prove each, reassemble — where the frontier sits now.

The decomposition systems stopped being a single model some time ago. Apple’s Hilbert (Varambally et al. 2026) runs four parts — a reasoner (a general LLM that does the informal maths and writes a proof sketch full of sorry holes), a prover (a cheap Lean specialist that closes the atomic goals), a retriever (semantic search over Mathlib for relevant lemmas), and the compiler as verifier — and it recurses on any sub-goal it cannot close directly. The choice of reasoner turns out to matter more than the choice of prover. The maths specialist is the commodity part; the generalist doing the decomposition is where the wins are.

ByteDance’s Seed-Prover (L. Chen et al. 2025; J. Chen et al. 2025) and Harmonic’s Aristotle (Achim et al. 2025) are the same shape with more machinery bolted on — Seed-Prover adds a natural-language prover that drafts the argument in prose, a sketch model that turns the prose into a Lean skeleton, and a dedicated geometry engine; Aristotle interleaves a Monte Carlo proof search with an informal-reasoning pipeline and retrains itself on its own search traces mid-run. Pull one detail back to the solver/prover taxonomy: autoformalization has moved inside the loop. Above I treated turning English into Lean as a separate, flakier front-end. In these systems the natural-language proof is the plan: the informal argument is drafted first and formalised lemma-by-lemma, with formal failures feeding back to revise the informal sketch. Draft-sketch-prove is the spine now, not an add-on.

Which raises the obvious question — if refinement and decomposition are doing the work, how specialist does the model underneath really need to be? I take that up next.

8 The frontier is shifting

The action seems to be moving away from specialist models and toward a generalist model checked finely inside a heavy harness.

On the open maths leaderboards the generalists increasingly out-score the narrow solvers. Requena et al. (2026) (the minimal agent paper) makes the point that an off-the-shelf frontier model — they use Claude Opus — wrapped in a feedback-plus-memory-plus-search loop, with no fine-tuning at all, lands competitively against the specialist provers. Also, it gets cheaper and better for free every time the base model improves. “Prover” is turning into a style of harness that encourages the use of formal verification as part of a loop.

Answer checking seems to be tending to finer granularity and more formal verification. A model can now bring a verifier with it, or even be that verifier — DeepSeekMath-V2 (Shao et al. 2025) writes natural-language proofs and grades the individual claims therein itself. On the formal side, autoformalization has moved inside the loop, so the brittle English-to-Lean front-end is no longer a separate stage but a step the system revises as it goes.

The weirdest systems close the loop entirely and train mid-run: Harmonic’s Aristotle (Achim et al. 2025) retrains itself on its own search traces while it works.

All of which means the specifics here have a short half-life. Nonetheless I attempted to sketch out an escalation path that seems feasible to implement at home.

9 Benchmarks and data

What is the MNIST/ImageNet of mathematical reasoning? These!

GSM8K (grade-school word problems with integer answers) is totally MNIST, which is to say, a smoke test.
MATH — people usually use the MATH-500 subset
AIME (Mathematics Olympiad recruitment problems with integer answers 0–999) is more like ImageNet, i.e. it has some actually hard problems in.

Grading these problems can be annoyingly fiddly. A MATH answer is LaTeX, where \frac{1}{2} and 0.5 are both correct, so it wants symbolic-equivalence checking rather than string comparison — NeMo-Skills, for example, ships both the datasets and that grader.

These data sets are widely regarded as being mined out. The stringent moving benchmark in MathArena (Dekoninck et al. 2026) mitigates this problem by managing data contamination somehow?

The benchmarks split along the target axis, the same way the systems do. The ones above are all answer benchmarks — they grade a boxed number, which is the solver’s and the generalist’s game. The provers have their own, and these grade a proof, which the compiler either accepts or does not. MiniF2F (Zheng, Han, and Polu 2022) is the old smoke test — a few hundred Olympiad statements in Lean — and the strong provers have all but saturated it. The live target is PutnamBench (Tsoukalas et al. 2024), 600-odd Putnam problems formalised in Lean, where the leaders now solve most of the set and the very top is approaching all of it; FormalMATH (Z. Yu et al. 2025) is the bigger, broader follow-up. A machine-checked leaderboard handles contamination from the opposite end to MathArena: it does not much matter if the statement leaked, because a self-consistent wrong proof still fails to compile, so the score measures proving rather than recall. The same machinery is starting to grade code — VERINA (Z. Ye et al. 2026) and VeriSoftBench (Xin et al. 2026) ask a model to emit a program, a Lean specification, and a machine-checked proof that the two agree, carrying the compiler-as-verifier idea on from theorems to provably correct software.

10 But is it OK to automate mathematics?

I have complicated feelings about this. On one hand mathematics is upstream from much human progress, so I guess we should at least consider it. On the other hand, training in the punishing and mind-bending discipline of mathematics is IMO one of the highest achievements of the human race, not to mention one of the best ways we know of infusing our squishy meat brains, so losing the incentive to do it will do bad things to us.

I don’t know what to do about that.

11 Incoming

12 References

Achim, Best, Bietti, et al. 2025. “Aristotle: IMO-Level Automated Theorem Proving.”

Akyürek, Damani, Qiu, et al. 2024. “The Surprising Effectiveness of Test-Time Training for Abstract Reasoning.”

An, Cai, Cao, et al. 2025. “AMO-Bench: Large Language Models Still Struggle in High School Math Competitions.”

Bansal, Hosseini, Agarwal, et al. 2024. “Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling.”

Biddle. 2023. “Views of Australians Towards Science and AI.”

Bubeck, Chandrasekaran, Eldan, et al. 2023. “Sparks of Artificial General Intelligence: Early Experiments with GPT-4.”

Chen, Jiangjie, Chen, Du, et al. 2025. “Seed-Prover 1.5: Mastering Undergraduate-Level Theorem Proving via Learning from Experience.”

Chen, Luoxin, Gu, Huang, et al. 2025. “Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving.”

Clark, Tafjord, and Richardson. 2020. “Transformers as Soft Reasoners over Language.” In IJCAI 2020.

Dehghani, Gouws, Vinyals, et al. 2019. “Universal Transformers.” In.

Dekoninck, Jovanović, Gehrunger, et al. 2026. “Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs.”

Dziri, Lu, Sclar, et al. 2023. “Faith and Fate: Limits of Transformers on Compositionality.”

Fu, Ou, Chen, et al. 2023. “Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models’ Reasoning Performance.”

Gao, Wang, Jiang, et al. 2025. “Herald: A Natural Language Annotated Lean 4 Dataset.”

Garcez, and Lamb. 2020. “Neurosymbolic AI: The 3rd Wave.”

Hao, Sukhbaatar, Su, et al. 2024. “Training Large Language Models to Reason in a Continuous Latent Space.”

Kiciman, Ness, Sharma, et al. 2024. “Causal Reasoning and Large Language Models: Opening a New Frontier for Causality.” Transactions on Machine Learning Research.

Kwa, West, Becker, et al. 2025. “Measuring AI Ability to Complete Long Tasks.”

Lample, and Charton. 2019. “Deep Learning for Symbolic Mathematics.” arXiv:1912.01412 [Cs].

Lightman, Kosaraju, Burda, et al. 2023. “Let’s Verify Step by Step.”

Lin, Tang, Lyu, et al. 2025. “Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction.”

Liu, Chen, Shoeybi, et al. 2025. “AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling.”

Lombrozo. 2024. “Learning by Thinking in Natural and Artificial Minds.” Trends in Cognitive Sciences.

Mahowald, Ivanova, Blank, et al. 2024. “Dissociating Language and Thought in Large Language Models.” Trends in Cognitive Sciences.

Ma, Ma, Guo, et al. 2026. “OProver: A Unified Framework for Agentic Formal Theorem Proving.”

Mirzadeh, Alizadeh, Shahrokhi, et al. 2024. “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models.”

Moshkov, Hanley, Sorokin, et al. 2025. “AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning Dataset.”

Muennighoff, Rush, Barak, et al. 2023. “Scaling Data-Constrained Language Models.” Advances in Neural Information Processing Systems.

Muennighoff, Yang, Shi, et al. 2025. “S1: Simple Test-Time Scaling.”

Puri, Sudalairaj, Xu, et al. 2025. “Rollout Roulette: A Probabilistic Inference Approach to Inference-Time Scaling of LLMs Using Particle-Based Monte Carlo Methods.”

Radford, Wu, Child, et al. 2019. “Language Models Are Unsupervised Multitask Learners.”

Raschka. 2026. Build a Reasoning Model (from Scratch).

Ren, Shao, Song, et al. 2025. “DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition.”

Requena, Letson, Nowakowski, et al. 2026. “A Minimal Agent for Automated Theorem Proving.”

Sassaman, Patterson, Bratus, et al. 2013. “Security Applications of Formal Language Theory.” IEEE Systems Journal.

Schuurmans, Dai, and Zanini. 2024. “Autoregressive Large Language Models Are Computationally Universal.”

Shao, Luo, Lu, et al. 2025. “DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning.”

Shao, Wang, Zhu, et al. 2024. “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.”

Shojaee, Mirzadeh, Alizadeh, et al. 2025. “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity.”

Tsoukalas, Lee, Jennings, et al. 2024. “PutnamBench: Evaluating Neural Theorem-Provers on the Putnam Mathematical Competition.”

Varambally, Voice, Sun, et al. 2026. “Hilbert: Recursively Building Formal Proofs with Informal Reasoning.”

Venhoff, Arcuschin, Torr, et al. 2025. “Base Models Know How to Reason, Thinking Models Learn When.”

Wang, Haiming, Unsal, Lin, et al. 2025. “Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning.”

Wang, Xuezhi, Wei, Schuurmans, et al. 2023. “Self-Consistency Improves Chain of Thought Reasoning in Language Models.”

Wei, Wang, Schuurmans, et al. 2023. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.”

Wu, Tan, Wang, et al. 2024. “Beyond Language Models: Byte Models Are Digital World Simulators.”

Xin, Chen, Durrett, et al. 2026. “VeriSoftBench: Repository-Scale Formal Verification Benchmarks for Lean.”

Xu, Liu, Wang, et al. 2026. “VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models.”

Yang, Kaiyu, Swope, Gu, et al. 2023. “LeanDojo: Theorem Proving with Retrieval-Augmented Language Models.”

Yang, An, Zhang, Hui, et al. 2024. “Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement.”

Ye, Jiacheng, Gong, Chen, et al. 2024. “Diffusion of Thoughts: Chain-of-Thought Reasoning in Diffusion Language Models.” In.

Ye, Zhe, Yan, He, et al. 2026. “VERINA: Benchmarking Verifiable Code Generation.”

Yue, Chen, Lu, et al. 2025. “Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?”

Yu, Zhouliang, Peng, Ding, et al. 2025. “FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models.”

Yu, Ping, Xu, Weston, et al. 2024. “Distilling System 2 into System 1.”

Zhang, Backurs, Bubeck, et al. 2022. “Unveiling Transformers with LEGO: A Synthetic Reasoning Task.”

Zheng, Han, and Polu. 2022. “MiniF2F: A Cross-System Benchmark for Formal Olympiad-Level Mathematics.”

Footnotes

Mac picks here ↩︎
Practicalities explored in the applied notes.↩︎
e.g. set \(T \approx 0.6\), \(k = 8\) to \(16\) and reseed↩︎

1 Why are reasoning models even possible?

2 Design variables

3 Generalists, solvers and provers

4 Reasoning modes in solvers — CoT and TIR

5 Test-time scaling

5.1 Self-consistency — maj@k

5.2 Verifier search and best-of-n

5.3 The model as its own verifier

6 Proving theorems with a compiler in the loop

7 Prover architectures

8 The frontier is shifting

9 Benchmarks and data

10 But is it OK to automate mathematics?

11 Incoming

12 References

Footnotes

5.1 Self-consistency — `maj@k`