Neural nets that do symbolic mathematics, logic and other reasoning tasks
2019-12-08 — 2026-05-31
Wherein the Application of Language Models to Mathematical Reasoning Is Surveyed, With Test-Time Compute Scaling and Self-Critique Being Examined as Mechanisms by Which Small Models Are Made to Exceed Larger Baselines.
Somewhere between computational symbolic mathematics and automated proof assistants and the modern large language models are models that can solve mathematical problems more effectively than my feeble brain. General purpose LLMs can do OK at this, but it turns out that we can do much better if we specialise.
cf differentiable automata learning.
Watch this space.
1 Solvers and provers
Most frontier language models are generalists that happen to be decent at maths. They do alright on AIME-style problems as a side effect of being good at language and following instructions. There is a narrower class: models hard-specialized on mathematical reasoning, which trade away chat fluency and/or agentic competence to get it. I am interested in this family and the associated techniques.
There are two branches in the family:
Firstly, there are Solvers. These are more recognisable. They are models that answer mathematical problems in natural language. Some of them have additionally been tweaked to run symbolic maths tools to check their working, the so-called TIR sub-family. Qwen2.5-Math (A. Yang et al. 2024), OpenMath-Nemotron (Moshkov et al. 2025), Skywork-OR1, AceMath (Liu et al. 2025) are all examples of these. If you are used to classic AI chat you are used to these, since they are just classic chat programs but with a fixation on calculus. Their output is a chain of reasoning ending in a boxed answer, like a proof in a textbook.
Secondly, there are Provers. These models emit Lean 4 and very little else useful (Lean 4 being the ascendant proof assistant language). Examples include Goedel-Prover (Lin et al. 2025), OProver (Ma et al. 2026), DeepSeek-Prover (Ren et al. 2025). They output is a candidate formal proof; the compiler decides whether it is one.
Each family represents a different approach to verification:
- solver answers are AFAICT checked against a known correct answer (so we need either to know or guess one),while
- prover outputs are checked against the Lean compiler (which is a reliable indicator of formal proof validity).
For each, we might need specialty reasoning harnesses to benefit from the full powers of the algorithm.1. See Maths and proof models, applied for specifics and pragmatics.
2 Reasoning modes in Solvers — CoT and TIR
The Solver models break down into two classes— the ones that augment their reasoning with tools, and the ones that just think reeeeealllly hard.
We refer to the latter as the CoT (chain-of-thought) models. I think in practice, everything uses CoT. But CoT solvers do nothing else; they are bareback, and as such, just totally normal LLMs but nerdier.
The former we call TIR (tool-integrated reasoning). The model writes mathematical code (e.g. in Python), runs it, reads the result back, and carries on, self correcting any wordcel errors it made with the help of its shape-rotating software stack.
TIR requires a special harness layer: the model server just emits tokens, while an orchestrator drives a sandbox that runs the code and splices the results back. The model has to be trained to emit code in a specific template the harness recognizes. Generic harnesses do not work.
3 Test time scaling
Notoriously, getting models to self-critique is unreasonably effective (Shao et al. 2025). This is AFAIK still the major trick driving inference-time scaling. Essentially, we can spend more inference compute per question — sampling many candidate solutions and combining them — and beat a much larger single-shot model.
3.1 Self-consistency — maj@k
The simplest test-time-scaling trick (X. Wang et al. 2023):1 Draw \(k\) independent samples at non-zero temperature2, return the modal value. This is called maj\@k.
3.2 Verifier search and best-of-n
When a model ships a reward model alongside the policy (AceMath does (Liu et al. 2025); some specialists do not), we can do better than plain voting: sample \(n\) candidates, score each with the reward model, return the highest-scoring. Best-of-\(n\) against a calibrated verifier beats maj@\(n\) for the same compute budget — the verifier carries information about which sample is right, not just which answer is most common.
Easy to say, but I have no idea how we would get such a reward model in general.
The most famous instance of this trick at scale is OpenAI’s o1 and successors, which use a learned verifier (Lightman et al. 2023) to do beam-search-like reasoning trace selection inside the model’s chain-of-thought; see s1 (Muennighoff et al. 2025) for an open replication, and the probabilistic inference scaling work (Puri et al. 2025) for the formal framing.
The published score on AIME / GPQA / MATH figures are usually:
- maj@\(k\) with \(k \geq 8\),
- temperature ≠ 0,
- often with the vendor’s own reward model in the loop (NeMo-Skills’ GenSelect (Moshkov et al. 2025), AceMath’s reward),
- and reported on a fresh contamination-checked subset, not the raw benchmark.
A single sample at \(T = 0\) will do much worse (i.e. classic chat style) will do much worse.
4 Proving theorems with a compiler in the loop
Now, back to the other family, the formal provers. Each iteration a model emits a candidate Lean 4 proof, the Lean compiler either accepts it or returns an error trace, and we either keep the proof or feed the error back to the model and retry.
This system has a lot of unusual properties:
- We have a trustworthy verifier even on problems outside the training data. Unlike a maj@k vote, the compiler can’t be fooled by a self-consistent wrong answer. Either the proof typechecks (we’re done) or it doesn’t (the candidate wrong). In practice this means we might sample many more candidate options, because even unlikely ones are easy to check. This is called Pass@j where \(j>k\) typically.
- The self-correction loop is part of the model. The prover doesn’t one-shot a proof. Rather, the prover loop samples many candidates.
- We are very sensitive to the “software” stack. Lean incorporates its mathematical knowledge in Mathlib and this comprises the bank of human knowledge the LLM can call on. Better keep it tidy.
Harnesses accordingly incorporate lots of Lean-specialist stuff (lean-repl, Pantograph, LeanDojo (K. Yang et al. 2023), Mathlib pinning) in practice.
Now, most mathematical knowlesdge is not in fact written up as lean programs, so these systems tend to need additional support. Specifically, we depend upon Autoformalization — turning a natural-language statement into Lean so a prover can attack it. Specialist autoformalization models exist (e.g. Herald (Gao et al. 2025)) but are not well commoditized. Herald ships safetensors only and won’t magically load up in a local stack. Without good autoformalization, the Lean-prover pipeline starts from a hand-written statement and only the proof is automated.
5 But is it OK to automate mathematics?
I have complicated feelings about this. On one hand mathematics is upstream from much human progress, so I guess we should at least consider it. On the other hand, training in the punishing and mind-bending discipline of mathematics is IMO one of the highest achievements of the human race, no to mention one of the best ways we know of infusing our squishy meat brains, so losing the incentive to do it will do bad things to us.
I don’t know what to do about that.
6 Using tiny models
- Jolicoeur-Martineau (2025)
- G. Wang et al. (2025)
- Probabilistic Inference Scaling / Red-Hat-AI-Innovation-Team/its_hub (Puri et al. 2025) – “Can we use classical probabilistic inference methods to scale small LMs to o1 level?”
7 Incoming
simplescaling/s1: s1: Simple test-time scaling (Muennighoff et al. 2025)
Test-time scaling is a promising new approach to language modelling that uses extra test-time compute to improve performance. Recently, OpenAI’s o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model’s thinking process or lengthening it by appending “Wait” multiple times to the model’s generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1-32B exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1-32B with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at this https URL
FranxYao/chain-of-thought-hub: Benchmarking large language models’ complex reasoning ability with chain-of-thought prompting/ Towards Complex Reasoning: the Polaris of Large Language Models (Fu et al. 2023)
sanderwood/bgpt: Beyond Language Models: Byte Models are Digital World Simulators (Wu et al. 2024)
New methods boost reasoning in small and large language models - Microsoft Research ## References
Footnotes
e.g. set \(T \approx 0.6\), \(k = 8\) to \(16\) and reseed↩︎
