Neural nets that do symbolic mathematics, logic and other reasoning tasks

2019-12-08 — 2026-05-31

Wherein the Application of Language Models to Mathematical Reasoning Is Surveyed, With Test-Time Compute Scaling and Self-Critique Being Examined as Mechanisms by Which Small Models Are Made to Exceed Larger Baselines.

compsci
language
machine learning
meta learning
networks
neural nets
NLP
stringology

Somewhere between computational symbolic mathematics and automated proof assistants and the modern large language models are models that can solve mathematical problems more effectively than my feeble brain. General purpose LLMs can do OK at this, but it turns out that we can do much better if we specialise.

cf differentiable automata learning.

Watch this space.

Figure 1

1 Solvers and provers

Most frontier language models are generalists that happen to be decent at maths. They do alright on AIME-style problems as a side effect of being good at language and following instructions. There is a narrower class: models hard-specialized on mathematical reasoning, which trade away chat fluency and/or agentic competence to get it. I am interested in this family and the associated techniques.

There are two branches in the family:

Firstly, there are Solvers. These are more recognisable. They are models that answer mathematical problems in natural language. Some of them have additionally been tweaked to run symbolic maths tools to check their working, the so-called TIR sub-family. Qwen2.5-Math (A. Yang et al. 2024), OpenMath-Nemotron (Moshkov et al. 2025), Skywork-OR1, AceMath (Liu et al. 2025) are all examples of these. If you are used to classic AI chat you are used to these, since they are just classic chat programs but with a fixation on calculus. Their output is a chain of reasoning ending in a boxed answer, like a proof in a textbook.

Secondly, there are Provers. These models emit Lean 4 and very little else useful (Lean 4 being the ascendant proof assistant language). Examples include Goedel-Prover (Lin et al. 2025), OProver (Ma et al. 2026), DeepSeek-Prover (Ren et al. 2025). They output is a candidate formal proof; the compiler decides whether it is one.

Each family represents a different approach to verification:

For each, we might need specialty reasoning harnesses to benefit from the full powers of the algorithm.1. See Maths and proof models, applied for specifics and pragmatics.

2 Reasoning modes in Solvers — CoT and TIR

The Solver models break down into two classes— the ones that augment their reasoning with tools, and the ones that just think reeeeealllly hard.

We refer to the latter as the CoT (chain-of-thought) models. I think in practice, everything uses CoT. But CoT solvers do nothing else; they are bareback, and as such, just totally normal LLMs but nerdier.

The former we call TIR (tool-integrated reasoning). The model writes mathematical code (e.g. in Python), runs it, reads the result back, and carries on, self correcting any wordcel errors it made with the help of its shape-rotating software stack.

TIR requires a special harness layer: the model server just emits tokens, while an orchestrator drives a sandbox that runs the code and splices the results back. The model has to be trained to emit code in a specific template the harness recognizes. Generic harnesses do not work.

3 Test time scaling

Notoriously, getting models to self-critique is unreasonably effective (Shao et al. 2025). This is AFAIK still the major trick driving inference-time scaling. Essentially, we can spend more inference compute per question — sampling many candidate solutions and combining them — and beat a much larger single-shot model.

3.1 Self-consistency — maj@k

The simplest test-time-scaling trick (X. Wang et al. 2023):1 Draw \(k\) independent samples at non-zero temperature2, return the modal value. This is called maj\@k.

3.2 Verifier search and best-of-n

When a model ships a reward model alongside the policy (AceMath does (Liu et al. 2025); some specialists do not), we can do better than plain voting: sample \(n\) candidates, score each with the reward model, return the highest-scoring. Best-of-\(n\) against a calibrated verifier beats maj@\(n\) for the same compute budget — the verifier carries information about which sample is right, not just which answer is most common.

Easy to say, but I have no idea how we would get such a reward model in general.

The most famous instance of this trick at scale is OpenAI’s o1 and successors, which use a learned verifier (Lightman et al. 2023) to do beam-search-like reasoning trace selection inside the model’s chain-of-thought; see s1 (Muennighoff et al. 2025) for an open replication, and the probabilistic inference scaling work (Puri et al. 2025) for the formal framing.

The published score on AIME / GPQA / MATH figures are usually:

  • maj@\(k\) with \(k \geq 8\),
  • temperature ≠ 0,
  • often with the vendor’s own reward model in the loop (NeMo-Skills’ GenSelect (Moshkov et al. 2025), AceMath’s reward),
  • and reported on a fresh contamination-checked subset, not the raw benchmark.

A single sample at \(T = 0\) will do much worse (i.e. classic chat style) will do much worse.

4 Proving theorems with a compiler in the loop

Now, back to the other family, the formal provers. Each iteration a model emits a candidate Lean 4 proof, the Lean compiler either accepts it or returns an error trace, and we either keep the proof or feed the error back to the model and retry.

This system has a lot of unusual properties:

  • We have a trustworthy verifier even on problems outside the training data. Unlike a maj@k vote, the compiler can’t be fooled by a self-consistent wrong answer. Either the proof typechecks (we’re done) or it doesn’t (the candidate wrong). In practice this means we might sample many more candidate options, because even unlikely ones are easy to check. This is called Pass@j where \(j>k\) typically.
  • The self-correction loop is part of the model. The prover doesn’t one-shot a proof. Rather, the prover loop samples many candidates.
  • We are very sensitive to the “software” stack. Lean incorporates its mathematical knowledge in Mathlib and this comprises the bank of human knowledge the LLM can call on. Better keep it tidy.

Harnesses accordingly incorporate lots of Lean-specialist stuff (lean-repl, Pantograph, LeanDojo (K. Yang et al. 2023), Mathlib pinning) in practice.

Now, most mathematical knowlesdge is not in fact written up as lean programs, so these systems tend to need additional support. Specifically, we depend upon Autoformalization — turning a natural-language statement into Lean so a prover can attack it. Specialist autoformalization models exist (e.g. Herald (Gao et al. 2025)) but are not well commoditized. Herald ships safetensors only and won’t magically load up in a local stack. Without good autoformalization, the Lean-prover pipeline starts from a hand-written statement and only the proof is automated.

5 But is it OK to automate mathematics?

I have complicated feelings about this. On one hand mathematics is upstream from much human progress, so I guess we should at least consider it. On the other hand, training in the punishing and mind-bending discipline of mathematics is IMO one of the highest achievements of the human race, no to mention one of the best ways we know of infusing our squishy meat brains, so losing the incentive to do it will do bad things to us.

I don’t know what to do about that.

6 Using tiny models

7 Incoming

Akyürek, Damani, Qiu, et al. 2024. The Surprising Effectiveness of Test-Time Training for Abstract Reasoning.”
Bansal, Hosseini, Agarwal, et al. 2024. Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling.”
Biddle. 2023. Views of Australians Towards Science and AI.”
Bubeck, Chandrasekaran, Eldan, et al. 2023. Sparks of Artificial General Intelligence: Early Experiments with GPT-4.”
Clark, Tafjord, and Richardson. 2020. Transformers as Soft Reasoners over Language.” In IJCAI 2020.
Dehghani, Gouws, Vinyals, et al. 2019. Universal Transformers.” In.
Dziri, Lu, Sclar, et al. 2023. Faith and Fate: Limits of Transformers on Compositionality.”
Fu, Ou, Chen, et al. 2023. Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models’ Reasoning Performance.”
Gao, Wang, Jiang, et al. 2025. Herald: A Natural Language Annotated Lean 4 Dataset.”
Garcez, and Lamb. 2020. Neurosymbolic AI: The 3rd Wave.”
Hao, Sukhbaatar, Su, et al. 2024. Training Large Language Models to Reason in a Continuous Latent Space.”
Jolicoeur-Martineau. 2025. Less Is More: Recursive Reasoning with Tiny Networks.”
Kiciman, Ness, Sharma, et al. 2024. Causal Reasoning and Large Language Models: Opening a New Frontier for Causality.” Transactions on Machine Learning Research.
Kwa, West, Becker, et al. 2025. Measuring AI Ability to Complete Long Tasks.”
Lample, and Charton. 2019. Deep Learning for Symbolic Mathematics.” arXiv:1912.01412 [Cs].
Lightman, Kosaraju, Burda, et al. 2023. Let’s Verify Step by Step.”
Lin, Tang, Lyu, et al. 2025. Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction.”
Liu, Chen, Shoeybi, et al. 2025. AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling.”
Mahowald, Ivanova, Blank, et al. 2024. Dissociating Language and Thought in Large Language Models.” Trends in Cognitive Sciences.
Ma, Ma, Guo, et al. 2026. OProver: A Unified Framework for Agentic Formal Theorem Proving.”
Mirzadeh, Alizadeh, Shahrokhi, et al. 2024. GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models.”
Moshkov, Hanley, Sorokin, et al. 2025. AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning Dataset.”
Muennighoff, Rush, Barak, et al. 2023. Scaling Data-Constrained Language Models.” Advances in Neural Information Processing Systems.
Muennighoff, Yang, Shi, et al. 2025. S1: Simple Test-Time Scaling.”
Puri, Sudalairaj, Xu, et al. 2025. Rollout Roulette: A Probabilistic Inference Approach to Inference-Time Scaling of LLMs Using Particle-Based Monte Carlo Methods.”
Radford, Wu, Child, et al. 2019. “Language Models Are Unsupervised Multitask Learners.”
Ren, Shao, Song, et al. 2025. DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition.”
Schuurmans, Dai, and Zanini. 2024. Autoregressive Large Language Models Are Computationally Universal.”
Shao, Luo, Lu, et al. 2025. DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning.”
Shao, Wang, Zhu, et al. 2024. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.”
Shojaee, Mirzadeh, Alizadeh, et al. 2025. The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity.”
Venhoff, Arcuschin, Torr, et al. 2025. Base Models Know How to Reason, Thinking Models Learn When.”
Wang, Guan, Li, Sun, et al. 2025. Hierarchical Reasoning Model.”
Wang, Haiming, Unsal, Lin, et al. 2025. Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning.”
Wang, Xuezhi, Wei, Schuurmans, et al. 2023. Self-Consistency Improves Chain of Thought Reasoning in Language Models.”
Wei, Wang, Schuurmans, et al. 2023. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.”
Wu, Tan, Wang, et al. 2024. Beyond Language Models: Byte Models Are Digital World Simulators.”
Yang, Kaiyu, Swope, Gu, et al. 2023. LeanDojo: Theorem Proving with Retrieval-Augmented Language Models.”
Yang, An, Zhang, Hui, et al. 2024. Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement.”
Ye, Gong, Chen, et al. 2024. Diffusion of Thoughts: Chain-of-Thought Reasoning in Diffusion Language Models.” In.
Yue, Chen, Lu, et al. 2025. Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Yu, Xu, Weston, et al. 2024. Distilling System 2 into System 1.”
Zhang, Backurs, Bubeck, et al. 2022. Unveiling Transformers with LEGO: A Synthetic Reasoning Task.”

Footnotes

  1. Mac picks here↩︎

  2. e.g. set \(T \approx 0.6\), \(k = 8\) to \(16\) and reseed↩︎