Maths and proof models, applied

Delegating the queen of the sciences to the machines

2026-05-31 — 2026-07-02

Wherein a Personal Mathematical Reasoning Workflow Is Assembled, Proceeding From Generalist Agents and Oracle Tools Through Fan-Out Sampling, Answer-Equivalence Pitfalls, and Lean Proof Assistants.

compsci

diy

economics

machine learning

neural nets

NLP

premature optimization

slop

Mathematical reasoning LLMs are a thing, and as a working mathematician I want them in a workflow that lets me reason faster over new domains, check my work, and augment my feeble meat brain.

The marketplace is thin, as we have come to expect for mathematical tools. It does not exactly heave with polished, friendly off-the-shelf competitors, as it does for munging office memos, or for agentic coding. So — we build our own, I guess? Or rather, I have been begrudgingly building my own. This post is distilled from my attempt to speed-run the last two years of mathematical reasoning LLMs for my own benefit.

I already run consumer-grade agent harnesses for code — a model and a sandbox and a loop that splices the two. What do I need to add to get sweet theorems?

It depends what I want of course; SOTA systems have a relatively sophisticated execution loop. We get a long way by simply adding equation support into a vanilla agent harness front-end. So we’ll speed-run the bits that matter, hot-swapping parts as we go, roughly tracking the field’s own progress.

This is mostly slop right now but publishing because it is useful.

1 The path through

Start from the coding harness I already run I can add various types of sophistication.

Easy mode — use an agent with a generalist backend but be smart about how I use it (basic agentic)
Lend it an oracle — keep that same agent, but give it a maths model behind a solve() tool for the sub-problems it should not attempt. (agentic + oracle)
Custom loop — explicitly make the loop into a generate-run-verify system, using potentially specialist models throughout; the bare model under it is a plain solver or, for theorems, a plain prover. (our own self-correcting loop)
Memory — augment iterations with memory of what failed across prior attempts so the loop stops repeating itself.
Fan-out is a trick to run multiple agents to sample solutions and cross-check, trading compute for accuracy.
Lean formalises the problem into machine-checkable form.
The frontier — all of the above. A self-grading, decomposing, re-formalizing orchestrator. (recursive decomposition)

That sequence is also roughly the field’s own trajectory.

2 Easy mode

An ordinary agent harness I already use for code, pointed at a good generalist reasoning model, gets me a long way toward powerful mathematics. Claude, the latest DeepSeek, or a big Qwen3-Thinking are all respectable mathematicians; tell them to “do the maths”, then “check the working with Python/SymPy”.

This works pretty great with e.g. Claude Desktop, Claude Code etc. However local clients are less capable at rendering equations, so we may want to build Open WebUI to make equations easier on human eyes.

3 Lend an agent an oracle

Now we can use our existing agent as an orchestrator, but let it delegate mathematics (or the checking of mathematics) to a specialist maths model. Practically, this means we give it a solve() oracle tool to solve knotty maths problems, splicing the result back into its own reasoning.

Python harnesses like Smolagents or Qwen-Agent can simply call a Python function. For other agents (Claude-like, Goose, Codex, Gemini CLI, pi, Hermes) we expose the solve() as a CLI tool. Both of these can be documented by an agentskills.io SKILL.md in skill-supporting agents. Skill-blind agents can use an MCP server.

I whipped up mathx as an example of doing all three, in a relatively swappable manner.

We might wish to fan-out in such an oracle. Done synchronously, this can be slow and hit time limits on tool use, or just get boring. Asynchrony and parallelism depend on the orchestrating agent. When we own the loop a fan-out is just asyncio.gather, and there is no timeout that we do not ourselves set. For tool-use agents, the async-handle pattern is probably what we want — we set up a handle-and-poll pair of ordinary tools: submit_solve() returns a job id at once and check_solve(id) returns status-or-result at once, so each call is fast. Some harnesses also dispatch background sub-agents of their own and MCP supports Tasks at the protocol level; that might work too.

4 Build the loop

If easy mode plus an oracle aren’t enough, we can write a custom agentic loop to specialize the generic agent loop for mathematical needs.

Recall that agentic harnesses have three parts:

a model server that emits tokens,
an executor/verifier that acts on the model’s output, and
an orchestrator that only ever calls run(input) -> result, splices the result back, and repeats.

That looks eerily similar to the coding harness, albeit a little narrower. A coding executor “does work”: it runs some program/edits some file/prints something. Results of previous actions are used by the model to pick its next move, and they are supposed to produce some artefaact (some piece of software) as a side effect. A maths executor per default only ever checks a calculation. The output of this check is the goal, rather than the side-effects.

Verification loops have some specific nice properties; for example, as side-effect-free functions, they let us spend compute to buy correctness by making multiple attempts. Generate a hundred candidate answers, pick the most-likely-correct ones and we have fan-out. If we accumulate a record of prior mistakes and dead-ends, we have memory.

In a classic generalist agent, the tools invoked could be just about anything. By convention, solver only executes Python code in a sandbox (so-called tool-integrated reasoning) — to assist the model in arithmetic it cannot “do in its head”. A prover agent only executes the Lean compiler which rigorously checks candidate proofs.

Two maths-harness gotchas:

(some) Specialist models speak their own tool-call dialect, specifically the TIR type. These produce fenced code templates, not the native tools/tool_calls protocol — so the orchestrator needs matching parsing.
The model writes the code we run unobserved, sandboxing is wise. The only maths-specific detail is that the specialists assume a full scientific Python (SymPy, NumPy), so build accordingly.

5 Memory

A loop that forgets repeats itself. The orchestrator splices each compiler result back within a run, but the next independent attempt starts cold — re-importing the same non-existent lemma, mis-naming the same tactic the last twenty attempts did, spending the iteration budget on mistakes already made. We can carry a record of what failed into the next attempt.

Requena et al. (2026) rank this the second-largest lever in the whole architecture — behind iterative refinement, ahead of the search tools — and it is cheap to add. The crude version pastes the whole transcript of past attempts into the prompt, but that bloats the context, makes every call slower and dearer, and eventually overflows the window. The version that works is a short self-managed note: after each attempt the proposer rewrites a running scratchpad of what it learned — which lemmas don’t exist, which tactics misfired — pruned so it never grows without bound. On their numbers that buys around 7% more theorems at ~20% lower cost, with half the run-to-run variance.

With memory in place — and the library-search tool alongside it — the prover stops re-making the same name and syntax errors, freeing budget for proof search rather than for rediscovering that Nat.foo was renamed three Mathlib versions ago.

6 Fan-out

Fan-out improves mathematical accuracy by evaluating many candidate solutions. For solvers that means maj@k, i.e. majority voting. For provers it’s Pass@k — keep any candidate that compiles. Either way we can run a lot of these fuckers at once, since the tokens stream over the network in parallel. Which sample to trust is its own question — a plain majority vote, or richer schemes that rank candidates with a verifier or let the model grade its own — but the lever is the same.

The cheapest thing to fan out is a model that runs no code at all: a pure chain-of-thought reasoner emits no Python and no Lean, so its executor is a no-op and maj@k collapses to plain sample-and-vote, no sandbox per chain.

If I don’t want to own even this much of the loop, OpenRouter’s Fusion Router is a pre-rolled, hosted version of fan-out-and-judge: point a request at openrouter/fusion, a panel of models answers in parallel, and a judge model compares (rather than merges) their answers into structured consensus/contradiction/blind-spot analysis that the outer model writes its final answer from. It runs about 4–5× the cost of a single completion for the default three-model panel. This is the soft end of everything above — the “let the model grade its own” richer scheme, not the majority vote — and the judge is a generalist reading prose, not a CAS or a Lean compiler, so for maths it inherits all the answer-equivalence trouble below rather than escaping it. It is useful where the verdict was always going to be soft, like the pure chain-of-thought end; no substitute for the compiler when I want a verdict that can’t be talked round.

6.1 Where the compute comes from

Build and debug the whole loop on a laptop first — a local IPython kernel for the solver, lean-repl on disk for the prover — and never pay for a GPU until you fan out.¹ Modal is a tidy backend to fan out to.

Tokens come from a model behind a scale-to-zero, OpenAI-compatible vllm serve endpoint — a few dozen lines mount a Volume so the weights download once and the GPU drops to zero between requests, so we pay only for the seconds a request actually runs.

Fan-out is then just a .map — one chain per call, each with its own sandbox, all sharing the one endpoint:

# maj@k as a fan-out
answers = solve_one.map([problem] * 32, range(32))   # 32 chains, 32 sandboxes, one endpoint
winner  = Counter(a for a in answers if a).most_common(1)[0][0]

The same shape sweeps a whole problem set, or runs a prover’s Pass@k with the Lean verifier as the executor. A heavy tool — a Gröbner basis, a big symbolic integral, a Sage or PARI-GP call — goes on its own cheap CPU modal.Sandbox next to the GPU endpoint, so we are not paying for an idle H100 to run SymPy.

What can we rent instead of self-hosting? Fewer options than I’d like. Few specialist maths models are hosted per token, and the one serverless home for the narrow solvers (OpenMath-Nemotron, AceMath²) is Featherless, which caps concurrency per plan — exactly the wrong constraint for fan-out. So sampling these wide usually means self-hosting, on RunPod (from ~$0.27/hr) or Modal (scale-to-zero, which wins on bursty use).³ The exception is at the top of the range: DeepSeek-Prover-V2-671B would need an eight-GPU node to self-host, but Novita meters it cheaply (~$0.70/$2.50 per 1M tokens), so pairing it with a local lean-repl loop is a cheap way into industrial proving.

6.2 On a laptop

When the model runs no code, the whole loop collapses onto a single Mac: maj@k is just $k$ chat calls and a Counter — no Modal, no .map, no per-chain sandbox. Point it at a local endpoint that can batch the samples and keep the model resident; oMLX is the convenient one, for reasons covered in the Mac notes. A small-but-strong reasoner shines here — many copies for the price of one specialist — and VibeThinker-3B runs even at laptop scale, frontier-level verifiable maths at 3B if the numbers hold. We wrap it in a plain solve() and call it straight from the loop, not as an MCP tool — which is only needed when the agent is doing the orchestration.`

# maj@k against a local oMLX endpoint — the reasoner emits no code, so no sandbox
from openai import AsyncOpenAI
import asyncio, re
from collections import Counter

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="x")
BOXED  = re.compile(r"\\boxed\{([^}]*)\}")

async def sample(problem):
    r = await client.chat.completions.create(
        model="vibethinker:solve",          # temp 1.0 / top-p 0.95 set on the endpoint (greedy = nothing to vote on)
        messages=[{"role": "user", "content": problem}], max_tokens=32000)
    hits = BOXED.findall(r.choices[0].message.content or "")
    return hits[-1].strip() if hits else None   # last boxed = the final answer

async def solve(problem, k=16):
    answers = await asyncio.gather(*(sample(problem) for _ in range(k)))
    votes   = Counter(a for a in answers if a)
    return (votes.most_common(1)[0][0] if votes else None), votes

Two details. Set the sampling parameters: greedy decoding (temperature 0.0) returns the same answer every time, so there is nothing to vote over. And extraction is easier than it looks — a reasoning model buries its answer under a long <think> trace, but most OpenAI-compatible servers expose the post-think conclusion separately (oMLX puts it in reasoning_content), so the content we regex is the short conclusion and the last \boxed{} is reliably the answer the model is committing to. Return the whole votes tally, not just the winner: the margin is the loop’s confidence signal — 14/16 is a different thing to act on than a 6/5/5 split, where the driver can escalate $k$ or surface the disagreement instead of committing.

6.3 When are two answers the same?

From the description above, this all sounded relatively straightforward. Did you see what was going to go wrong, reader? It turns out, maj@k is not at all trivial to implement because written mathematics is not formal, which means it is not at all clear when two expressions are equivalent (as I learned implementing mathx). Is 0.5 \log \cos(ax-b) the same as \frac12 \ln \cos(b-ax)? The standard fix (e.g. in math-verify) is to parse both answers into a Computer Algebra System (CAS) and test equality there. Which is better than nothing but not complete, because…

Parsing. LaTeX is a typesetting language, not a syntax. The inconsistencies are each reactively fixable, one patch at a time, but the supply of such inconsistencies is inexhaustible.
Denotation. One sample writes $\sigma_1$, another writes $s_1$, for the same quantity. The claim “these denote the same thing” cannot be settled on its merit. (math-verify’s strict=False tries positional variable-matching; but this did not help and additionally clustered things it should not, like $\mu_1$ with $\mu_2$.)
Undecidability. Zero-equivalence for expressions built from $\exp$, $\log$, $\pi$ and friends is undecidable — Richardson’s theorem (Richardson 1968) — so simplify can only ever be heuristic and there will be exceptions.

In the latter two classes, people try to use LLM judges (another LLM) to guess whether things are the “same”. OpenAI’s simple-evals scores MATH with a model-based equality-checker prompt rather than rules; NVIDIA’s NeMo-Skills ships an LLM judge beside its sympy checks. The trained verifier models are less useful than they sound: they grade candidate-against-ground-truth, not candidate-against-candidate, which is what clustering needs. xVerify has bad licensing; and the Omni-MATH-2 audit (Ballon et al. 2026) found Omni-Judge wrong in 96% of its disagreements. CompassVerifier (Apache-2.0) is the credible trained option, and its human-labelled VerifierBench is a free test set for whatever judge you do run.

In the little toy mathx project, I went for a tiered, audited approach. Exact match wins, then CAS equality (both directions — verify() is asymmetric), then an opt-in LLM judge for the pairs the CAS refuses, with order-bias hygiene (both presentation orders must independently say equivalent) and conservatively reject anything that does not parse. The margin display attributes the merging to the actual method used (like: 11/12 (3 judge merges)). On one of my test problems (maj@16), this got us from 4/12 naïve matches, to 8/12 after the parsing fixes, up to 11/12 with the judge.

7 Lean etc

Provers have their own requirements — the Lean-emitting family. Lean is the proof-assistant language they target, and the compiler-in-the-loop is what makes a prover a prover: everything above works on a boxed answer, while a prover produces a machine-checked proof. Everything so far is the coding harness with a maths model dropped in; Lean breaks that pattern in two places.

The first is the executor. It stops being a Python sandbox whose output we splice back, and becomes an exact verifier: the Lean compiler accepts or rejects a proof, and there’s no arguing with it. That’s mostly good news. The compile-and-retry loop feeds compiler errors back where a tool result used to go, and stops when the compiler accepts the proof. The orchestrator is the solver’s loop with a different halting rule, and the sandbox is still the same run(code) -> result box — now with a Lean server where the IPython kernel sat:

# verifier sandbox — same interface as the IPython one, Lean underneath
@app.cls(image=lean_image, cpu=8.0, memory=32768, scaledown_window=120)
class LeanVerifier:
    @modal.method()
    def run(self, proof: str, timeout=300):
        ...   # POST to the in-image kimina-lean-server; return {"ok": bool, "errors": [...]}

A Lean toolchain plus a built Mathlib is gigabytes and hours to compile, and the prebuilt cache that skips the rebuild only matches an exactly pinned toolchain — so pin Mathlib to the model’s commit, or proofs fail for reasons that have nothing to do with the maths. Underneath this runs atop one of lean-repl (the lightest), Pantograph, or LeanDojo, usually behind the Kimina Lean Server so Mathlib isn’t re-imported for every proof. This lean_image business is fiddly, version-sensitive plumbing. Modal’s own theorem-proving case study shows how it is done.

The second break is at the front of the pipeline, and it has no clean fix. A prover, unlike a solver, takes a formal Lean statement, so a “prove this English sentence” pipeline first needs autoformalization — a model to turn the English into a Lean theorem … := by sorry. That step is a separate, much less reliable model (Kimina-Autoformalizer, Goedel-Formalizer-V2): the formalizers produce a faithful statement only 55–75% of the time, while the verifier downstream is exact. So in a fully automated prover the weak link is no longer the proof but the statement. Start from a formal statement, as the benchmarks do, and the prover mirrors the solver almost line for line; start from English, and autoformalization is the least trustworthy link in the chain.

8 The frontier

The top of the leaderboard isn’t a bigger prover we can rent by the token. The current leaders — Aleph from Logical Intelligence, Seed-Prover 1.5, Aristotle — are API-or-pilot-only, priced in dollars per problem rather than a flat per-token rate — Aleph’s runs cost an estimated $23–$68 each. The strongest of these have cleared gold-medal-equivalent on IMO 2025 in machine-checked Lean.⁴ The open-weight specialist provers we can download — DeepSeek-Prover, Goedel-Prover — are near the bottom of that same leaderboard.

It turns out much of that performance gap, can be made up by using available models cleverly. Against the leaders much of what we lack turns out to be the right loop, not the model — we can rent a generalist nearly as good as their specialists if we use it right. Requena et al. (2026) rank iterative refinement first and memory second as features of such a loop. So the leaders are closed as products; but we might imagine that we can recover a good approximation of them from open recipes.

AxProverBase is such a harness, in the lean4 prove style. It has no recursive decomposition, and was tested with an off-the-rack Claude Opus 4.5 (no fine-tuning) in an iterative compile-and-retry loop with memory and Mathlib search. It reaches 54.7% on PutnamBench — near-parity with the open specialist Hilbert while spending about 100× fewer tokens, at roughly $12.6 a problem against Aleph’s $23–68.

The top rung is (probably) the same loop with heavier engineering, every crude part swapped for a subtler one: a self-grading model where the majority vote used to be, an interleaved formalize-and-critique step where the bare compiler used to be, and autoformalization pulled inside the loop lemma by lemma to soften the brittleness of the Lean interface.

9 Frontends

The frontend is the loosest-coupled part of the stack — the maths model neither knows nor cares what renders its tokens — so picking one is cheap and reversible. Four questions decide it:

Does it render maths?
Will it take a PDF?
Can it call the oracle?
If we are proving: can we inspect the Lean proof state?

When a hosted general agent is the orchestrator there is not much to decide: we read the output in whatever client we already drive, and the oracle hands back a short answer and a vote tally rather than a reasoning trace, so there is little to render anyway. The questions bite when we move onto open models, or build our own loop.

	Maths rendering	PDFs into context	Oracle calls	Lean state
Anthropic bundle (Desktop, Code, Cowork)	out of the box	native multimodal, drag-and-drop	MCP + skills	no
Open WebUI	out of the box, every delimiter	built-in RAG	MCP, Tools, Python interpreter	no
Qwen-Agent `WebUI`	delimiters registrable	built-in RAG (`files=[…]`)	a plain Python function	no
Goose / OpenCode / Jan	needs a system-prompt line	BYO via MCP	MCP; skills in Goose	no
our own script	ours to render	ours to parse	a plain function call	the compiler’s complaints
Lean 4 web editor	n/a	n/a	n/a	yes — the goal panel

The Anthropic bundle takes the first two questions by brute convenience; the standard caveats about proprietary vendors apply. Among the self-hosted options, Open WebUI and Qwen-Agent are the two that answer yes to the first three.

9.1 Maths rendering

Hosted clients render LaTeX without our help, so this question only bites self-hosted frontends, and there it bites in two places: delimiters and streaming. Delimiters first. Models like to emit $…$ and \[…\], while most chat UIs only wire $…$ and $$…$$ into their renderer, so the equation arrives as raw backslashes. The fix is a system-prompt line telling the model to emit dollar delimiters; Goose, OpenCode, and Jan all need it. Open WebUI does not: its KaTeX pass registers every delimiter style — $…$ , $$…$$, $…$, \[…\], even \begin{equation} — behind guards that keep prose dollar signs out of maths mode.⁵ Qwen-Agent attacks it from the other end: its Gradio UI lets us register the delimiters at the renderer (gr.Chatbot(latex_delimiters=…)) instead of nagging the model. Streaming is the second fumble.⁶

9.2 PDFs into context

The papers we reason about are PDFs, and how one gets into context is an architecture question — the general survey is in the agents notebook. The Anthropic bundle reads the binary natively: vision models, drag-and-drop, charts and layout included. Open WebUI and Qwen-Agent run their own extraction-plus-RAG pipelines, which handle prose papers fine but tend to mangle exactly the parts a mathematician wants — display equations, dense notation, scanned pages. For those, convert deliberately first and feed the markdown to whichever frontend; extraction quality is a property of the converter, not of the chat window. Goose, OpenCode, and Jan bring no pipeline of their own, so they are on that deliberate path regardless.

9.3 Oracle calls

The oracle mechanics are above; the frontend question is only which plumbing it accepts. MCP is the lowest common denominator — Open WebUI, Goose, Jan, and the Anthropic bundle all speak it, so one mathx MCP server covers the lot. Skill-speaking agents (Claude, Goose) can take the same oracle as a CLI tool documented by a SKILL.md, which is cheaper on context. Qwen-Agent skips protocol entirely: the oracle is a Python function we register. The frontend-level trap is the timeout — a synchronous solve() that fans out for three minutes will exceed some clients’ tool-use time limit, which is what the async handle pattern is for.

9.4 Lean state

No chat frontend renders Lean proof state; the fourth question has one good answer and it is not a chat UI. The Lean 4 web editor runs a Lean server with a live goal panel, so we keep it open beside the chat rather than build our own. When we own the loop we at least see what the loop sees: the compiler’s complaints are the loop’s food, and printing them is free. A custom React app with an embedded editor buys full control for a day or two of plumbing; I’d live in Open WebUI first and defer that until the loops earn it.

9.5 Our loop behind Open WebUI

Owning the loop need not mean owning the frontend too. The table above splits “Open WebUI” from “our own script” as if they were rival choices, but Open WebUI is built to keep its UI while we swap the brain underneath — so we can run our own orchestrator and still inherit the KaTeX rendering, the RAG, and the chat history without writing any of it. No fork required; there are three seams for this, in rising order of ceremony:⁷

A Pipe Function is the low-friction seam: a Python function that registers as a custom “model” in the chat dropdown, intercepts the prompt, and returns whatever our loop produces. Enough for a single custom agent.
A Pipeline is the same idea one size up — a standalone Python service that Open WebUI talks to as if it were a model, so the generate-run-verify loop (routing, retries, fan-out, the oracle call) lives in its own process. Upload a .py file, or point it at a raw GitHub URL, and it shows up as a model in the UI. This is the one I’d reach for first for the maths loops here.
MCP (native since v0.6.31) is for the tools the loop reaches for rather than the loop itself — the mathx oracle, a Lean server — and composes with either of the above.

So the sequence is: keep the stock UI, put the loop in a Pipeline, hang tools off MCP, and defer the custom React app until proof-state rendering — or something else the chat window can’t give us — earns it.

10 Incoming

The Journal of Artificial Mathematics
AxProverBase — the reference implementation of Requena et al. (2026): the same compile-and-retry loop, plus a memory module, Mathlib search, and a two-stage review — deterministic checks for sorry/admit, then an LLM reviewer that confirms the statement wasn’t weakened and flags metaprogramming dodges that compile without proving the theorem.
Bolzano (source) — a colleague’s “automatic researcher” aimed at open problems

11 References

Achim, Best, Bietti, et al. 2025. “Aristotle: IMO-Level Automated Theorem Proving.”

Akyürek, Damani, Qiu, et al. 2024. “The Surprising Effectiveness of Test-Time Training for Abstract Reasoning.”

Ballon, Algaba, Verbeken, et al. 2026. “Benchmarks Saturate When The Model Gets Smarter Than The Judge.”

Chen, Jiangjie, Chen, Du, et al. 2025. “Seed-Prover 1.5: Mastering Undergraduate-Level Theorem Proving via Learning from Experience.”

Chen, Luoxin, Gu, Huang, et al. 2025. “Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving.”

Dekoninck, Jovanović, Gehrunger, et al. 2026. “Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs.”

Goemans, Altman, Dreksler, et al. 2026. “Comprehensive AI Governance Requires Addressing Non-Model Gains.”

Lin, Tang, Lyu, et al. 2025. “Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction.”

Liu, Chen, Shoeybi, et al. 2025. “AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling.”

Ma, Ma, Guo, et al. 2026. “OProver: A Unified Framework for Agentic Formal Theorem Proving.”

Moshkov, Hanley, Sorokin, et al. 2025. “AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning Dataset.”

Ren, Shao, Song, et al. 2025. “DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition.”

Requena, Letson, Nowakowski, et al. 2026. “A Minimal Agent for Automated Theorem Proving.”

Richardson. 1968. “Some Undecidable Problems Involving Elementary Functions of a Real Variable.” The Journal of Symbolic Logic.

Shao, Luo, Lu, et al. 2025. “DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning.”

Shao, Wang, Zhu, et al. 2024. “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.”

Su, Diao, Lu, et al. 2025. “ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration.”

Varambally, Voice, Sun, et al. 2026. “Hilbert: Recursively Building Formal Proofs with Informal Reasoning.”

Venhoff, Arcuschin, Torr, et al. 2025. “Base Models Know How to Reason, Thinking Models Learn When.”

Wang, Unsal, Lin, et al. 2025. “Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning.”

Yang, Kaiyu, Swope, Gu, et al. 2023. “LeanDojo: Theorem Proving with Retrieval-Augmented Language Models.”

Yang, An, Zhang, Hui, et al. 2024. “Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement.”

Ye, Yan, He, et al. 2026. “VERINA: Benchmarking Verifiable Code Generation.”

Yu, Peng, Ding, et al. 2025. “FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models.”

Footnotes

Laptop-scale model picks (OpenMath-Nemotron, Qwen2.5-Math, Goedel-Prover, all via MLX) are in the Mac notes.↩︎
AceMath is CC-BY-NC: non-commercial only, wherever it runs.↩︎
Privacy ranking, briefly: self-hosting on Modal is strongest; the metered shortlist (Novita, Featherless) was pre-filtered to no-train endpoints; keep unpublished proofs off may-train first-party APIs and be thoughtful about OpenRouter, which delegates retention upstream.↩︎
Mind the marketing on Aleph, though. Logical Intelligence frames it as a step beyond LLMs toward an energy-based reasoner that “does not think in words”, but by their own telling the benchmarked Aleph is built on an LLM, and the language-free model is a separate, unreleased, unbenchmarked thing.↩︎
Recent, though: inline $…$ was deliberately excluded into early 2026, so an older install still wants the system-prompt line.↩︎
A $$ arrives before its closing pair, so the equation shows as raw text until the delimiter lands — the maths case of the flash-of-incomplete-markdown problem, fixed by buffering until the delimiters balance. Rendering a document rather than a chat stream sidesteps it entirely.↩︎
Effort scales the way we’d expect: a bare custom loop is a weekend, a solid personal setup with model routing and retries a few evenings, a polished thing with auth and observability longer — all tractable precisely because Open WebUI keeps display and orchestration separate.↩︎