Maths and proof models, applied
Delegating the queen of the sciences to the machines
2026-05-31 — 2026-06-08
Wherein the Same Three-Role Loop—Model Server, Code Sandbox, Orchestrator—Is Applied to Both Tool-Integrated Solvers and Lean 4 Provers, With Deployment Paths From Laptop to Scale-to-Zero Cloud Surveyed
Mathematical reasoning LLMs are a thing. How should I use them in practice? As a working mathematician I want a fluid workflow that lets me reason faster over new domains, check my work, and augment my feeble meat brain. As we should expect for mathematical tools, the ecosystem lacks depth. There just isn’t the competitive marketplace of well-considered tools that we would see for e.g. munging office memos or legal advice or some other such normie workflow. Two different jobs hide inside that wish, and they want different tooling, so I will split them up front. The first is an everyday maths copilot — fast, low-ceremony, good enough to check an integral or scout an unfamiliar domain mid-paper. The second is industrial solving and proving — self-hosted specialist models, Lean pipelines, cloud fan-out over many samples. The everyday job has a boring answer, which I give in easy mode just below; most of this post is the second job, the deep end, for when that answer runs out.
This is mostly slop right now but publishing because it is useful.
1 Easy mode
I can get a long way with the ordinary agent harness I already use for code, pointed at a rented generalist. A frontier model I can rent — Claude, the latest DeepSeek, a big Qwen3-Thinking — is, dropped into that harness, also a fine mathematician. The loop is the generic one: native tool-calls into a Python/SymPy sandbox, the model deciding when to compute, the sandbox running it, the result coming back — no maths-specialist weights, no TIR template, no GPU. For everyday work — checking an integral, scouting an unfamiliar domain mid-paper — a single careful pass is plenty. To actually read the output — rendered LaTeX, the executed code, its results — Open WebUI handles it out of the box with a built-in Python interpreter, and I say more about frontends below. The whole privacy cost is that I am renting, so I point the harness at a no-train endpoint and move on.
This is most of the value, and it is the softening SOTA cashed out: the rentable generalists now top the open maths leaderboards, so the low-effort setup is close to the best one anyway. The one lever past a single pass — sampling a question many times and majority-voting the answer (maj@k) — is a script around the endpoint, not anything the chat window does for me, so I count it as the first step into the deep end rather than part of easy mode. Everything below is the deep end. I go beyond easy mode only when I need a model nobody rents, a formal Lean proof rather than a boxed number, or fan-out over thousands of samples that only pays off self-hosted.
2 Terminology
Recall that mathematical reasoners come in three flavours — a generalist baseline, and two specialists that split on how their output gets checked.
- Generalist
- a frontier LLM with nothing maths-specific in the harness — pure chain-of-thought, or native tool-calls. The flavour we can rent per-token, and increasingly the strongest on the open leaderboards.
- Solver
- answers a problem in natural language, optionally writing and running Python along the way — using a variant of the agentic harness, a tool-integrated-reasoning (TIR) system. Examples: Qwen2.5-Math, OpenMath-Nemotron, Skywork-OR1. Needs a specialized code-execution loop.
- Prover
- emits Lean 4 and little else, and a compiler decides whether the output is a proof. Examples: Goedel-Prover, OProver, DeepSeek-Prover. Needs a compile-and-retry loop.
Both flavours of loop are provided by special reasoning harnesses, described next.
3 Reasoning harnesses
Both specialist flavours — the TIR-tuned solvers and the Lean provers — need harnesses different from the generic agent libraries. They split on the loop each drives: code-execution for solvers, compile-and-retry for provers. Drop a maths specialist into smolagents or Claude Code and it will silently fail; the harness has to be matched to the model.
For the reasoning-mode mechanics (what CoT and TIR are, why provers don’t hallucinate Pass@n, how to read AIME numbers), see the maths sections of the reasoning notebook. For Mac-specific runner detail (which MLX vs GGUF builds exist, Docker-on-Mac caveats), see Maths specialists in the Mac notebook.
3.1 Why specialized reasoning models need specialized harnesses
The mismatch is at the wire-format layer. General-purpose agent libraries (smolagents, Claude Code, Cursor, any MCP client) drive code execution through OpenAI-style function-calling JSON — the model emits a structured tool_call object that the harness parses. Maths and prover specialists were never trained to emit that JSON. They emit Python (or Lean) inside their own trained template: a fenced ```python block with a matching ```output block spliced back, or a <tool_call>…</tool_call> tag pair, or a Lean tactic block.
The JSON-parsing harness never sees a tool_call, so the specialist’s code prints as text, the interpreter never fires, and the loop never closes. The access options for these models divide by whether the harness speaks the model’s native format.
3.2 Three roles, independently placed
Whatever the flavour, the harness factors into three roles, and the point of naming them is that each can run somewhere different.
- Model server
-
emits tokens — a local Osaurus or
mlx_lm.server, a rented GPU, or a hosted endpoint. - Executor / verifier
- the part that acts on the model’s output — a Python sandbox that runs a TIR code block, or a Lean compiler that accepts-or-rejects a proof. The two flavours differ only in what sits here.
- Orchestrator
- the loop that ties them together — generate, run-or-compile, splice the result back, repeat.
The orchestrator only needs an executor with a run(...) -> result interface, so whether that executor is a kernel on the laptop or a remote sandbox sized to a heavy computation is a deployment choice, not a change to the loop. The applied side of that choice — local kernel versus a remote Modal sandbox, and the cost, privacy, and batch consequences — is worked through in where the executor runs.
3.3 TIR — tool-integrated reasoning
The loop itself is simple: the model emits a block of Python, a sandbox runs it, the output is spliced back into the context, and generation continues until a boxed answer. In the stack vocabulary the loop lives in the harness layer — the model server just serves tokens; a client on top adds the execute-and-feed-back behaviour.
Where the trained-template detail matters: the delimiters differ by model and must be read off the model’s code-tags, not guessed from its name. Qwen2.5-Math fences code as a ```python block and reads the result back from a ```output block, stopping on that output fence and a --- divider. OpenMath-Nemotron instead wraps code in <tool_call>…</tool_call> tags. Either way, keep a single IPython kernel alive across steps so variables persist, and cap the fed-back output (~1000 characters) or a runaway print floods the context.
Four ways to wire TIR up, format-matched first:
- Roll our own (~100 lines). Endpoint-agnostic, points at any OpenAI-compatible local server, no dependency beyond a Python sandbox. The contract: generate with a stop on the code-end delimiter, extract the code, run it in a long-lived IPython kernel, splice the output back in the form the model expects, continue until the boxed answer.
- Qwen-Agent. The reference loop for Qwen2.5-Math — ships a
TIRMathAgentwith the correct prompt and code-tags. Point itsllmconfig at a local/v1endpoint; the shipped demo hardcodes Qwen’s DashScope cloud, so that one line needs swapping. - NeMo-Skills. NVIDIA’s reference harness for OpenMath-Nemotron, and the only thing that does GenSelect (its best-of-n selection stage). It doesn’t have to host the model — point its
nsCLI at a local endpoint and let it own the loop. Overkill unless we specifically want GenSelect. - Generic agents and GUIs. smolagents, Open Interpreter, Open WebUI’s code interpreter, MCP python-sandbox servers. They run locally and they execute code, but they impose their own code-calling convention, which fights the maths model’s trained format. For point-and-click code execution with a generalist model these are fine; for maths-TIR with a maths specialist, use one of the format-matched harnesses above.
3.4 Proof self-correction loops
A second specialist class — Lean provers like Goedel-Prover-V2, OProver, DeepSeek-Prover — emits Lean 4 and nothing else useful. Raw weights only emit Lean text, and on its own that’s worthless: Lean either accepts a proof or it doesn’t. Every headline miniF2F number assumes a compiler in the loop: generate, compile, feed the error back on failure, regenerate.
That self-correction loop is the difference between a one-shot guess and the benchmark figure, which makes the surrounding tooling part of the model rather than an optional extra:
- Lean 4 + Mathlib via
elan/lake— the checker that decides whether a proof is a proof. - A bridge from model output to that checker.
lean-replis the lightest, and enough for the self-correction loop the whole-proof models rely on. Pantograph is faster and built for proof search. LeanDojo is the heavy option for tactic-level tree search. - Pin Mathlib to the model’s commit. Provers are brittle to lemma renames across Mathlib versions, and a mismatched pin fails proofs for reasons that have nothing to do with the maths.
For why a compiler in the loop is a different kind of verifier than a maj@k vote — and why this lets us sample provers at Pass@32 without worrying about false positives — see Proving theorems with a compiler in the loop.
For sandboxing the model-written code these loops execute — what works, what does not, Docker --network none on a Mac — see Sandboxing model-generated code; the pattern is generic to every code-running agent.
4 Where to run it
I suspect that mathematics is naturally executed as a cloud process, because there are so many opportunities for parallelism. But just for fun, let’s see how far we can go with local stacks too, and compare.
| TIR — solvers | Lean 4 — provers | |
|---|---|---|
| Laptop (7–32B) | OpenMath-Nemotron-14–32B or Qwen2.5-Math via MLX in Osaurus; Qwen-Agent or a ~100-line loop; a local IPython kernel runs the code | DeepSeek-Prover-V2-7B → Goedel-Prover-V2-32B (MLX 8-bit); lean-repl self-correction; Lean 4 + Mathlib on disk |
| Cloud (32B–671B) | OpenMath-Nemotron-32B + GenSelect via NeMo-Skills, on vLLM; or pull tokens from a hosted endpoint and run the executor locally | OProver-32B’s agentic pipeline (vLLM + Kimina Lean Server + the OProofs corpus); or DeepSeek-Prover-V2-671B off the shelf (below) |
The matrix covers only the two specialist loops — a generalist is mostly an ordinary agent so there are no special endpoint concerns.
A laptop is a cheap place to build and experiment with the loops — both the TIR executor and the lean-repl plumbing could run locally or in the cloud, so there is no reason to pay for a GPU while debugging them. I collected some Mac-specific runner detail for those curious about running everything locally. The cloud is interesting for parallelism.
5 Off the shelf
Few open-source model cloud endpoints host serverless (i.e. pay-per-token) mathematical specialist models. The narrow maths and prover models are mostly download-and-self-host. There are some options, though:
| Model | Cost-per-token | Subscription | Otherwise |
|---|---|---|---|
| DeepSeek-Prover-V2-671B | Novita (~$0.70/$2.50 per 1M) | — | |
| Qwen2.5-Math-7B, OpenMath-Nemotron-7B/14B/32B, AceMath-72B | — | Featherless (~$10–100/mo flat) | else DIY |
| Goedel-Prover-V2-8B/32B, Kimina-Distill-8B | — | Featherless | else DIY |
| Qwen2.5-Math-72B, Skywork-OR1-Math-7B, Mathstral, DeepSeek-Prover-V2-7B, OProver-8B/32B, Kimina-Prover-72B | — | — | DIY only |
Surprisingly, NVIDIA’s build.nvidia.com does not host their famed OpenMath-Nemotron or AceMath.
No provider runs the TIR or Lean loop for us. The code execution is on us.
Featherless is seemingly the only serverless host for the specialist solvers — OpenMath-Nemotron and AceMath1 can be found in no per-token catalogue (DeepInfra, Novita, Together, Fireworks), only there. This is irritating for parallelism: Featherless caps concurrency per plan, so sampling these specialists wide means self-hosting.
The top row of that table is surprising: a frontier Lean prover — DeepSeek-Prover-V2-671B, which would otherwise need an eight-GPU node — is metered on Novita at a very low price. Paired with a local lean-repl loop, that is a seriously cheap entry point into industrial-grade automated proving.
6 DIY cloud
If a model is DIY-only (every prover except the 671B, and the bigger solvers), we can still build a token-serving box in the cloud and host. This is not so hard these days.
RunPod is one easier-than-AWS option: an SSH/Jupyter machine with a GPU and a big disk, billed per-minute and cheapish. I’ve used it before. Hefty tools like OProver’s complicated pipeline, with its multi-million-file corpus, could be stashed on RunPod. The cheap end runs ~$0.27/hr, plenty for a Lean server, which is likely CPU-bound; more for GPU obvs.
Modal is a semi-DIY serverless model (“rent functions, not a box”) with special magical Python support, and it wins on bursty use because it scales to zero between requests. For Python-wrapped jobs it often goes surprisingly well, by which I mean, is idiot-proof to set up. The TIR Python executor and the Lean verifier can likely both be wrapped for its modal.Sandbox. Persistent Volumes are cheap ($0.09/GiB-month, 1 TiB free), though packing a multi-million-file corpus or a Mathlib cache onto one has its own gotchas.
7 A worked example
Parallelism is the whole case for the cloud here, so it pays to be clear up front about what is actually parallel.
A single solve is irreducibly sequential: each code block depends on the result of the last, so the model, the executor, and the orchestrator take turns, and no amount of hardware shortens that chain. The parallelism is across samples and problems — the maj@k draws are independent, a problem set’s entries are independent, and a prover’s Pass@32 is thirty-two independent chains. That is the axis to fan out along, and the binding constraint there is almost never the GPU. The three roles parallelize unevenly: the model server already batches many chains against one card (vLLM, SGLang), the orchestrator is just \(k\) async loops, and the laggard is the executor — a hand-rolled loop holds one kernel, a bare lean-repl compiles one proof at a time. So the question a workflow answers is not “how big a GPU” but “how do we run many executors cheaply”.
Modal answers that tidily, and the rest of this section is the shape of one concrete instance — a heavy model on a rented GPU, the executor wherever we want it. The working code is in pudding; here I only want the skeleton.
Tokens come from a solver served on Modal as a scale-to-zero, OpenAI-compatible vllm serve endpoint: a few dozen lines that mount a Volume so the weights download once and drop the GPU to zero when no one is solving, so we pay per second only while a request is in flight.
The loop drives that endpoint, stops on the code-end fence, runs each emitted block, and splices the result back in the template the model expects — the TIR contract. It only ever calls run(code) -> str, and does not care where run happens; that one-method interface is the hinge the whole workflow turns on.
The executor is anything satisfying that interface. On the laptop it is a persistent IPython kernel — variables carry across blocks, which a stateful model like Qwen2.5-Math assumes — which keeps the model-written code and our problem data on a machine we control, the strong end of the privacy story. A wall-clock timeout is the pragmatic security floor; anything shared wants the kernel in Docker with --network none, per sandboxing model-generated code.
That single-problem, local-executor loop is enough to debug the whole contract — fence parsing, kernel state, stop tokens — and for interactive work I rarely need more. The cloud pays off at the next step.
7.1 Where the executor runs
Two things push the executor off the laptop, and Modal handles both the same way.
The first is a heavy tool. A Gröbner basis, a large symbolic integral, a Monte Carlo estimate, or a Sage/PARI-GP call can each take longer than the reasoning that asked for it, and they want cores and RAM, not an H100. Because the loop only needs run(code) -> str, we lift the executor onto a modal.Sandbox sized to the tool — a cheap CPU box beside the GPU endpoint — and the loop body never changes. Co-locating execution on the model box, as NeMo-Skills does by default, quietly pays for an idle H100 to run SymPy.
The second is fan-out, and it is what we rent the cloud for in the first place. Once the executor is a remote function, maj@k is a .map — one full chain per call, each with its own sandbox, all sharing the single vLLM endpoint:
Modal schedules the calls across as many containers as the concurrency cap allows, and scale-to-zero means the burst costs only the seconds it runs. The same shape sweeps a whole problem set — .map over problems instead of seeds — and parallelizes a prover’s Pass@k with the Lean verifier as the per-call executor. Most harnesses simply ignore this axis.
Resist the n>1 reflex — \(k\) completions from a single request look like a discount, but only the prompt prefill is shared, and in maths the prompt is short against a long reasoning trace, so decoding \(k\) traces stays the irreducible cost. maj@k buys accuracy, not a discount.2
8 The prover is the same shape
Everything above served a solver, but the matrix’s right-hand column — a Lean prover — reuses almost all of it. The three roles don’t change; only the sandbox does.
The model server is the solver’s vllm serve endpoint with the model string swapped — deepseek-ai/DeepSeek-Prover-V2-7B or Goedel-LM/Goedel-Prover-V2-32B — and the token budget raised, because a proof carries its chain-of-thought plan and runs long (8k tokens for the 7B, 32k for Goedel-32B, ~40k once self-correction is on).3 vLLM serves them OpenAI-style like any other model.
The orchestrator is the same loop with a different halting rule. The solver stops when the model emits no fresh code; the prover stops when the compiler accepts the proof, and on failure feeds the compiler errors back in place of a tool result. Goedel-Prover-V2’s built-in self-correction is exactly this loop, two rounds deep.
The sandbox is the only part that changes, and it is still a run(code) -> result box — it just runs Lean instead of Python. The canonical checker is leanprover-community/repl, which speaks JSON over stdin: send {“cmd”: “<lean>”}, get back a sorries list and a messages list of errors, and read an empty sorries with no error as a closed proof. The Kimina Lean Server wraps that REPL in a FastAPI pool with an import-header cache so Mathlib isn’t re-imported per proof — the prover’s answer to NeMo-Skills pooling code workers. So the verifier is the same run(code) -> result box from where the executor runs, with a Lean server where the IPython kernel sat and run returning {ok, errors}:
# verifier sandbox — same interface as the IPython one, Lean underneath
@app.cls(image=lean_image, cpu=8.0, memory=32768, scaledown_window=120)
class LeanVerifier:
@modal.method()
def run(self, proof: str, timeout=300):
... # POST to the in-image kimina-lean-server; return {"ok": bool, "errors": [...]}The catch is lean_image. A Lean toolchain plus a built Mathlib is gigabytes and compiles for hours, and the prebuilt cache that skips the rebuild only matches an exactly-pinned toolchain version — the same brittle-pin trap as a pickled numba model. This is precisely the tedious, version-sensitive plumbing that does not belong in a blog post, so the working image, verifier server, and retry loop are in a companion repo rather than here. Modal’s own theorem-proving case study builds exactly this split — GPU generation, a CPU Kimina verifier in a Sandbox, orchestrator on top — and is the thing to crib from.
One piece I am deliberately leaving out: autoformalization. The solver eats natural language, but the prover eats a formal Lean statement, so a “prove this English sentence” pipeline needs a model to translate English into a Lean theorem … := by sorry first. That is a separate model — Kimina-Autoformalizer-7B, Goedel-Formalizer-V2 — and a separate, much less reliable problem: the formalizers land a faithful statement somewhere in the 55–75% range, against a verifier that is exact. So I treat it as an optional front-end, not part of the core loop. Start from a formal statement, as the benchmarks do, and the prover pipeline mirrors the solver almost line for line.
9 The generalist softening
The lines between the three flavours are softening, and the theory notes track the mechanism. Here I only care about the part that changes what I can rent.
The solver/prover split is blurring: DeepSeekMath-V2 (November 2025, Apache-2.0) writes natural-language proofs and checks its own work with a built-in verifier — gold-level on IMO 2025 — so a model can now carry its own check where a solver leaned on a known answer and a prover leaned on the compiler. And the generalist baseline is closing on the specialists: on the open maths leaderboards the top scores increasingly come from general frontier reasoners — Qwen3-235B-Thinking, Kimi-K2, the latest DeepSeek — rather than the narrow solvers and provers. And the generalists are, per the off-the-shelf table, exactly the models we can rent per-token.
A generalist will not drive the TIR loop as written, though. The fence-splice contract — write python, get output back — is something the maths-tuned models were post-trained to expect; a generalist keeps its reasoning in a separate channel (a reasoning_content field, not inline) and does tool use through the structured tools/tool_calls protocol: it emits a JSON call and expects a role: tool message back, not an output fence. So the executor is unchanged — still a run(code) -> str box — but the orchestrator needs a second contract: native tool-calls, or none at all, since many reasoners do competition maths better in pure chain-of-thought, sampled and \boxed-voted with no executor in the loop. Mind the token budget either way: the reasoning trace shares it, and a stingy cap returns an empty answer.4
The provers are drifting the same way, from a single fine-tuned Lean-emitter toward an agent that wraps one — often a plain generalist — in a decomposition-and-retry loop. Axiomatic AI’s minimal-agent paper is the case that lands closest to home: an off-the-shelf frontier model — Claude Opus, no fine-tuning — wrapped in feedback-plus-memory-plus-search beats the fine-tuned specialist provers outright, near $12 a problem, and gets better for free as the base model improves. That is the build-the-loop, rent-the-tokens thesis of this whole post, seen from the prover side: the scaffolding is the durable asset and the model inside it is a swappable commodity.
What I cannot cheaply reach is the actual top of that table. The current leaders — Aleph, Seed-Prover 1.5, Aristotle — are closed, pilot-only, and metered in dollars-per-problem rather than per-token, so the open provers I can self-host stay near the bottom. The cloud’s pull for me is therefore still the fan-out over cheap samples, not a frontier prover I can rent.
10 Talking to it
A loop that prints to a terminal is enough to debug, but a mathematician wants to read the output — which means rendering LaTeX, and for a solver also showing the executed code and its results, and for a prover the proof state and the compiler’s complaints. That turns out to be its own small research burden, and I have not resolved it, so here is the lay of the land.
For a TIR solver, Open WebUI is the least-effort option: it renders maths out of the box and ships a built-in Python code interpreter, so executed code and its output appear inline without extra wiring. Gradio’s ChatInterface is the lightest programmable surface — it renders $$…$$ via KaTeX by default (inline $…$ needs a latex_delimiters tweak, and single-$ has a history of bugs), and it has no sandbox of its own, which suits us because we already run the executor.
Two gotchas to budget for. Streaming breaks LaTeX: a $$ arrives before its closing pair, the parser sees malformed input, and the equation flickers as raw text until the delimiter closes — the maths case of the flash-of-incomplete-markdown problem, fixed by buffering until the delimiters balance. And no chat UI renders Lean proof state at all; that affordance is custom — but the Lean 4 web editor (lean4web) already runs a real Lean server with the goal panel and diagnostics, so we should embed it beside the chat rather than build our own.
A custom React app (react-markdown + remark-math + rehype-katex) buys full control over all of this at the cost of a day or two of plumbing. For now I would use Open WebUI to play, and defer the custom build until the loops are worth dressing up.
11 Privacy
A reasonable prior is “they all train on us.” The free tiers and chat clients are skeezy; buying tokens through an API is better. Stacking the two halves — who hosts and what their terms say — sorts the realistic paths from best to worst for sensitive work:
- Self-host on Modal — the strongest: the model author is out of the path and the terms disclaim accessing inputs/outputs/code, inference zero-retention.
- A managed zero-retention endpoint — the whole off-the-shelf shortlist is in this tier, all no-train: Novita (“shall not log, store, or retain… Inputs… or any Outputs”) for the 671B prover or Featherless (“we do not collect or store prompts or completions”) for the specialists. The provider is in the path but retains and trains on nothing.
- Self-host on RunPod — real technical isolation, but the IaaS ToS reserves an “aggregated and anonymized to improve the Service” content licence (Community Cloud adds a peer-operated host). Weaker on paper than tier 2, even though it wins on architecture.
- A first-party “may-train” API — DeepSeek direct,
build.nvidia, any default-logging endpoint. Keep unpublished proofs away from these.
OpenRouter is outside the ranking — a router inheriting each upstream’s policy; its data-policy filter can pin routing to zero-retention / no-train providers, but the cheapest free route often does neither, so set it deliberately.
Self-hosting wins on architecture; it does not automatically win on contract — which is why Modal, whose terms also disclaim content use, tops the list.
12 A path through
For a mathematician starting cold, in order of escalating effort:
- Start in easy mode: a rented frontier generalist with a Python/SymPy tool, a single careful pass — for everyday work the whole answer, the steps below only for when it is not enough.
- Build the loops on the laptop — a solver with a local TIR kernel, a prover with
lean-repl— because that plumbing is the part that transfers everywhere. - For Lean, the cheapest cloud step is not a GPU: meter DeepSeek-Prover-V2-671B on Novita and keep the compiler loop local.
- For heavier solver runs, pull specialist tokens from Featherless (subscription) or a scale-to-zero vLLM endpoint, executor still local.
- Graduate to a RunPod Pod (OProver’s full pipeline, big corpus) or Modal (endpoint plus sandboxes) only when the laptop and the metered endpoints run out.
- Keep unpublished work off the first-party “may-train” APIs; everything else on the shortlist is no-train.
- Wrap it in Open WebUI to read the working rather than tail a log; defer a custom frontend until the loops earn it.
13 See also
- The Journal of Artificial Mathematics [public] - Google Docs
- Neural nets that do symbolic mathematics — the theory: solvers vs provers, CoT and TIR, test-time scaling, why a compiler is an unfakeable verifier.
- Maths specialists on a Mac — which MLX and GGUF builds exist, which quant to pull.
- Code agents and assistants — the sibling applied notebook, for code rather than maths.
- A Lean prover on Modal — the companion repo: the Mathlib image, the Kimina verifier server, and the compile-and-retry orchestrator this post only sketches.
- Bolzano (source) — a colleague’s “automatic researcher” aimed at open problems, and the instructive foil for all of the above: the same three-role skeleton, but a soft LLM critic where we put a hard verifier and all-OpenRouter where we self-host — because with no oracle to check against, the architecture softens back to model-ensemble-plus-human.
Footnotes
AceMath is also CC-BY-NC — non-commercial only — wherever it runs.↩︎
Support is uneven among metered endpoints anyway — Novita exposes
n, OpenRouter ignores it and hands back one completion, and a flat-rate plan like Featherless caps us at its concurrency limit.↩︎Both models take the same prompt: a complete Lean 4 statement ending in
:= by sorry, wrapped in a “Complete the following Lean 4 code” instruction, and emit a proof plan followed by the proof. We read the answer off the last```lean4fence, exactly as we read the solver’s lastpythonfence.↩︎A 200 OK with usage billed and
contentempty, because the reasoning ate the budget before the answer arrived. Be generous with the cap.↩︎
