Running LLMs locally on a Mac

Osaurus, Ollama, JANG, DwarfStar, and the transformers path

2026-05-23 — 2026-06-12

Wherein the Mac Inference Ecosystem Is Surveyed Across Frontend, Runtime, and Format Layers, and macOS’s Hard Ceiling on Wired GPU Memory Is Identified as a Constraint Requiring Manual Override

computers are awful
machine learning
neural nets
NLP
UI
Figure 1

A twin post to front-end clients for AI image models, but for text. The local-LLM ecosystem on Macs is pretty luxurious, with a profusion of GUI options and Linux-y infra, and some specialised tooling that lags the community frontier but is not bad. Also, during the 2026 ramageddon, Macs suddenly look like remarkably good deals for high-RAM parallel-compute machines. I accidentally started going unreasonably deep and technical on this in the SOV repo. That repo really targets my coding assistant. Here is a human-facing version.

1 The stack

The cross-cutting vocabulary for AI-agent infrastructure — model, quantization format, runtime / inference engine, server / daemon, harness / agent loop, frontend / chat client — is defined once in AI agents, applied. The rest of this page assumes that vocabulary.

Quick reminder of where the Mac-specific tooling sits in those layers:

Most apps on this page are vertical bundles across several of those layers — Osaurus is frontend + harness + server + runtime in one, Ollama is server + runtime — but it pays to know which layer we’re looking at when an abstraction leaks (the tokenizer divergence below is a good example).

1.1 Compute backends

The compute backend is the runtime that runs the matmuls — where on the chip the work actually happens. Three of them cover local text inference on the M-series, and which one a tool picks sets both its speed and how soon it handles a new model.

  • PyTorch + MPS (Metal Performance Shaders) is the baseline. Most ML code reaches Apple Silicon through PyTorch, so coverage of new architectures arrives first — it is the lingua franca. Our embedding code runs here; speed is decent, not remarkable.
  • llama.cpp carries its own hand-written Metal kernels rather than going through MPS. It is the engine under Ollama and llama-server, fast and wide-coverage, and the one that eats GGUF.
  • MLX is Apple’s own array framework — faster still on the work it covers, less mainstream, and lagging by months on new architectures. Osaurus, mlx-lm, and the JANG stack all sit on it.

The image twin adds two backends that scarcely figure for text: CoreML on the Neural Engine for the lowest-footprint path, and Draw Things’ custom Swift + Metal stack — see front-end clients for AI image models.

1.2 Storage backends

The storage backend is the on-disk format the weights ship in — and, past full precision, the quantization scheme baked into it. At the sizes worth running on a 128 GB Mac this is the load-bearing choice: a 70B model at full fp16 is ~140 GB and simply will not fit, so quantization — trading a little quality for a much smaller footprint — is what gets it resident at all. A 4-bit build of that same 70B lands near 40 GB. The calibration data deciding which weights keep their bits is the other, quality half of the story. Every format below except plain safetensors is doing this fitting job, and each pairs with a compute backend that can load it.

  • safetensors — the Hugging Face baseline, full precision; what PyTorch + MPS loads when the model fits without help.
  • GGUF — the llama.cpp format. Not Apple-specific (it runs on CUDA and CPU too), it has the widest coverage and the finest quant ladder — down to the very-low-bit IQ2/IQ3 imatrix quants for the tightest squeezes.
  • MLX builds (mlx-community) — the MLX-native format, Apple-only. Quantized to fit exactly like GGUF — the 4-bit builds are why a 70B loads at all — with added speed on the M-series; the cost is that coverage lags and the quant range is coarser.
  • JANG — mixed-precision MLX: per-tensor bit-widths instead of one width for the whole model. MLX’s answer to imatrix, closing GGUF’s quality-per-bit edge at the low end.

So the choice between GGUF and an MLX build is about speed, coverage, and how tight a squeeze we need. Rule of thumb: prefer an MLX build for the speed when one is published; reach for GGUF when no MLX port exists yet, when we want the cross-platform llama.cpp ecosystem, or when only a low-bit imatrix quant will fit the model at all.

The fixed MLX width is indeed sus though — worth verifying that MLX models work OK. I suspect that JANG solves this better, but it is even more fringe.

2 Just chat with a model

The fastest path from zero to local LLM is a desktop app that bundles a model browser, a chat window, and an inference engine. There seem to be several main contenders. The Mac-native Osaurus is IMO the best, but it’s a risky choice.

2.1 Osaurus

Osaurus (MIT, brew install --cask osaurus, osaurus-ai/osaurus) is Swift-native, no Electron, no Python, and behaves like a proper Mac app.

I’m YOLOing all in on this because it seems efficient and easy. It locks us in to the Mac ecosystem so might not be for everyone. Also, it’s run by one person, so the bus factor is 1, which is risky. But — it’s so good!

The window has a model picker, a chat pane, a status indicator; the inference engine underneath is MLX (Apple’s own array framework for the M-series), so it gets fast tokens per second, fully leveraging the hardware.

Osaurus is also a server — it exposes OpenAI-, Anthropic-, and Ollama-compatible HTTP endpoints on localhost:1337, seemingly all three at once, which is some deep magic. Once we have it installed, anything else we want to point at a local model (code editor, embedding pipeline, Claude-Code-style tool) can talk to this local endpoint. It is also a full agent harness — the project now pitches itself as “the AI harness for macOS”, and the README leads with agents, not the chat window. Each agent carries its own prompts and persistent memory; pointing one at a working folder grants file, search, and git tools, and on macOS 26+ a sandbox toggle adds shell access in an isolated Linux VM via Apple’s Containerization framework — though anything running in that VM is Linux-CPU-only, beyond the reach of MLX and MPS. It imports agentskills.io-format skills (and whole Claude plugins) from GitHub or local files, selecting them by RAG at runtime, and speaks MCP in both directions, server and client. The harness layer is model-agnostic, fronting cloud providers as happily as the local MLX runtime.

One CLI gotcha: the osaurus command is embedded in the app bundle, and only the Homebrew install links it onto PATH automatically. If it is missing, symlink it: ln -sf “/Applications/osaurus.app/Contents/Helpers/osaurus” “$(brew --prefix)/bin/osaurus”. The documentation claims it is ln -sf “/Applications/Osaurus.app/Contents/MacOS/osaurus” “$(brew --prefix)/bin/osaurus”; but I think this is a typo — that launches the app, not the CLI helper. The CLI is worth having: osaurus serve / stop / status / list / run <model> / mcp, plus a plugin manager.

Osaurus is also the intended runtime for a custom mixed-precision MLX quantization format called JANG.

2.2 Jan

Jan (brew install --cask jan) is a FOSS cross-platform option. It looks nice. The UI is built on the mildly cursed Electron, but the plus side is that it runs on Linux, Windows, and macOS. It supports both llama.cpp (via Cortex) and MLX backends. For cross-platform use it looks pretty good. For brute-force running non-Apple-optimised models, it seems to go.

Jan is closer to a frontend + harness bundle than a pure chat client. The Projects / Assistants / Agents / MCP Connectors quartet gives it a tool-calling agent loop — MCP servers connect under Settings → MCP Servers (Exa search and a Jan Browser MCP are pre-configured), models call those tools, and the Agents mode runs multi-step autonomous workflows (Jan v2 VL is pitched as a 49-step multimodal agent). Jan Server is the self-hosted multi-step-orchestration variant. So in stack-vocabulary terms Jan covers frontend + harness + server + runtime, much like Osaurus does — just with a more chat-shaped UI and the Electron tax.

2.3 LM Studio

LM Studio (brew install --cask lm-studio) is closed-source, relatively slick and turnkey, and not free for commercial use. It runs both llama.cpp and — since 0.3.4, October 2024 — its own MIT-licensed mlx-engine (mlx-lm + Outlines + mlx-vlm), so MLX-native inference is not unique to Osaurus. I’m mildly sceptical of it because so many Cool LLM Technologies ship special bug fixes or alternate install paths for LM Studio, which hints at a slightly non-standard stack — though that might just be sampling bias, since more people file bug reports when more people run the thing.

3 Serving a model headless

Once we want a model serving as a daemon (“token fountain” we say at my work) rather than a chat window — a code editor, an embedding pipeline, a script that calls out to a local model — we want a long-lived daemon with an OpenAI-compatible API.

If we already have Osaurus running we are mostly done; it is already that daemon, listening on localhost:1337 with three flavours of compatible API. But also, for reasons of not installing an idiosyncratic stack, or superior customisation, we might want to install the standard Linux server stack.

3.1 Ollama

Ollama (brew install ollama) is a llama.cpp wrapper with its own model registry — fast enough, wide model coverage, and notably good for embedding models :

brew services start ollama
ollama pull qwen3.5  # LLM/chats etc
ollama pull mxbai-embed-large    # also handles embeddings

Anything OpenAI-API-compatible can now point at http://localhost:11434/v1. It’s smart to make it evict idle models, or it will load up many and tank the machine. OLLAMA_KEEP_ALIVE=15m does that.

Gotchas:

  • The .gguf files come from Ollama’s registry, not Hugging Face.
  • Some weird reimplementation headaches — e.g. the tokenizer baked into the GGUF can differ from the original for unclear reasons.
  • The “llama.cpp wrapper” framing is loosening: the registry now ships -mlx tags for some models (e.g. qwen3.5:35b-mlx).

3.2 Unsloth Studio

Unsloth is a different beast from Ollama — it’s a fine-tuning toolkit first, with a serving endpoint as a side effect. The headline claim is “2× faster training with 70% less VRAM” against vanilla transformers, achieved via custom kernels. Unsloth Studio wraps the toolkit in a browser-based no-code UI: ingest PDFs / CSVs / DOCX into a dataset, run training, export to GGUF or safetensors, evaluate side by side against the base model, and serve the result through a local OpenAI-compatible /v1 endpoint.

uv pip install unsloth
unsloth studio -p 8888    # browser UI on 127.0.0.1:8888

(The docs’ own example launches with -H 0.0.0.0, which binds every interface — leave that off unless we want the studio on the wifi.)

The reason to install Unsloth instead of (or alongside) Ollama is the fine-tuning side: train a LoRA on our own data, deploy it, query it, all from one place. The serving endpoint is the means of using what we just fine-tuned, not a competitor to Ollama as a generic backend.

Two Mac-specific caveats. First, Unsloth is CUDA-first; its signature kernels target NVIDIA GPUs. Apple Silicon support exists per the docs, but some paths fall back to slower implementations, and the VRAM-saving benchmarks are calibrated against a CUDA transformers baseline, not against MLX. I have not measured the Mac numbers myself. Second, if the priority is fine-tuning specifically on Apple Silicon, mlx-lm ships its own LoRA training path (mlx_lm.lora — it moved out of mlx-examples into its own repo) — fewer features, less polish, but native MLX kernels and no fallback paths.

3.3 (even-more-)Power-user options

llama.cpp itself ships a server: llama-server -m model.gguf exposes the same OpenAI-compatible endpoint with no daemon, no registry, no opinions. brew install llama.cpp. I’m not really sure when this would seem like a good idea? Ultrahobbyists? Trying to get a job at Anthropic? If I were going this deep I’d probably run it via transformers in my own Python process.

Apple Silicon optimised: mlx-lm ships its own server too: uv tool install mlx-lm then mlx_lm.server --model mlx-community/<repo>. One model per process; restart to switch.

4 Programmatic access via transformers

When we want to do things to a model — embed text, fine-tune, run interpretability tools, sample from internal layers, anything that touches the model internals — we drop down to Hugging Face transformers in our own Python process. It doesn’t need or want to be a server.

Embeddings for search are why I currently do this. For this I want sentence-transformers (uv pip install sentence-transformers — pulls torch with it), a thin wrapper around transformers that exposes the embedding API:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
embs = model.encode(texts, normalize_embeddings=True, convert_to_numpy=True)

The first call downloads the weights, the tokenizer config (tokenizer.json), and the model config from huggingface.co into ~/.cache/huggingface/hub/. Inference runs in our process via PyTorch. Tokenization runs in the same process, via HF’s Rust tokenizers library reading the same tokenizer.json the model was published with. One process, one library, one set of files.

On Apple Silicon we get a 15× speedup over fp32 by switching to float16 on MPS, with indistinguishable quality:

import torch
gpu = torch.cuda.is_available() or torch.backends.mps.is_available()
kwargs = {"model_kwargs": {"dtype": torch.float16}} if gpu else {}
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1", **kwargs)

The fp16 path matters on GPU or Apple Silicon, otherwise stay in fp32.

For text generation (rather than embedding) the equivalent is AutoModelForCausalLM.from_pretrained(...). PyTorch is rarely the fastest path on Apple Silicon — llama.cpp and MLX usually win on tokens-per-second — but it is the path that lets us see what the model is doing without arsing around. Activations, attention patterns, hidden states, custom sampling etc. are all possible from the Python prompt.

5 Models worth auditioning

Once we have a stack running, the next question is what to pull through it. A non-exhaustive list of picks worth running on a 128 GB Mac — the tier-shift matters here, since the “constrained” picks people post about for 16 GB laptops are not the ceiling for us, they’re the starting point.

5.1 Getting models into Osaurus

A mechanical note before the tables, because it confused me: Osaurus has no CLI download command, and its in-app Model Manager (⌘⇧M → Models) browses a curated catalogue — the Gemma-4 family, Qwen3.5/3.6, and the JANG zoo — not the whole of Hugging Face. A model being absent from that menu doesn’t mean it won’t run; it means we sideload it. Osaurus discovers anything dropped into its models directory:

hf download mlx-community/DeepSeek-R1-0528-Qwen3-8B-4bit \
  --local-dir ~/MLXModels/deepseek-r1-0528-qwen3-8b-4bit
osaurus list    # confirms discovery, and gives the exact API name for `osaurus run`

So in the tables below, “MLX via Osaurus” means this sideload unless the row says the model is in the curated menu. One more switch worth knowing: to keep two models resident at once (say the agentic daily-driver plus the maths model), set Settings → Local Inference → Model Management to Flexible — under the default Strict policy, loading one evicts the other.

5.2 For mathematical reasoning

Model Size (MLX 4-bit) Efficient path Why
DeepSeek-R1-0528-Qwen3-8B ~5 GB MLX sideloaded into Osaurus (mlx-community/DeepSeek-R1-0528-Qwen3-8B-4bit; Qwen3-8B arch, so any MLX runtime handles it); Ollama GGUF as the portable fallback The current small-model math champion — AIME-2024 86%, matching Qwen3-235B-thinking on that benchmark specifically (it trails on AIME-2025 and GPQA, so don’t over-generalise). The baseline to compare everything else against.
Phi-4-Reasoning-Plus-14B ~8 GB MLX sideloaded into Osaurus; GGUF/Ollama A different reasoning-trace style from DeepSeek’s distills — useful to triangulate when DeepSeek gives an answer we’re unsure about, not a stronger math model (it trails the 8B on AIME-2024).
Nemotron-Cascade-2-30B-A3B ~24 GB (Ollama) Ollamaollama run nemotron-cascade-2 (256K context, thinking + tools) from the official library, or bartowski’s imatrix GGUF (ollama run hf.co/bartowski/nvidia_Nemotron-Cascade-2-30B-A3B-GGUF:Q5_K_M; skip Q6 — it’s Q8-sized here). MLX: mlx-community ships standard quants (4-bit ~17 GB up to mxfp8 ~33 GB) that vanilla mlx_lm.server / mlx_lm.chat load fine — the #1266 gate bug fires only on JANG-style per-path quant configs, not standard conversions. JANG quants exist (JANG_2L ~10 GB / JANG_4M ~17 GB) but currently load only through the Python jang-tools stack: an Osaurus sideload of JANG_4M dies at weight-load (Unhandled keys [“biases”, “scales”, “weight”] … TurboQuantSwitchLinear, tested June 2026) — the Swift engine implements only the TurboQuant path for nemotron_h, and no JANGTQ build of Cascade-2 exists. The curated-menu omission is load-bearing. NVIDIA’s IMO-2025-gold math model (vendor-reported). The architecture is the weird part: a hybrid of Mamba (a state-space model, not a transformer) and MoE, pitched for linear context scaling so long tool-call histories don’t blow up RAM the way they do on dense attention. Audition it against “does linear-scaling attention show up in actual work?”

(I’ve dropped the older DeepSeek-R1-Distill-Qwen-32B here — it’s a Qwen2.5-era distill that the 0528-Qwen3-8B now matches on math at a quarter the size.)

Two Nemotrons, don’t confuse them. Cascade-2 above and the Nemotron-3-Nano-Omni-30B-A3B that Osaurus ships by default are both 30B-A3B post-trains off the same Nemotron-3-Nano base — same hybrid Mamba-MoE skeleton, different specialisation. Cascade-2 is the text reasoning/math build (Ollama; JANG quants via the Python stack only); Nano-Omni is the multimodal one (image + audio + video in), which is exactly why JANG packaged it and Osaurus defaults to it. So the Osaurus default is a local multimodal generalist, not the math champion — use it for “look at this screenshot / transcribe this” chat, and switch to Cascade-2 on Ollama (or the -Reasoning Omni variant) when we need it to actually think.

5.3 For agentic flows

These are the weights a coding harness wants behind it — Running a coding agent offline covers the harness/UI choice (OpenCode, pi, Cline, …) that points at them.

Model Size (MLX 4-bit) Efficient path Why
Hermes-4-14B (Nous) ~8 GB MLX sideloaded into Osaurus (mlx-community/Hermes-4-14B-4bit); GGUF/Ollama Nous’s smaller tier (a Qwen3-14B post-train). The hermes-function-calling-v1 training dataset is the giveaway — tool-calling is the design centre.
Hermes 4.3-Seed-36B ~20 GB Ollama (official Nous GGUF) or community MLX quants December 2025 release, built on ByteDance’s Seed-OSS-36B. Roughly matches Hermes-4-70B at half the size; its edge over the 14B is steerability and structured-output adherence rather than context (it’s capped at 32k). The efficient agentic daily-driver.
Hermes-4-70B ~40 GB MLX sideloaded into Osaurus (mlx-community/Hermes-4-70B-4bit); GGUF/Ollama Fits 128 GB comfortably (Llama-3.1-70B base), marginally stronger than the 36B on raw benchmarks — but since the 36B matches it at half the memory, run it only to check whether the extra size earns its throughput cost.
Qwen3.6-35B-A3B ~22 GB (MXFP4) Osaurus one-click — the curated menu carries MXFP4 + MTP and JANGTQ builds; also Unsloth’s MLX MXFP4 (27B) and llama.cpp/vLLM with MTP since May 2026 The agentic daily-driver as of April 2026. Same 3B-active MoE shape as its predecessor but reportedly well ahead of it (73.4% SWE-bench claimed), 1M context, natively vision-capable, and ships a trained multi-token-prediction head — ~1.4–2.2× faster decode with no quality cost. Caveat: the Mac MTP path is weeks old (Osaurus hardened it mid-May); if output loops, suspect MTP first.
Qwen3.5-35B-A3B ~23 GB Osaurus one-click — in the curated menu as qwen3.5-35b-a3b-jang_4k (or _2s, ~12 GB); also Ollama (qwen3.5:35b, or qwen3.5:35b-mlx for the MLX build), or MLX community quants The conservative fallback now that 3.6 has landed — same MoE shape, two months more battle-testing, no MTP novelty. The SOV-doc default (written pre-3.6).

5.4 What I would skip (for now)

NousCoder-14B is not what its name suggests in the agentic sense. It’s a Qwen3-14B post-trained via RL on 24k one-shot competitive-programming problems. The model card is explicit: LiveCodeBench v6 Pass@1 67.87%, up 7.08% from baseline. That’s a model for “give me a hard algorithm problem and I will solve it by myself”, not “run a tool-using coding agent over my repo”. Use Hermes-4 for the latter; pick NousCoder if Codeforces-style problems are the actual use case.

5.5 Suggested audition order

  1. DeepSeek-R1-0528-Qwen3-8B for math; Qwen3.6-35B-A3B (MXFP4 + MTP, one click in Osaurus) for agentic. Two pulls, two control samples. If 3.6’s young MTP path misbehaves, drop back to Qwen3.5-35B-A3B or Hermes 4.3-Seed-36B.
  2. Phi-4-Reasoning-Plus-14B alongside DeepSeek for triangulation on hard math.
  3. Hermes-4-70B to test whether the 70B tier earns its higher throughput cost over the 36B at our workloads.
  4. Nemotron-Cascade-2 (Ollama — its JANG quants don’t load in Osaurus, see the table), specifically against long tool-call traces — the one Mamba experiment.
  5. DwarfStar (V4 Flash) is its own thing, covered below.

For the equivalent question on a serious GPU (single H100 to a full DGX), see the small-GPU section of the sovereign-compute companion.

6 Maths specialists

A narrower class than the agentic generalists above: models hard-specialised on mathematical reasoning, which trade away chat fluency and agentic competence to get there. For the generalist-vs-specialist framing and the natural-language solvers vs Lean provers split, see Mathematical reasoning models; for the harness side (TIR loops, prover compile loops, sandbox tiers, the format-mismatch failure mode with generic agent frameworks), see Reasoning harnesses; and for running these beyond the laptop — rented GPUs, hosted endpoints, and the privacy of each — see Maths and proof models, applied.

One Mac convenience up front: the solver-side specialists are nearly all Qwen2.5/Qwen3-based dense transformers, so MLX handles them and they load in Osaurus — by sideload; none of them are in the curated menu. Unlike the Nemotron hybrid, there is no architecture gap to work around here.

6.1 Solving problems

Model Size Mac path Why
OpenMath-Nemotron-32B 32B (~18 GB 4-bit) MLX (mlx-community/OpenMath-Nemotron-32B-4bit) + GGUF NVIDIA’s AIMO-2 winner. AIME-2024 78.4 in tool mode. The best local solver; plain reasoning runs in any backend, the tool and selection modes want NeMo-Skills.
Qwen2.5-Math-72B-Instruct 72B (~40 GB 4-bit) MLX 8-bit + GGUF The push-button tool-mode pick — Qwen-Agent drives its code loop with no special infrastructure. Native in both modes.
Skywork-OR1-Math-7B 7B GGUF (MLX unconfirmed) Best small pure-reasoning specialist — AIME-2024 69.8 at 7B, long traces.

The 14B OpenMath-Nemotron is the efficiency cut: most of the 32B’s numbers at half the footprint, with MLX and GGUF both present.

These models run in both CoT and TIR modes; the headline AIME scores are maj@k figures with test-time scaling switched on, so a single local greedy decode at \(T = 0\) will look notably worse than the marketing number.

6.2 Running TIR on a Mac

For the TIR loop itself and the four ways to wire it up (roll-our-own / Qwen-Agent / NeMo-Skills / generic), see Reasoning harnesses. A handful of Mac-specific runner caveats:

  • Qwen-Agent. Current main runs the sandbox in Docker, so we need Docker Desktop or Colima. Older releases ran the kernel on the host directly.
  • NeMo-Skills. Two Mac wrinkles here. Host the model on mlx_lm.server, not Ollama — the tool path uses the raw /v1/completions text endpoint that Ollama’s shim handles poorly and mlx_lm.server handles properly. Use the documented no-Docker Python sandbox, since the Docker one wants --network=host, which is degraded on Docker Desktop for Mac.
  • Open WebUI’s Pyodide interpreter (SymPy and NumPy preloaded, no network) is the one point-and-click option that fires reliably — but only with a capable general model (Qwen3 30B-class and up), not the maths specialists. Open Interpreter is unmaintained as of early 2026.

For sandboxing model-written Python on a Mac: Linux seccomp doesn’t apply at all, so use the Docker --network none pattern documented in Sandboxing model-generated code rather than reaching for the Linux primitives.

6.3 Proving theorems

Whole-proof Lean provers emit Lean 4 and little else useful — hard-specialisation taken to its limit; a Lean prover can’t hold a conversation. For why the compile-feedback loop avoids false positives (and therefore why provers are sampled at Pass@32+ without anyone worrying), see Proving theorems with a compiler in the loop; for the lean-repl / Pantograph / LeanDojo harness tier and Mathlib version pinning, see Proof self-correction loops.

Model Size Mac path Why
Goedel-Prover-V2-32B 32B MLX 8-bit (mlx-community/Goedel-Prover-V2-32B-8bit) + GGUF The one top-tier prover with a ready MLX build. ~88–90% miniF2F, whole-proof plus self-correction. Start here.
OProver-32B 32B GGUF only The current open-weight leader (~93% miniF2F Pass@32), but that score is trained-in: it’s an agentic prover whose policy bakes in retrieval plus Lean-compiler feedback (the ablation calls feedback the dominant factor), and the official pipeline runs on vLLM + a Kimina Lean Server. On the Mac we get the GGUF weights for plain whole-proof generation; the agentic loop that earns the headline score is a Linux/CUDA affair. No MLX build.
DeepSeek-Prover-V2-7B 7B MLX + GGUF, every quant Lowest-friction on-ramp; lower ceiling than the 32Bs.

Autoformalization (turning a natural-language statement into Lean) is the awkward step locally — the clearest specialist (Herald) ships safetensors only, no GGUF or MLX, so on the Mac that conversion currently has to come from a hand-written statement or a hosted endpoint.

7 The JANG ecosystem

Osaurus wraps osaurus-ai/vmlx-swift-lm, a Swift MLX inference engine. That engine is a Swift port of jjang-ai/vmlx, a Python engine. Both load weights from the JANGQ-AI Hugging Face org, a zoo of mixed-precision quantised models in a custom format called JANG — converted with JANG Studio, a native macOS wizard, with the newer codebook variant branded JANGTQ (“JANG TurboQuant”). The same person wrote each of those — Jinho “Eric” Jang (Irvine, California; also Osaurus’s lead/only engineer). There is a parallel desktop app, MLX Studio, by the same author, running the Python engine and surfacing more experimental features (image generation, agentic tool calling, in-app model conversion). The Jang family of enterprises is a tightly integrated stack: runtime, quant format, model zoo, two GUIs — one developer. That vertical integration buys fast iteration and a coherent feature set across the chain; it also means that if Jang loses interest, switches jobs, or gets hit by a bus, the lot — JANG quants and JANG-format model files included — becomes abandonware. There is some community wariness about exactly this; see the r/LocalLLaMA “Is MLX Studio legit?” thread. It is all open source, so in principle we could maintain it ourselves if he walks away.

Jang is, on the visible evidence, a talented coder. His public GitHub profile records ~4,000 contributions in the last year, which is a lot even for AI-assisted coding, assuming it actually works (which it does). His JANG repo lists a steady stream of model-architecture support landing on the order of days after each new release. On Apple Silicon at the high end he is doing things nobody else is doing.

7.1 Tech stack

JANG (“Jang Adaptive N-bit Grading”) is mixed-precision quantisation for MLX. Standard MLX quantisation compresses every tensor to the same bit width. JANG classifies tensors by sensitivity — attention and MoE router layers (small share of params, large share of model behaviour) get 6–8 bits; expert MLPs get 2–4. The hybrid network is a mildly extended version of the standard MLX safetensors format with a per-tensor bit-width manifest. At the same total size, accuracy improves, notionally. The pitch is “GGUF for MLX”, which … sounds fine? I’m not really competent to judge. Apparently llama.cpp’s K-quants do something similar. The other half of the GGUF quality story is the imatrix calibration data fed in at quantisation time — which is why a bartowski/…-GGUF repo (like the Nemotron one above) is a slightly different, usually better thing than a bare K-quant of the same weights.

jangq.ai claims impressive performance: at time of writing, JANG_2L at 82.5 GB scoring 74% MMLU against MLX 4-bit at 119.8 GB scoring 26.5% on MiniMax-M2.5; 397B-parameter models fitting on 128 GB Macs. The page also notes it has been “filtered to decisive smaller wins only” — close comparisons and unfavourable cases are not shown — so calibrate accordingly. At least one third-party benchmark is pretty impressed.

Osaurus is JANG native. Pull a model from JANGQ-AI and it loads — mostly. The Swift engine’s coverage tracks the JANGTQ path; a plain JANG_* quant of an exotic architecture can fail at weight-load (the Cascade-2 sideload failure above — the Python jang-tools stack handles those, the Swift engine does not yet). Elsewhere, support is partial: MLX Studio and vMLX natively, oMLX via a merged integration (per the JANG README), LM Studio / Ollama / Jan not yet. From Python: uv pip install “jang[mlx]”, then jang_tools.loader.load_jang_model(...).

7.2 MLX Studio

MLX Studio is the same author’s other Mac desktop app — Electron + Python rather than Swift, broader feature surface (image generation via Flux and Z-Image, ~26 built-in agentic tools, in-app GGUF→MLX and MLX→JANG conversion, an Anthropic-compatible API). Install via the signed DMG on the releases page, or engine-only with uv tool install vmlx and vmlx serve mlx-community/<repo> (OpenAI-compatible on localhost:8000). It doesn’t quite land for me, at least not in comparison to Osaurus.

8 Antirez and DwarfStar

There is another weird Mac-only stack of interest to me: Salvatore Sanfilippo — antirez, the author of Redis — wrote some custom Apple Silicon inference code to run DeepSeek V4 Flash on a 128 GB MacBook, and a whole tiny supergroup of famed developers has grown up around it.

The approximate trajectory is as follows. April 2026: apparently moments after the DeepSeek V4 release, antirez drops antirez/llama.cpp-deepseek-v4-flash, a fork of llama.cpp with 2-bit quantisation, plus the matching GGUF at antirez/deepseek-v4-gguf.

A month later, he drops a from-scratch native Metal inference engine, ds4 (DwarfStar 4 to its friends) narrowly targeting DeepSeek V4 Flash and, I guess, a narrow family of derivatives. Targets M3 Max, M3 Ultra, and M5 Max specifically. Reported numbers are pretty snappy — ~14–15 tok/s decode at 62K context on an M3 Max 128 GB, ~450 tok/s prompt-processing on an M5 Max for a 10k-token codebase.

Like JANG, this is a small, specialised stack run by one guy; except, as we see below, there is a little more community buy-in.

8.1 Running DwarfStar via the pi stack

The default harness for ds4 seems to be: pi, an MIT-licensed agent harness by Mario Zechner (badlogic, of libGDX fame) — itself a strong offline coding agent once a model is behind it. There is an easy install via the pi extension by Armin Ronacher (mitsuhiko, of Flask): mitsuhiko/pi-ds4. It handles process management for ds4-server — per-PID leases, watchdog shutdown, OpenAI-compatible local endpoint on 127.0.0.1:8000:

pi install https://github.com/mitsuhiko/pi-ds4

First-time install clones antirez/ds4, builds it, downloads the GGUF (~87 GB), and registers a ds4/deepseek-v4-flash model with pi. Subsequent runs spawn the server on demand and shut it down when no client process holds a lease. OpenClaw embeds pi, so the same extension can in principle load there.

pi from the terminal opens a TUI (“textual user interface” — I think that’s what it means, i.e. it lives in the terminal).

Audrey Tang maintains audreyt/pi-ds4, a fork that swaps in cyberneurova’s abliterated IQ2XXS quants and turns on uncertainty-mode directional steering by default — an activation-space edit that puts the model into “this is a contested question” mode on CCP-sensitive topics (Taiwan, Crimea, Kashmir, Western Sahara).

8.2 Manual setup for non-pi harnesses

Outside the pi ecosystem, the manual setup is four commands plus a config edit.

# antirez/ds4 for upstream; also audreyt/ds4 looks cool
# optimisations + steering-vector work — pick one
git clone https://github.com/audreyt/ds4
cd ds4
make
tmutil addexclusion -p (realpath ./gguf)
./download_model.sh                     # ~87 GB into ./gguf/
./ds4-server                            # listens on 127.0.0.1:8000

For lifecycle, we could wrap ./ds4-server in a launchd plist with KeepAlive: true; this is probably not what we want on a typical laptop where we do other things than inference, like, you know, use it as a laptop. I think pi is more automatic in that regard.

ds4-server’s context window is set at launch via --ctx <tokens> (max accepted per conversation) and --tokens <n> (default max output); --kv-cache <path> spills the KV store to disk for long-context runs, and --think turns on DeepSeek’s reasoning mode. DeepSeek V4 Flash nominally supports 1M tokens, but ds4 is RAM-bound: the 2-bit IQ2XXS weights are ~81 GB, and a full 1M-token KV/index sits around 26 GB on top. Rough budget on unified memory:

  • 64 GB: 50k–150k --ctx with headroom.
  • 96 GB: 150k–250k works but is tight; quit Slack.
  • 128 GB: 200k–300k is comfortable; >300k starts risking OOM.
  • 1M: only with very generous memory and nothing else running.

If a client (Hermes, OpenClaw, OpenCode, anything OpenAI-compatible) advertises a context larger than --ctx, requests will get cut off — match the client’s contextWindow / limit.context to the server’s --ctx. DeepSeek’s sparse attention means raising --ctx doesn’t blow up compute the way dense attention would, but RAM is still the binding constraint. For most interactive coding, 32k–100k plus a retrieval layer beats brute-forcing the whole history into the prompt. See antirez/ds4’s README and the OpenClaw ds4 provider docs for the full flag list and client-side config.

Reasonable defaults:

./ds4-server --ctx 200000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192

To bake in audreyt’s directional-steering defaults:

DS4_DIR_STEERING_FFN=-0.75 \
DS4_DIR_STEERING_ATTN=0 \
DS4_REPRODUCIBLE=1 \
./ds4-server

To use with Hermes, add an OpenAI-compatible provider entry to the Hermes config (sketch — confirm the exact schema with hermes config):

# ~/.hermes/config.yaml
custom_providers:
- name: ds4
  base_url: http://127.0.0.1:8000/v1
  model: deepseek-v4-flash
  models:
    deepseek-v4-flash:
      context_length: 200000

From inside Hermes, /model ds4/deepseek-v4-flash — matching the provider name in the YAML above. Done.

Anyway, this gets us a generic token endpoint, so we’re free to plug in whatever on the front end.

The protagonists of this play have a lot of clout — antirez (Redis), mitsuhiko (Flask), badlogic (libGDX), and audreyt (Taiwan’s former Digital Minister, Pugs / Perl 6). Some kind of critical mass seems feasible for a certain type of nerd.

9 Memory management

We need to think about how much memory our machine has overall, and how much of that it will let us use for MLX workloads.

On the first point, TIL that macOS’s “Memory Used” indicator does not measure how much RAM is committed in the way I assumed. It counts caching usage in some unproductive way. “Memory” — green / yellow / red in Activity Monitor, or Pages purgeable and Pages compressed from vm_stat — measures available RAM. macOS aggressively fills RAM with discardable file-cache pages. mactop is a handy resource monitor that doesn’t itself use too much memory.

On the second point, there are limits on how much any given process is allowed to take up of the precious system memory. macOS sets a hard limit on how much RAM Metal — and therefore MLX — is allowed to wire (lock into physically resident, GPU-accessible memory). The default is ~67% on Macs ≤36 GB and ~75% on larger ones. On a 128 GB Mac that means MLX refuses to allocate past ~96 GB, regardless of how much actually-free memory there is. Raise it at runtime:

# Cap MLX at 112 GB — leaves ~16 GB for the OS and other apps
sudo sysctl iogpu.wired_limit_mb=114688

# Confirm
sudo sysctl iogpu.wired_limit_mb

# Reset to default
sudo sysctl iogpu.wired_limit_mb=0

This does not persist across reboots — we would need to wrap it in a LaunchDaemon or /etc/sysctl.conf entry to make it sticky.

Setting it to the full 128 GB is not wise. If MLX wires more than the OS can spare, the machine kernel-panics.

The runtimes also manage this themselves, each in its own way: mlx-lm wires the memory occupied by model and cache when a model is large relative to RAM (macOS 15+), and the Swift stack under Osaurus exposes wired-memory policies and tickets that raise the process limit around active inference rather than pinning one fixed number.

But also, before launching a big run:

  • sudo purge flushes the file cache so the OS has clean room to allocate. Available RAM jumps; subsequent file I/O is slower until the cache refills.
  • Quit Electron apps. Slack, Discord, Cursor, VS Code, Chrome will routinely pin 4–8 GB each.
  • MLX_LM_CACHE_LIMIT=0 (env var) prevents MLX’s internal allocation cache from growing unboundedly during long sessions — useful for sustained embedding or agent workloads.

10 Feeding PDFs in

This section grew until it became its own notebook: PDF ingestion — which Mac frontends read PDFs natively, the converter decision (markitdown / marker / Granite-Docling / MinerU), the empirically-debugged recipes, and the pdf-ingest agent skill.

11 Let’s break things

I came to understand the transformers/llama.cpp split by breaking it.

The Hugging Face and Ollama versions of mxbai-embed-large are nominally the same model — same upstream weights — but each stack implemented its own tokenizer. On plain prose the two mostly agree, I think; on markdown they can disagree by a few percent on how many tokens a chunk takes. Best not to mix and match. For embeddings on this blog I went all-transformers. For chat through a live server — where tokenisation stays internal to one stack and we eyeball the output — Ollama is fine.

12 Excluding model dirs from background services

The model weights are enormous and waste space in backups. There is no point

  • backing up a quantised .gguf we can pull again in two commands, nor
  • indexing .safetensors files for Spotlight — they are opaque binary blobs and Spotlight will spin happily for hours grinding nothing useful out of them.

Solution!

# One list, two background services to opt out of
model_dirs=(
  ~/.cache/huggingface
  ~/.cache/uv
  ~/.ollama/models
  ~/.lmstudio
  "$HOME/Library/Application Support/Jan/data/llamacpp/models"
  "$HOME/Library/Application Support/Jan/data/mlx/models"
  ~/MLXModels                          # Osaurus default (override: OSU_MODELS_DIR)
  ~/.mlxstudio/models                  # MLX Studio default
)

# Time Machine — sticky exclusion keyed to the path string
for d in "${model_dirs[@]}"; do
  [ -d "$d" ] && sudo tmutil addexclusion -p "$d"
done

# Spotlight — drop the Apple-documented marker file in each directory
for d in "${model_dirs[@]}"; do
  [ -d "$d" ] && touch "$d/.metadata_never_index"
done

# Confirm a few
tmutil isexcluded ~/.cache/huggingface
ls -la ~/MLXModels/.metadata_never_index

.metadata_never_index is the Apple-supported marker file that tells mds_stores to skip the directory and everything under it; the file is empty and the marker is the filename.

If we ever want to re-index a directory (model dir promoted to “actual content”), rm .metadata_never_index and mdimport -r <dir> puts it back.

13 See also