Running LLMs locally on a Mac
Osaurus, Ollama, and the transformers path
2026-05-23 — 2026-05-31
Wherein the Local-LLM Stack on Apple Silicon Is Surveyed From Weights to Frontend, Single-Developer Runtimes by the Authors of Redis and Flask Are Encountered, and a Wired-Memory Ceiling Is Noted.
A twin post to front-end clients for AI image models, but for text. The local-LLM ecosystem on Macs is pretty luxurious, with a profusion of GUI options and Linux-y infra, and some specialised tooling that lags the community frontier but is not bad. Also, during the 2026 ramageddon, Macs suddenly look like remarkably good deals for high-RAM parallel-compute machines. I accidentally started going unreasonably deep and technical on this in the SOV repo. That repo really targets my coding assistant. Here is a human-facing version.
1 The stack
The cross-cutting vocabulary for AI-agent infrastructure — model, quantization format, runtime / inference engine, server / daemon, harness / agent loop, frontend / chat client — is defined once in AI agents, applied. The rest of this page assumes that vocabulary.
Quick reminder of where the Mac-specific tooling sits in those layers:
- Runtime:
llama.cpp, MLX,mlx-lm,vmlx-swift-lm, antirez’sds4. - Server:
ollama serve,llama-server,mlx_lm.server, Osaurus,ds4-server, Unsloth Studio’s local endpoint. - Harness: Osaurus has one built in; pi is a popular standalone one.
- Frontend: Osaurus’s chat window, Jan, LM Studio, MLX Studio.
Most apps on this page are vertical bundles across several of those layers — Osaurus is frontend + harness + server + runtime in one, Ollama is server + runtime — but it pays to know which layer we’re looking at when an abstraction leaks (the tokenizer divergence below is a good example).
2 Just chat with a model
The fastest path from zero to local LLM is a desktop app that bundles a model browser, a chat window, and an inference engine. There seem to be several main contenders. The Mac-native Osaurus is IMO the best, but it is a risky choice.
2.1 Osaurus
Osaurus (MIT, brew install --cask osaurus, osaurus-ai/osaurus) is Swift-native, no Electron, no Python, and behaves like a proper Mac app.
I’m YOLOing all in on this because it seems efficient and easy. It locks us in to the Mac ecosystem so might not be for everyone. Also, it’s run by one person, so the bus factor is 1, which is risky. But — it’s so good!
The window has a model picker, a chat pane, a status indicator; the inference engine underneath is MLX (Apple’s own array framework for the M-series), so it gets fast tokens per second, fully leveraging the hardware.
Osaurus is also a server — it exposes OpenAI-, Anthropic-, and Ollama-compatible HTTP endpoints on localhost:1337, seemingly all three at once, which is some deep magic. Once we have it installed, anything else we want to point at a local model (code editor, embedding pipeline, Claude-Code-style tool) can talk to this local endpoint. There is also an agent harness with MCP server-and-client; think Claude Desktop, maybe slightly rougher but with some extra features.
Osaurus is also the intended runtime for a custom mixed-precision MLX quantization format called JANG.
2.2 Jan
Jan is a FOSS cross-platform option. It looks nice. The UI is built on the mildly cursed Electron, but the plus side is that it runs on Linux, Windows, and macOS. It supports both llama.cpp (via Cortex) and MLX backends. For cross-platform use it looks pretty good. For brute-force running non-Apple-optimised models, it seems to go.
2.3 LM Studio
LM Studio is closed-source, relatively slick and turnkey, not free for commercial use, and still runs llama.cpp underneath. I’m mildly sceptical of it because so many Cool LLM Technologies ship special bug fixes or alternate install paths for LM Studio, which hints at a slightly non-standard stack — though that might just be sampling bias, since more people file bug reports when more people run the thing.
3 Serving a model headless
Once we want a model serving as a daemon (“token fountain” we say at my work) rather than a chat window — a code editor, an embedding pipeline, a script that calls out to a local model — we want a long-lived daemon with an OpenAI-compatible API.
If we already have Osaurus running we are mostly done; it is already that daemon, listening on localhost:1337 with three flavours of compatible API. But also, for reasons of not installing an idiosyncratic stack, or superior customization, we might want to install the standard Linux server stack.
3.1 Ollama
Ollama (brew install ollama) is a llama.cpp wrapper with its own model registry — fast enough, wide model coverage, and notably good for embedding models :
Anything OpenAI-API-compatible can now point at http://localhost:11434/v1. It’s smart to make it evict idle models, or it will load up many and tank the machine. OLLAMA_KEEP_ALIVE=15m does that.
Gotchas:
- The
.gguffiles come from Ollama’s registry, not Hugging Face. - Some weird reimplementation headaches — e.g. the tokenizer baked into the GGUF can differ from the original for unclear reasons.
3.2 Unsloth Studio
Unsloth is a different beast from Ollama — it’s a fine-tuning toolkit first, with a serving endpoint as a side effect. The headline claim is “2× faster training with 70% less VRAM” against vanilla transformers, achieved via custom kernels. Unsloth Studio wraps the toolkit in a browser-based no-code UI: ingest PDFs / CSVs / DOCX into a dataset, run training, export to GGUF or safetensors, evaluate side by side against the base model, and serve the result through a local OpenAI-compatible /v1 endpoint.
The reason to install Unsloth instead of (or alongside) Ollama is the fine-tuning side: train a LoRA on our own data, deploy it, query it, all from one place. The serving endpoint is the means of using what we just fine-tuned, not a competitor to Ollama as a generic backend.
Two Mac-specific caveats. First, Unsloth is CUDA-first; its signature kernels target NVIDIA GPUs. Apple Silicon support exists per the docs, but some paths fall back to slower implementations, and the VRAM-saving benchmarks are calibrated against a CUDA transformers baseline, not against MLX. I have not measured the Mac numbers myself. Second, if the priority is fine-tuning specifically on Apple Silicon, MLX-LM ships its own LoRA training path — fewer features, less polish, but native MLX kernels and no fallback paths.
3.3 (even-more-)Power-user options
llama.cpp itself ships a server: llama-server -m model.gguf exposes the same OpenAI-compatible endpoint with no daemon, no registry, no opinions. brew install llama.cpp. I’m not really sure when this would seem like a good idea? Ultrahobbyists? Trying to get a job at Anthropic? If I were going this deep I’d probably run it via transformers in my own Python process.
Apple Silicon optimised: mlx-lm ships its own server too: uv tool install mlx-lm then mlx_lm.server --model mlx-community/<repo>. One model per process; restart to switch.
4 Programmatic access via transformers
When we want to do things to a model — embed text, fine-tune, run interpretability tools, sample from internal layers, anything that touches the model internals — we drop down to Hugging Face transformers in our own Python process. It doesn’t need or want to be a server.
Embeddings for search are why I currently do this. For this I want sentence-transformers, a thin wrapper around transformers that exposes the embedding API:
The first call downloads the weights, the tokenizer config (tokenizer.json), and the model config from huggingface.co into ~/.cache/huggingface/hub/. Inference runs in our process via PyTorch. Tokenization runs in the same process, via HF’s Rust tokenizers library reading the same tokenizer.json the model was published with. One process, one library, one set of files.
On Apple Silicon we get a 15× speedup over fp32 by switching to float16 on MPS, with indistinguishable quality:
The fp16 path matters on GPU or Apple Silicon, otherwise stay in fp32.
For text generation (rather than embedding) the equivalent is AutoModelForCausalLM.from_pretrained(...). PyTorch is rarely the fastest path on Apple Silicon — llama.cpp and MLX usually win on tokens-per-second — but it is the path that lets us see what the model is doing without arsing around. Activations, attention patterns, hidden states, custom sampling etc. are all possible from the Python prompt.
5 Models worth auditioning
Once we have a stack running, the next question is what to pull through it. A non-exhaustive list of picks worth running on a 128 GB Mac — the tier-shift matters here, since the “constrained” picks people post about for 16 GB laptops are not the ceiling for us, they are the starting point.
5.1 For mathematical reasoning
| Model | Size (MLX 4-bit) | Efficient path | Why |
|---|---|---|---|
| DeepSeek-R1-0528-Qwen3-8B | ~5 GB | MLX via Osaurus (Qwen3-8B arch, so any MLX runtime handles it); Ollama GGUF as the portable fallback | The current small-model math champion — AIME-2024 86%, matching Qwen3-235B-thinking on that benchmark specifically (it trails on AIME-2025 and GPQA, so don’t over-generalize). The baseline to compare everything else against. |
| Phi-4-Reasoning-Plus-14B | ~8 GB | MLX via Osaurus; GGUF/Ollama | A different reasoning-trace style from DeepSeek’s distills — useful to triangulate when DeepSeek gives an answer we’re unsure about, not a stronger math model (it trails the 8B on AIME-2024). |
| Nemotron-Cascade-2-30B-A3B | ~24 GB (Ollama) | Ollama — ollama run nemotron-cascade-2 (256K context, thinking + tools), which wraps the llama.cpp GGUF. Pull Q5_K_M (ollama run hf.co/bartowski/nvidia_Nemotron-Cascade-2-30B-A3B-GGUF:Q5_K_M); skip Q6 — it’s Q8-sized here. MLX is the gap: mlx-lm still chokes on the hybrid nemotron_h MoE gate, so it’s absent from the JANG/Osaurus menu (which ships the unrelated multimodal Nemotron-3-Nano-Omni instead). |
NVIDIA’s IMO-2025-gold math model (vendor-reported). The architecture is the weird part: a hybrid of Mamba (a state-space model, not a transformer) and MoE, pitched for linear context scaling so long tool-call histories don’t blow up RAM the way they do on dense attention. Audition it against “does linear-scaling attention show up in actual work?” |
(I’ve dropped the older DeepSeek-R1-Distill-Qwen-32B here — it’s a Qwen2.5-era distill that the 0528-Qwen3-8B now matches on math at a quarter the size.)
Two Nemotrons, don’t confuse them. Cascade-2 above and the Nemotron-3-Nano-Omni-30B-A3B that Osaurus ships by default are both 30B-A3B post-trains off the same Nemotron-3-Nano base — same hybrid Mamba-MoE skeleton, different specialization. Cascade-2 is the text reasoning/math build (Ollama-only); Nano-Omni is the multimodal one (image + audio + video in), which is exactly why JANG packaged it and Osaurus defaults to it. So the Osaurus default is a local multimodal generalist, not the math champion — use it for “look at this screenshot / transcribe this” chat, and switch to Cascade-2 on Ollama (or the -Reasoning Omni variant) when we need it to actually think.
5.2 For agentic flows
| Model | Size (MLX 4-bit) | Efficient path | Why |
|---|---|---|---|
| Hermes-4-14B (Nous) | ~8 GB | MLX via Osaurus (mlx-community/Hermes-4-14B-4bit); GGUF/Ollama |
Nous’s smaller tier (a Qwen3-14B post-train). The hermes-function-calling-v1 training dataset is the giveaway — tool-calling is the design centre. |
| Hermes 4.3-Seed-36B | ~20 GB | Ollama (official Nous GGUF) or community MLX quants | December 2025 release, built on ByteDance’s Seed-OSS-36B. Roughly matches Hermes-4-70B at half the size; its edge over the 14B is steerability and structured-output adherence rather than context (it’s capped at 32k). The efficient agentic daily-driver. |
| Hermes-4-70B | ~40 GB | MLX via Osaurus (mlx-community/Hermes-4-70B-4bit); GGUF/Ollama |
Fits 128 GB comfortably (Llama-3.1-70B base), marginally stronger than the 36B on raw benchmarks — but since the 36B matches it at half the memory, run it only to check whether the extra size earns its throughput cost. |
| Qwen3.5-35B-A3B | ~23 GB | Ollama (qwen3.5:35b-a3b), MLX community quants, or JANG (which ships the newer 3.6) |
MoE daily-driver — 3B active, MLX-friendly. The SOV-doc default. |
5.3 What I would skip (for now)
NousCoder-14B is not what its name suggests in the agentic sense. It’s a Qwen3-14B post-trained via RL on 24k one-shot competitive-programming problems. The model card is explicit: LiveCodeBench v6 Pass@1 67.87%, up 7.08% from baseline. That’s a model for “give me a hard algorithm problem and I will solve it by myself”, not “run a tool-using coding agent over my repo”. Use Hermes-4 for the latter; pick NousCoder if Codeforces-style problems are the actual use case.
5.4 Suggested audition order
- DeepSeek-R1-0528-Qwen3-8B for math; Hermes 4.3-Seed-36B for agentic. Two pulls, two control samples.
- Phi-4-Reasoning-Plus-14B alongside DeepSeek for triangulation on hard math.
- Hermes-4-70B to test whether the 70B tier earns its higher throughput cost over the 36B at our workloads.
- Nemotron-Cascade-2 (Ollama — it’s not in the JANG/Osaurus menu), specifically against long tool-call traces — the one Mamba experiment.
- DwarfStar (V4 Flash) is its own thing, covered below.
For the equivalent question on a serious GPU (single H100 to a full DGX), see the small-GPU section of the sovereign-compute companion.
6 Maths specialists
A narrower class than the agentic generalists above: models hard-specialized on mathematical reasoning, which trade away chat fluency and agentic competence to get it. For the generalist-vs-specialist framing and the natural-language solvers vs Lean provers split, see Mathematical reasoning models; for the harness side (TIR loops, prover compile loops, sandbox tiers, the format-mismatch failure mode with generic agent frameworks), see Reasoning harnesses; and for running these beyond the laptop — rented GPUs, hosted endpoints, and the privacy of each — see Maths and proof models, applied.
One Mac convenience up front: the solver-side specialists are nearly all Qwen2.5/Qwen3-based dense transformers, so MLX handles them and they load in Osaurus. Unlike the Nemotron hybrid, there is no architecture gap to work around here.
6.1 Solving problems
| Model | Size | Mac path | Why |
|---|---|---|---|
| OpenMath-Nemotron-32B | 32B (~18 GB 4-bit) | MLX (mlx-community/OpenMath-Nemotron-32B-4bit) + GGUF |
NVIDIA’s AIMO-2 winner. AIME-2024 78.4 in tool mode. The best local solver; plain reasoning runs in any backend, the tool and selection modes want NeMo-Skills. |
| Qwen2.5-Math-72B-Instruct | 72B (~40 GB 4-bit) | MLX 8-bit + GGUF | The push-button tool-mode pick — Qwen-Agent drives its code loop with no special infrastructure. Native in both modes. |
| Skywork-OR1-Math-7B | 7B | GGUF (MLX unconfirmed) | Best small pure-reasoning specialist — AIME-2024 69.8 at 7B, long traces. |
The 14B OpenMath-Nemotron is the efficiency cut: most of the 32B’s numbers at half the footprint, MLX and GGUF both present.
These models run in both CoT and TIR modes; the headline AIME scores are maj@k figures with test-time scaling switched on, so a single local greedy decode at \(T = 0\) will look notably worse than the marketing number.
6.2 Running TIR on a Mac
For the TIR loop itself and the four ways to wire it up (roll-our-own / Qwen-Agent / NeMo-Skills / generic), see Reasoning harnesses. A handful of Mac-specific runner caveats:
- Qwen-Agent. Current
mainruns the sandbox in Docker, so we need Docker Desktop or Colima. Older releases ran the kernel on the host directly. - NeMo-Skills. Two Mac wrinkles here. Host the model on
mlx_lm.server, not Ollama — the tool path uses the raw/v1/completionstext endpoint that Ollama’s shim handles poorly andmlx_lm.serverhandles properly. Use the documented no-Docker Python sandbox, since the Docker one wants--network=host, which is degraded on Docker Desktop for Mac. - Open WebUI’s Pyodide interpreter (SymPy and NumPy preloaded, no network) is the one point-and-click option that fires reliably — but only with a capable general model (Qwen3 30B-class and up), not the maths specialists. Open Interpreter is unmaintained as of early 2026.
For sandboxing model-written Python on a Mac: Linux seccomp doesn’t apply at all, so use the Docker --network none pattern documented in Sandboxing model-generated code rather than reaching for the Linux primitives.
6.3 Proving theorems
Whole-proof Lean provers emit Lean 4 and little else useful — hard-specialization taken to its limit; a Lean prover can’t hold a conversation. For why the compile-feedback loop avoids false positives (and therefore why provers are sampled at Pass@32+ without anyone worrying), see Proving theorems with a compiler in the loop; for the lean-repl / Pantograph / LeanDojo harness tier and Mathlib version pinning, see Proof self-correction loops.
| Model | Size | Mac path | Why |
|---|---|---|---|
| Goedel-Prover-V2-32B | 32B | MLX 8-bit (mlx-community/Goedel-Prover-V2-32B-8bit) + GGUF |
The one top-tier prover with a ready MLX build. ~88–90% miniF2F, whole-proof plus self-correction. Start here. |
| OProver-32B | 32B | GGUF only | The current open-weight leader (~93% miniF2F Pass@32), but that score is trained-in: it’s an agentic prover whose policy bakes in retrieval plus Lean-compiler feedback (the ablation calls feedback the dominant factor), and the official pipeline runs on vLLM + a Kimina Lean Server. On the Mac we get the GGUF weights for plain whole-proof generation; the agentic loop that earns the headline is a Linux/CUDA affair. No MLX build. |
| DeepSeek-Prover-V2-7B | 7B | MLX + GGUF, every quant | Lowest-friction on-ramp; lower ceiling than the 32Bs. |
Autoformalization (turning a natural-language statement into Lean) is the awkward step locally — the clearest specialist (Herald) ships safetensors only, no GGUF or MLX, so on the Mac that conversion currently has to come from a hand-written statement or a hosted endpoint.
7 The JANG ecosystem
Osaurus wraps osaurus-ai/vmlx-swift-lm, a Swift MLX inference engine. That engine is a Swift port of jjang-ai/vmlx, a Python engine. Both load weights from the JANGQ-AI Hugging Face org, a zoo of mixed-precision quantised models in a custom format called JANG — converted with JANG Studio, a native macOS wizard, with the newer codebook variant branded JANGTQ (“JANG TurboQuant”). The same person wrote each of those — Jinho “Eric” Jang (Irvine, California; also Osaurus’s lead/only engineer). There is a parallel desktop app, MLX Studio, by the same author, running the Python engine and surfacing more experimental features (image generation, agentic tool calling, in-app model conversion). The Jang family of enterprises is a tightly integrated stack: runtime, quant format, model zoo, two GUIs — one developer. That vertical integration buys fast iteration and a coherent feature set across the chain; it also means that if Jang loses interest, switches jobs, or gets hit by a bus, the lot — JANG quants and JANG-format model files included — becomes abandonware. There is some community wariness about exactly this; see the r/LocalLLaMA “Is MLX Studio legit?” thread. It is all open source, so in principle we could maintain it ourselves if he bounces.
Jang is, on the visible evidence, a gun coder. His public GitHub profile records ~4,000 contributions in the last year, which is a lot even for AI-assisted coding, assuming it actually works (which it does). His JANG repo lists a steady stream of model-architecture support landing on the order of days after each new release. On Apple Silicon at the high end he is doing things nobody else is doing.
7.1 Tech stack
JANG (“Jang Adaptive N-bit Grading”) is mixed-precision quantisation for MLX. Standard MLX quantisation compresses every tensor to the same bit width. JANG classifies tensors by sensitivity — attention and MoE router layers (small share of params, large share of model behaviour) get 6–8 bits; expert MLPs get 2–4. The hybrid network is a mildly extended version of the standard MLX safetensors format with a per-tensor bit-width manifest. At the same total size, accuracy improves, notionally. The pitch is “GGUF for MLX”, which … sounds fine? I’m not really competent to judge. Apparently llama.cpp’s K-quants do something similar. The other half of the GGUF quality story is the imatrix calibration data fed in at quantization time — which is why a bartowski/…-GGUF repo (like the Nemotron one above) is a slightly different, usually better thing than a bare K-quant of the same weights.
jangq.ai claims impressive performance: at time of writing, JANG_2L at 82.5 GB scoring 74% MMLU against MLX 4-bit at 119.8 GB scoring 26.5% on MiniMax-M2.5; 397B-parameter models fitting on 128 GB Macs. The page also notes it has been “filtered to decisive smaller wins only” — close comparisons and unfavourable cases are not shown — so calibrate accordingly. At least one third-party benchmark is pretty impressed.
Osaurus is JANG native. Pull a model from JANGQ-AI and it loads. Elsewhere, support is partial: MLX Studio and vMLX natively, LM Studio / Ollama / Jan not yet. From Python: uv pip install “jang[mlx]”, then jang_tools.loader.load_jang_model(...).
7.2 Gotchas
The osaurus CLI is useful. If you do not install via bash it is not there.
7.3 MLX Studio
MLX Studio is the same author’s other Mac desktop app — Electron + Python rather than Swift, broader feature surface (image generation via Flux and Z-Image, ~26 built-in agentic tools, in-app GGUF→MLX and MLX→JANG conversion, an Anthropic-compatible API). It doesn’t quite land for me, at least not in comparison to Osaurus.
8 Antirez and DwarfStar
There is another weird Mac-only stack of interest to me: Salvatore Sanfilippo — antirez, the author of Redis — wrote some custom Apple Silicon inference code to run DeepSeek V4 Flash on a 128 GB MacBook, and a whole tiny supergroup of famed developers has grown up around it.
The approximate trajectory is as follows. April 2026: apparently moments after the DeepSeek V4 release, antirez drops antirez/llama.cpp-deepseek-v4-flash, a fork of llama.cpp with 2-bit quantisation, plus the matching GGUF at antirez/deepseek-v4-gguf.
A month later, he drops a from-scratch native Metal inference engine, ds4 (DwarfStar 4 to its friends) narrowly targeting DeepSeek V4 Flash and, I guess, a narrow family of derivatives. Targets M3 Max, M3 Ultra, and M5 Max specifically. Reported numbers are pretty snappy — ~14–15 tok/s decode at 62K context on an M3 Max 128 GB, ~450 tok/s prompt-processing on an M5 Max for a 10k-token codebase.
Like JANG, this is a small, specialised stack run by one guy; except, as we see below, there is a little more community buy-in.
8.1 Running DwarfStar via the pi stack
The default harness for ds4 seems to be: pi, an MIT-licensed agent harness by Mario Zechner (badlogic, of libGDX fame). There is an easy install via the pi extension by Armin Ronacher (mitsuhiko, of Flask): mitsuhiko/pi-ds4. It handles process management for ds4-server — per-PID leases, watchdog shutdown, OpenAI-compatible local endpoint on 127.0.0.1:8000:
First-time install clones antirez/ds4, builds it, downloads the GGUF (~87 GB), and registers a ds4/deepseek-v4-flash model with pi. Subsequent runs spawn the server on demand and shut it down when no client process holds a lease. OpenClaw embeds pi, so the same extension can in principle load there.
pi from the terminal opens a TUI (“textual user interface” I think that means, i.e. it lives in the terminal).
Audrey Tang maintains audreyt/pi-ds4, a fork that swaps in cyberneurova’s abliterated IQ2XXS quants and turns on uncertainty-mode directional steering by default — an activation-space edit that puts the model into “this is a contested question” mode on CCP-sensitive topics (Taiwan, Crimea, Kashmir, Western Sahara).
8.2 Manual setup for non-pi harnesses
Outside the pi ecosystem, the manual setup is four commands plus a config edit.
For lifecycle, we could wrap ./ds4-server in a launchd plist with KeepAlive: true; this is probably not what you want on a typical laptop where you do other things than inference, like, you know, use it as a laptop. I think pi is more automatic in that regard.
ds4-server’s context window is set at launch via --ctx <tokens> (max accepted per conversation) and --tokens <n> (default max output); --kv-cache <path> spills the KV store to disk for long-context runs, and --think turns on DeepSeek’s reasoning mode. DeepSeek V4 Flash nominally supports 1M tokens, but ds4 is RAM-bound: the 2-bit IQ2XXS weights are ~81 GB, and a full 1M-token KV/index sits around 26 GB on top. Rough budget on unified memory:
- 64 GB: 50k–150k
--ctxwith headroom. - 96 GB: 150k–250k works but is tight; quit Slack.
- 128 GB: 200k–300k is comfortable; >300k starts risking OOM.
- 1M: only with very generous memory and nothing else running.
If a client (Hermes, OpenClaw, OpenCode, anything OpenAI-compatible) advertises a context larger than --ctx, requests will get cut off — match the client’s contextWindow / limit.context to the server’s --ctx. DeepSeek’s sparse attention means raising --ctx doesn’t blow up compute the way dense attention would, but RAM is still the binding constraint. For most interactive coding, 32k–100k plus a retrieval layer beats brute-forcing the whole history into the prompt. See antirez/ds4’s README and the OpenClaw ds4 provider docs for the full flag list and client-side config.
Reasonable defaults:
To bake in audreyt’s directional-steering defaults:
To use with Hermes, add an OpenAI-compatible provider entry to the Hermes config (sketch — confirm the exact schema with hermes config):
From inside Hermes, /model dwarfstar:ds4/deepseek-v4-flash. Done.
Anyway, this gets us a generic token endpoint, so we’re free to plug in whatever on the front end.
The protagonists of this play have a lot of clout — antirez (Redis), mitsuhiko (Flask), badlogic (libGDX), and audreyt (Taiwan’s former Digital Minister, Pugs / Perl 6). Some kind of critical mass seems feasible for a certain type of nerd.
9 Memory management
We need to think about how much memory our machine has overall, and how much of that it will let us use for MLX workloads.
On the first point, TIL macOS’s “Memory Used” indicator does not measure how much RAM is committed in the way I assumed. It counts caching usage in some unproductive way. “Memory” — green / yellow / red in Activity Monitor, or Pages purgeable and Pages compressed from vm_stat — measures available RAM. macOS aggressively fills RAM with discardable file-cache pages. mactop is a handy resource monitor that doesn’t itself use too much memory.
On the second point, there are limits on how much any given process is allowed to take up of the precious system memory. macOS sets a hard limit on how much RAM Metal — and therefore MLX — is allowed to wire (lock into physically resident, GPU-accessible memory). The default is ~67% on Macs ≤36 GB and ~75% on larger ones. On a 128 GB Mac that means MLX refuses to allocate past ~96 GB, regardless of how much actually-free memory there is. Raise it at runtime:
This does not persist across reboots — we would need to wrap it in a LaunchDaemon or /etc/sysctl.conf entry to make it sticky.
Setting it to the full 128 GB is not wise. If MLX wires more than the OS can spare, the machine kernel-panics.
mlx_lm.server and (via its inheritance) vmlx-swift-lm — so Osaurus — call mx.set_wired_limit() on startup. The runtime defaults to 112 GB on 128 GB machines.
But also, before launching a big run:
sudo purgeflushes the file cache so the OS has clean room to allocate. Available RAM jumps; subsequent file I/O is slower until the cache refills.- Quit Electron apps. Slack, Discord, Cursor, VS Code, Chrome will routinely pin 4–8 GB each.
MLX_LM_CACHE_LIMIT=0(env var) prevents MLX’s internal allocation cache from growing unboundedly during long sessions — useful for sustained embedding or agent workloads.
10 Let’s break things
I came to understand the transformers/llama.cpp split by breaking it.
The Hugging Face and Ollama versions of mxbai-embed-large are nominally the same model — same upstream weights — but each stack implemented its own tokenizer. On plain prose the two agree mostly I think; on markdown they can disagree by a few percent on how many tokens a chunk takes. Best not to mix and match. For embeddings on this blog I went all-transformers. If I were keeping a live server and interacting with it, Ollama is probably wiser.
11 Excluding model dirs from background services
The model weights are enormous and waste space in backups. There is no point
- backing up a quantised
.ggufwe can pull again in two commands, nor - indexing
.safetensorsfiles for Spotlight — they are opaque binary blobs and Spotlight will spin happily for hours grinding nothing useful out of them.
Solution!
# One list, two background services to opt out of
model_dirs=(
~/.cache/huggingface
~/.cache/uv
~/.ollama/models
~/.lmstudio
"$HOME/Library/Application Support/Jan/data/llamacpp/models"
"$HOME/Library/Application Support/Jan/data/mlx/models"
~/MLXModels # Osaurus + MLX Studio default
)
# Time Machine — sticky exclusion keyed to the path string
for d in "${model_dirs[@]}"; do
[ -d "$d" ] && sudo tmutil addexclusion -p "$d"
done
# Spotlight — drop the Apple-documented marker file in each directory
for d in "${model_dirs[@]}"; do
[ -d "$d" ] && touch "$d/.metadata_never_index"
done
# Confirm a few
tmutil isexcluded ~/.cache/huggingface
ls -la ~/MLXModels/.metadata_never_index.metadata_never_index is the Apple-supported marker file that tells mds_stores to skip the directory and everything under it; the file is empty and the marker is the filename.
If we ever want to re-index a directory (model dir promoted to “actual content”), rm .metadata_never_index and mdimport -r <dir> puts it back.
12 See also
- antirez’s “DeepSeek-V4-Flash on a MacBook M5 Max” — the demo video for the DwarfStar section.
earendil-works/pi— the general-purpose agent harness used in the DwarfStar pi section.- SOV phases-apple — a log of my exploration of the agentic-coding-on-Apple-Silicon stack: model picks per RAM tier, LiteLLM routing, MCP, the whole thing.
- Front-end clients for AI image models — the image-generation analogue of this page.
- AI democratization — the community and politics around open weights, open data etc.
- Edge ML — the broader question of running ML on small hardware.
