Running LLMs locally on a Mac

Osaurus, Ollama, JANG, DwarfStar, and the transformers path

2026-05-23 — 2026-07-02

Wherein the Mac’s Metal Wired-Memory Limit Is Encountered, and a Survey of Competing Runtimes, Quantisation Formats, and Headless Server Daemons Is Conducted.

computers are awful

machine learning

neural nets

NLP

A twin post to front-end clients for AI image models, but for text. The local-LLM ecosystem on Macs is pretty luxurious, with a profusion of GUI options and Linux-y infra, and some specialised tooling that lags the community frontier but is not bad. Also, during the 2026 rmageddon, Macs suddenly look like remarkably good deals for high-RAM parallel-compute machines. I accidentally started going unreasonably deep and technical on this in the SOV repo. That repo really targets my coding assistant. Here is a human-facing version.

tl;dr LLMs are capable and useful on modern laptops. The trick is not to waste a month overthinking the damn thing and to just go, if you plan to harvest more value than you sink in tinkering.

1 The stack

I assume we are familiar with the following terms: model, quantization format, runtime / inference engine, server / daemon, harness / agent loop, frontend / chat client

The main Mac-centric tools as far as I’m concerned are:

Runtime: llama.cpp, MLX, mlx-lm, vmlx-swift-lm, PyTorch-on-MPS via transformers, antirez’s ds4.
Server: ollama serve, llama-server, mlx_lm.server, Osaurus, vllm-mlx, oMLX, ds4-server
Harness: Osaurus and Jan both have one built in; pi is a popular standalone one.
Frontend: Osaurus’s chat window, Jan, LM Studio, MLX Studio, Open WebUI, …

Many apps on this page are vertical bundles across several of those layers — Osaurus is frontend + harness + server + runtime in one, Ollama is server + runtime. Increasingly, clients ship their own miniature copy of llama.cpp as a bonus feature. I find these annoying as they tend to fight with one another and waste disk space/VRAM, but it is OK for intermittent/unserious use. Anyway, it pays to know which layer we’re looking at when an abstraction leaks.

1.1 Compute backends

The compute backend is the runtime that runs the matmuls — where on the chip the work actually happens. Three of them cover local text inference on the M-series, and which one a tool picks sets both its speed and how soon it handles a new model.

PyTorch + MPS (Metal Performance Shaders) is the baseline. Most ML code reaches Apple Silicon through PyTorch, so coverage of new architectures arrives first — it is the lingua franca. Our embedding code runs here; speed is decent, not amazing.
llama.cpp carries its own hand-written Metal kernels rather than going through MPS. It is the engine under Ollama and llama-server, fast and wide-coverage, and the one that eats GGUF.
MLX is Apple’s own array framework — faster still on the work it covers, less mainstream, and lagging by months on new architectures. Osaurus, mlx-lm, and the JANG stack all use it.

Not treated here because they are only relevant to image models: CoreML on the Neural Engine for the lowest-footprint path, and Draw Things’ custom Swift + Metal stack.

1.2 Storage backends

The storage backend is the on-disk format the weights ship in — subject to whatever the quantization (if any) of the weights. RAM is usually the main constraint on consumer hardware, which is why quantization is helpful. A 70B model at full fp16 is ~140 GB and simply will not fit on even a 128 GB Mac. A 4-bit build of that same 70B lands near 40 GB, which is tolerable. There are several formats in play.

safetensors — the Hugging Face baseline, full precision; what PyTorch + MPS loads when the model fits without help.
GGUF — the llama.cpp format. Not Apple-specific (it runs on CUDA and CPU too), it has the widest coverage and the finest quant ladder — down to the very-low-bit IQ2/IQ3 imatrix quants.
MLX (mlx-community) — the MLX-native format, Apple-only. Also quantized to fit but with added speed on the M-series; the quant range is coarser, and coverage tends to lag.
JANG — mixed-precision extension to MLX: per-tensor bit-widths instead of one width for the whole model.

Generally we prefer an MLX build for the speed when one is published; GGUF when no MLX port exists yet, and classic safetensors when neither is available.

The fixed width in MLX is indeed sus though — worth verifying that MLX models work OK. I suspect that JANG solves this better, but it is even more fringe.

Where weights live

Almost everything here pulls from Hugging Face, in true open-source anarchic style, they all stash weights in different places, so we rapidly end up with many copies of everything. There are three classes of storage AFAICS.

The org/repo resolvers — transformers, mlx-lm — share the one cache at ~/.cache/huggingface/hub. Name a repo, the associated weights land there once, and every cache-resolving tool reuses it. I think this gets us a long way.

The directory-scanning servers — Osaurus, oMLX, LM Studio, MLX Studio — are more chaos. Each wants a folder of model subdirectories and each has its own default location (~/MLXModels, oMLX’s --model-dir, ~/.lmstudio, ~/.mlxstudio/models) and each writes a fresh copy outside the HF cache. We can make huggingface do so too if we force it — hf download --local-dir <dir>/<name> Thus the same giant blob of neural networks can in up in 4 or more places, and be downloaded as many times. We can presumably make this better by picking one folder and point every such server at it: ~/MLXModelsseems ok to me and is Osaurus’s default, so we can tell them all to use that: omlx serve --model-dir ~/MLXModels does that. We can explicitly set it for Osaurus too with OSU_MODELS_DIR=~/MLXModels (but like I said, it’s the default anyway). oMLX will also reuse ~/.lmstudio directly if LM Studio is already our downloader.

Inside that folder, mirror each model’s org/repo path — hf download $repo --local-dir ~/MLXModels/$repo — because that is the layout Osaurus’s and LM Studio’s own downloaders use (OsaurusAI/…, mlx-community/…). Hand-pulls then land beside theirs instead of as flat one-offs the next tool fails to recognise and re-downloads, and two orgs with the same model name stop colliding. The servers recurse into the org subdirs by themselves, so there is nothing to flatten.

Ollama is a whole ’nother thing: it has its own registry, own blob store, own model names — ollama pull qwen3 pulls a new ollama Qwen3, not whichever one exists on HF. ollama run hf.co/<org>/<repo> pulls from HF but then repacks it into Ollama’s store — a copy, naturally.

2 Desktop apps

The fastest path from zero to local LLM is a desktop app: one download gives us a model browser, a chat window, and an inference engine. Each is a vertical bundle — a frontend, its own runtime, usually with a server and an agent harness folded in. None of them are wholly satisfactory IMO but they are all pretty usable. If you can avoid trying to squeeze extra performance out, probably any of them will do.

Sometimes I want the server without bells and whistles and graphics instead.

2.1 Osaurus

Osaurus (MIT, brew install --cask osaurus, osaurus-ai/osaurus) is Swift-native, no Electron, no Python, and behaves like a proper Mac app.

I’m YOLOing all in on this because it seems efficient and easy. It locks me in to the Mac ecosystem so might not be for everyone. Also, it’s run by one person, so the bus factor is 1, which is a very small number. But — it’s so good!

The window has a model picker, a chat pane, a status indicator; the inference engine underneath is Apple’s fast MLX, so it gets many tokens per second. It is also the intended runtime for a custom mixed-precision MLX quantization format called JANG.

Osaurus is not just a chat client but a full native macOS agent harness. It supports various hip features like persistent memory, sandboxed working folders in an isolated Linux VM via Apple’s Containerization framework… It understands agentskills.io-format skills (and whole Claude plugins) from GitHub or local files, selecting them by RAG at runtime, and speaks MCP in both directions, server and client. The harness layer is model-agnostic, fronting cloud providers as happily as the local MLX-ish runtime.

There is no CLI download command; the in-app Model Manager (⌘⇧M → Models) browses a curated catalogue of models, especially JANG ones, and will sideload others too — though not all of them work equally well. Nemotron 3 Nano Omni 30B A3B JANGTQ4 seems like a reliable workaday default. Osaurus also discovers anything dropped into its models directory:

hf download mlx-community/DeepSeek-R1-0528-Qwen3-8B-4bit \
  --local-dir ~/MLXModels/mlx-community/DeepSeek-R1-0528-Qwen3-8B-4bit   # mirror the org/repo path
osaurus list    # confirms discovery, and gives the exact API name for `osaurus run`

Osaurus is also a first-class server, covered below.

2.2 Jan

Jan (brew install --cask jan) is a FOSS cross-platform option — a full frontend + harness + server + runtime bundle. The UI is built on the mildly cursed Electron, but the plus side is that it runs on Linux, Windows, and macOS. It supports both llama.cpp (via Cortex) and MLX backends, so for brute-force running non-Apple-optimised models it seems to go.

It looks nice, and the Projects / Assistants / Agents / MCP Connectors quartet gives it a tool-calling agent loop — connect MCP servers under Settings, and Agents mode runs multi-step autonomous workflows (Jan v2 VL is pitched as a 49-step multimodal agent). Jan Server is the self-hosted orchestration variant.

2.3 LM Studio

LM Studio (brew install --cask lm-studio) is closed-source, relatively slick and turnkey, and not free for commercial use. It runs both llama.cpp and lately its own MIT-licensed mlx-engine (mlx-lm + Outlines + mlx-vlm). Like the others it is a bundle — frontend + runtime + server — with an OpenAI endpoint it can expose headlessly (lms server). I’m mildly sceptical of it because so many Cool LLM Technologies ship special bug fixes or alternate install paths for LM Studio, which hints at a slightly non-standard stack — though that might just be sampling bias, since more people file bug reports when more people run the thing. Also the licence sucks.

These three are general-purpose chat apps. To drive a local model as a coding agent instead — terminal harnesses like OpenCode or Aider, VS Code sidebars like Cline — see Code agents and assistants, which points back here for the local backend they run against.

3 Serving a model headless

Once we want a model serving as a daemon (“token fountain” we say at my work) rather than a chat window — a code editor, an embedding pipeline, a script that calls out to a local model — we want a long-lived process with an OpenAI-compatible API. The desktop apps above mostly do this already; below are the headless-first stacks, plus Osaurus’s daemon face.

These stacks differ in how they handle the model lifecycle: how many models stay resident at once, and what switching between them costs.

3.1 Osaurus as a server

If we already have Osaurus running we are mostly done: it is already that daemon, exposing OpenAI-, Anthropic-, and Ollama-compatible endpoints on localhost:1337 all at once (some deep magic). Anything we want to point at a local model can talk to it. But also, for reasons of not installing an idiosyncratic stack, or superior customisation, we might want to install the standard Linux server stack.

To keep two models resident at once (say the agentic daily-driver plus the maths model), set Settings → Local Inference → Model Management to Flexible — under the default Strict policy, loading one evicts the other.

Context length is automatic — Osaurus picks a sane per-model default and does not expose it as a plain setting, so the actual ceiling is hard to read off (the cheat-sheet has the detail).

To drive all this from the terminal there is one gotcha: the osaurus command is embedded in the app bundle, and only the Homebrew install links it onto PATH automatically. If it is missing, symlink it: ln -sf “/Applications/osaurus.app/Contents/Helpers/osaurus” “$(brew --prefix)/bin/osaurus” or use the special button from the settings menu¹ The CLI supports stuff like osaurus serve / stop / status / list / run <model> / mcp, plus a plugin manager.

3.2 Ollama

Ollama (brew install ollama) is a llama.cpp wrapper with its own model registry — fast enough, wide model coverage, and notably good for embedding models:

brew services start ollama
ollama pull qwen3.5  # LLM/chats etc
ollama pull mxbai-embed-large    # also handles embeddings

Anything OpenAI-API-compatible can now point at http://localhost:11434/v1. Ollama is the hands-off one on lifecycle: it loads a model on first request, keeps several resident at once (up to OLLAMA_MAX_LOADED_MODELS, default 3), and unloads each after OLLAMA_KEEP_ALIVE of idleness (default 5m). Left unbounded the pool tanks the machine — OLLAMA_MAX_LOADED_MODELS=1 forces evict-on-switch, OLLAMA_KEEP_ALIVE=0 drops a model the moment it idles (or 15m to keep it warm longer). Context window is num_ctx: set it per request (options.num_ctx), bake it into a Modelfile (PARAMETER num_ctx), or lean on the OLLAMA_CONTEXT_LENGTH default — which on recent Ollama auto-scales to VRAM (4k / 32k / 256k) rather than the old fixed 2048.

Gotchas:

The .gguf files come from Ollama’s registry, not Hugging Face.
Some weird reimplementation headaches — e.g. the tokenizer baked into the GGUF can differ from the original for unclear reasons.
The “llama.cpp wrapper” framing is loosening: the registry now ships -mlx tags for some models (e.g. qwen3.5:35b-mlx).

3.3 mlx-lm and mlx-vlm

mlx-lm is Apple’s reference language-model runtime on MLX. mlx-vlm is its sibling package: same MLX backend, same mlx-community/<repo> weights and HF cache, but for VLMs (“vision-language models”) and omni models — image/video/audio in, text out — instead of pure text LLMs. Where this page says “VLM” it means a model in that family; mlx-vlm is what runs one locally, e.g. DeepSeek-OCR. Osaurus and JANG are inspired by MLX-type Apple-Silicon-friendly execution, but mlx-lm is the OG.

uv tool install mlx-lm drops a family of commands onto PATH, all reading the same weights:

mlx_lm.generate --model mlx-community/<repo> — one-shot completion from the CLI.
mlx_lm.chat — an interactive REPL in the terminal.
mlx_lm.server --model mlx-community/<repo> — an OpenAI-compatible daemon; holds one model, swapping on demand per request (evict + reload, not a restart). Two live at once means two processes on two ports.
mlx_lm.lora — its LoRA fine-tuning path.

mlx-vlm mirrors this shape (mlx_vlm.generate, mlx_vlm.chat, mlx_vlm.server) but takes an image/video/audio argument alongside the text prompt.

It uses the standard HuggingFace links: mlx-community/<repo> resolves straight to Hugging Face, and the weights land in the shared HF cache (~/.cache/huggingface/hub).

mlx_lm.server has no --ctx flag, so it grows the KV cache to fit whatever we send — up to, presumably, the model’s declared max, capped only by RAM, at which point it presumably kernel-panics the machine. Cap the context in the harness (limit.context / contextWindow) and mind the memory budget.

A reason to keep this around even with Osaurus installed is that Osaurus is great when it runs, but its Swift engine’s coverage lags the Python MLX options. A plain mlx-lm loads interesting MLX conversions that Osaurus can’t (e.g. Cascade-2).

3.4 vllm-mlx

vllm-mlx (uv tool install vllm-mlx, Apache 2.0) is a vLLM-style inference server for Apple Silicon. Active and popular by this page’s standards — 1,300+ stars. Core pitch: continuous batching, paged KV cache with prefix sharing, an SSD-tiered cache for spilling prefixes to disk, and both OpenAI (/v1/*) and Anthropic (/v1/messages) endpoints from one process. Many fancy bonus features: native TTS (Kokoro, Chatterbox, VibeVoice, VoxCPM) and STT alongside text/image/video/audio chat, speculative decoding (--mtp) for Qwen3-Next, MoE expert-count reduction (--moe-top-k, a claimed 7–16% speedup on Qwen3-30B-A3B), Prometheus metrics (--metrics), and a built-in benchmarker (vllm-mlx bench-serve). Supports multi-model residency via a --models-config models.yaml registry: named models behind one process, lazy load on first use, LRU eviction under a memory_budget_gb, and a contention_policy (fail / wait / preempt / wait_then_fail / wait_then_preempt) for what happens when a request needs a model that does not currently fit; clients pick one via the normal OpenAI model field.

uv tool install vllm-mlx
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --continuous-batching

For a chat window over this endpoint — one that renders the equations the maths models emit — Open WebUI points at it unmodified: add http://localhost:8000/v1 as an OpenAI connection and skip the bundled-Ollama path the tutorials assume.

3.4.1 Serving from `~/MLXModels`

vllm-mlx has no folder-scanning flag. A bare local path loads one model the same way a bare mlx-community/<repo> id does:

vllm-mlx serve ~/MLXModels/mlx-community/Qwen3.6-35B-A3B-4bit --port 8000 --continuous-batching

Multi-model residency is via a --models-config models.yaml registry: named entries, each with an explicit path:.

manager:
  memory_budget_gb: 100
  contention_policy:
    strategy: wait_then_preempt
    wait_timeout_s: 45
    preempt_after_s: 15

models:
  - name: driver
    path: /Users/dan/MLXModels/mlx-community/Qwen3.6-35B-A3B-4bit
    continuous_batching: true
    estimated_memory_gb: 22

  - name: solver
    path: /Users/dan/MLXModels/mlx-community/VibeThinker-3B-8bit
    preload: true
    estimated_memory_gb: 3

We can run the server like so:

vllm-mlx serve --models-config ~/.config/vllm-mlx/models.yaml \
  --port 8000 \
  --gpu-memory-utilization 0.88 \      # hard process ceiling ≈112 GB (0.88 × 128)
  --continuous-batching \
  --use-paged-cache \
  --cache-memory-mb 30720 \
  --max-cache-blocks 16384 \
  --max-num-seqs 16 \
  --max-tokens 131072 \
  --max-request-tokens 131072 \
  --enable-auto-tool-choice \
  --tool-call-parser auto \
  --reasoning-parser qwen3 \
  --enable-metrics \
  --kv-cache-quantization \
  --kv-cache-quantization-bits 8 \
  --timeout 1200

This is a chunky-boi config, allocating a hundred gigs of memory to get large context windows, and setting long timeouts so I can flood the server without too much guilt.

Notes

path: needs to be a real filesystem path — YAML does not ~-expand.
estimated_memory_gb is mandatory on a bare HF id — the manager needs some number to make eviction decisions from — but optional on a local path with real weight files on disk, since those can be measured directly.
--cache-memory-mb 30720
--tool-call-parser auto: mixed Nemotron/Qwen registry needs ‘auto’
--max-num-seqs: concurrency cap (default 256 would explode KV)
--max-cache-blocks 16384: 30 GB KV pool (was 80 — the OOM risk)
--kv-cache-quantization-bits 8/--kv-cache-quantization: downsample KV cache for longer prompts without exploding RAM
--tool-call-parser auto: models have different tool-call formats, so we try each one per response

Clients pick a model with the normal OpenAI model field (model: "driver", model: "solver"), the same pattern as oMLX’s pinned pair.

One multimodal gotcha: a VLM or omni model needs a mllm: true on its registry entry to load (the standalone single-model serve equivalent is the global --mllm flag). vllm-mlx guesses multimodality from the repo name — VL, vision, llava and friends — but -Omni- slips through, so a model it reads as text-only dies at weight-load on the vision and audio towers it has no slots for (Received N parameters not in model). mllm: true routes that one entry through mlx-vlm instead, without forcing the text models in the same registry down the same path. The weights also have to carry a config the mlx-vlm loader recognizes: the mlx-community Nemotron-3 Nano Omni builds load, but an Osaurus repackaging of the same model that hides the multimodal config in a side-file does not.

Memory handling is messy. The registry manager decides which models to keep resident from — manager.memory_budget_gb — which counts model weights only and is configured in the YAML, with no command-line override. It never looks at the ceiling we set on the command line (--gpu-memory-utilization, --cache-memory-mb); it evicts against its own budget and trusts that whatever it keeps resident will fit. So if the budget is higher than the ceiling can hold once the KV cache and activations are counted, the manager keeps two models resident because its arithmetic says they fit, and then MLX hits the ceiling and the process dies — a hard out-of-memory crash instead of a graceful eviction. So for now memory_budget_gb ≤ gpu-memory-utilization × RAM − cache-memory-mb − headroom, and create a different config file for each memory allocation I guess? I filed a bug report about that.

3.5 oMLX

oMLX (jundot/omlx, Apache 2.0, brew tap jundot/omlx https://github.com/jundot/omlx && brew install omlx) is vllm-mlx forked and grown a different frontend, so it inherits that feature set — continuous batching, multi-model residency, both OpenAI and Anthropic endpoints. Three things set it apart:

Its SSD prefix cache is automatic and block-addressed — hot blocks in RAM, cold blocks spilled to disk, longest-prefix matched and surviving a restart — where llama-server’s --slot-save-path and ds4’s --kv-disk-dir are manual slot-save knobs. Aimed at agentic coding, where the pitch is TTFT dropping from 30–90s to under 5s on the second turn of a long context.
An explicit Claude Code accommodation: it rescales reported token counts so auto-compact fires at the right time, and holds the connection open with SSE keep-alives through a long prefill. The frontend is a signed SwiftUI menu-bar app (not Electron) with a web admin panel, and it reuses an existing LM Studio model directory.
It is the one non-Osaurus server with merged JANG support, so it can load the mixed-precision JANG quants that make sub-4-bit MoE models behave — otherwise Osaurus-only.

omlx serve --model-dir ~/MLXModels \
  --paged-ssd-cache-dir ~/.omlx/cache \
  --hot-cache-max-size 16GB \
  --max-concurrent-requests 8

Same caveats as the rest of the page (bus factor 1, MLX-only, benchmarks from an M3 Ultra 512GB), but the automatic SSD KV cache and clean lineage make it worth a run as the headless daily-driver if the Osaurus/mlx_lm.server pair leaves us wanting persistent prefix reuse.

3.5.1 Example multi-model setup

The mathematical fan-out setup wants both models live at once on one endpoint — a solver sampled wide for maj@k, orchestrated by an agentic driver. oMLX does this in a single process; the config is a model directory, a memory ceiling, and a pin per model.

Drop the weights into the shared MLX dir — ~/MLXModels, the same tree Osaurus scans, so one download serves both — mirroring each repo’s org/name path the way the GUI downloaders do:

hf download gabfssilva/VibeThinker-3B-MLX-BF16  --local-dir ~/MLXModels/gabfssilva/VibeThinker-3B-MLX-BF16  # solver, hi-fi bf16
hf download mlx-community/VibeThinker-3B-8bit   --local-dir ~/MLXModels/mlx-community/VibeThinker-3B-8bit   # solver, near-lossless and faster
hf download mlx-community/Qwen3.6-35B-A3B-4bit  --local-dir ~/MLXModels/mlx-community/Qwen3.6-35B-A3B-4bit  # the driver

Launch with a memory guard sized to hold the driver, the solver, and the fan-out’s KV all at once, and the concurrency raised to the $k$ samples we mean to run:

omlx serve --model-dir ~/MLXModels \
  --memory-guard-gb 110 \          # 128 GB Mac: driver (~22 GB) + solver (~3 GB) + KV headroom
  --max-concurrent-requests 16     # the fan-out width; default is 8

In the admin panel, pin both the driver and the solver so LRU does not evict one while the other is mid-job, and set their sampling as per-model profiles — vibethinker:solve at temp 1.0 / top-p 0.95 / top-k 0, the driver at its own recipe (Qwen3.6 thinking temp 0.6 for coding). The fan-out then POSTs model=vibethinker:solve $k$ times while the loop drives model=qwen3.6 — same port 8000, both resident, no reload between them.

On the two VibeThinker builds, both load and serve fine. The 8-bit decodes ~80% faster than bf16 but the bf16 might be worth it for tiebreaker votes.

3.6 llama.cpp / llama-server

llama.cpp ships its own server (brew install llama.cpp): llama-server -m model.gguf exposes an OpenAI-compatible endpoint with no daemon, no registry, no opinions.

Ollama wraps this same engine. However, it is more configurable, exposing llama.cpp flags that Ollama does not, notably: YaRN context extension on a GGUF (which Ollama cannot do at all), the finer KV-cache quant ladder, speculative decoding with a draft model, and per-slot KV save/restore to disk.

It loads a .gguf from disk — no ollama pull into a separate store, no background service — which suits scripted or reproducible runs.

It loads the one model named at launch, though a newer router mode (start it with no -m) does dynamic multi-model load and unload. Context is the -c / --ctx-size flag, defaulting to 0 (the model’s full trained window). Other useful config options can be found in the cheat-sheet.

3.7 Sampling defaults

Sampling — temperature, top-p, top-k, and the output-token budget — is a decode-time choice set on each request. In particular the server does not use the model’s recommended sampling settings automatically.

mlx_lm.server, for example, defaults to temperature=0.0 (greedy), top_p=1.0, top_k=0, and max_tokens=512, and it does not read the model’s generation_config.json. So a reasoning model’s advised settings — VibeThinker wants temperature 1.0 / top-p 0.95 and a 64K-plus output budget — need the client to specify them; each model table below carries a Recommended sampling column with the per-model picks. Notably temperature=0.0 breaks maj@k voting, since every sample comes back identical.

Where to pin the values depends on which layer issues the prompt.

mlx_lm.generate — flags per call: --temp 1.0 --top-p 0.95 --max-tokens 40000 (--top-k already defaults to 0). Wrap it in an alias.
mlx_lm.server — launch-time defaults via --temp / --top-p / --top-k / --min-p, overridden per request; Osaurus keeps the same defaults in its app settings, also overridden per request.
transformers — a GenerationConfig passed at generate() time.
Ollama — is an exception, baking in a per-model default through a Modelfile: PARAMETER temperature 1.0, PARAMETER top_p 0.95, PARAMETER num_predict 40000 for the output budget, PARAMETER num_ctx 65536 for the context window.

Output budget and context window are separate limits: the first caps how much the model may emit, the second how much prompt-plus-output the KV cache holds. A long-reasoning model needs both raised, or it truncates mid-derivation — and Ollama in particular drops the overflow silently once num_ctx is exceeded.

3.8 Prefix caching

The KV cache is the only state these servers keep between requests — there is no session object, we resend the whole history each call. What they reuse is the prefix: the shared start of the conversation (system prompt, tool definitions, history so far) keeps its KV, so an agentic loop prefills only the new suffix. Matching is content-addressed — on the tokens, not a session ID — so it happens by itself. It is likely not optimal in general.

When planning around this we need to be aware that trimming the front of the history will break this caching and require everything to be re-computed.

Different servers clear this cache at different times:

mlx_lm.server — an in-memory LRU, longest-prefix matched, reporting cached_tokens in the usage block; bounded by --prompt-cache-size / --prompt-cache-bytes, held for the process.
llama-server — --cache-prompt is on by default (one KV per slot; --cache-reuse even salvages chunks after a mid-prompt edit), and slots can be saved to disk.
Ollama — the same llama.cpp reuse, alive as long as the model stays loaded (OLLAMA_KEEP_ALIVE).
Osaurus — automatic; headless under osaurus serve the cache lives for the server process (governed by the Strict/Flexible policy), and in the GUI it is per chat window, warmed the moment one opens.

Prefill is chunked and continuously batched besides, so concurrent requests interleave rather than queue — but the prefix skip is the bigger win.

3.9 Stretching the context window

A model’s positional encoding is trained out to some fixed length, and that trained length is the context window the weights actually know. Some fancy huge numbers that get advertised are that window stretched at load time, not a property of the weights. Qwen3.6, for example, trains its rotary positions to 256K (262,144 tokens); the 1M figure quoted for it is that same window extended ~4×, and getting the extension costs us something in both quality and RAM.

The mechanism is RoPE interpolation. RoPE encodes each token’s position as a rotation; interpolation rescales those rotations so a position past the trained window maps back into the range the model saw during training, instead of falling off the end into rotations it has never seen. The common variant is YaRN (“yet another RoPE extension”), which rescales per frequency rather than uniformly. A factor of 2.0 takes Qwen3.6’s 256K to ~512K, 4.0 to ~1M.

Every mainstream implementation AFAICT applies the rescaling statically, fixing it at load and applying it to every prompt regardless of length. A model loaded with factor 4.0 rescales a 2K-token prompt exactly as hard as a 900K one, and short prompts lose some accuracy for a long window they aren’t using. So we switch YaRN on only when we want the long window, and set factor to the longest context we actually expect rather than the largest the model will accept.

How to turn it on depends on the runtime.

transformers and mlx-lm read a rope_scaling block straight from the config.json that ships in the model’s own directory:

"rope_scaling": {
  "rope_type": "yarn",
  "factor": 4.0,
  "original_max_position_embeddings": 262144
}

llama.cpp / llama-server take flags: --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 262144, alongside the usual -c.
Ollama exposes no YaRN settings. We inherit whatever the person who converted the GGUF baked into its metadata.
Osaurus’s Swift engine has the YaRN code, but its Qwen3 and Llama wrappers do not route through it, so it is unavailable for now.

YaRN increases the maximum context window, but does not shrink the KV cache cost of storing that context, so we are still RAM-constrained.

3.10 Configuring each server

All the settings in one place.

Server	Context cap	KV-cache quant	Prefix cache	Extend context (YaRN)	Sampling defaults
`llama-server`	`-c N` (`-c 0` = model max)	`--cache-type-k/v q8_0` (also `q4_0`, `q5_0`, `iq4_nl`)	on by default; `--cache-reuse N` after edits, `--slot-save-path` to disk	`--rope-scaling yarn --rope-scale N --yarn-orig-ctx N`	CLI (`--temp`, `--top-k`, …) + per request
Ollama	`num_ctx` / `OLLAMA_CONTEXT_LENGTH` (auto 4k/32k/256k by VRAM)	`OLLAMA_KV_CACHE_TYPE=q8_0` (needs flash attn)	automatic, lives while loaded (`OLLAMA_KEEP_ALIVE`)	none — inherits the GGUF	Modelfile `PARAMETER` + per request
`mlx_lm.server`	none — grows to RAM, cap in the harness	none (only `mlx_lm.generate --kv-bits`)	automatic (`--prompt-cache-size` / `--prompt-cache-bytes`)	`config.json` only	CLI (`--temp`, `--top-p`, …) + per request
Osaurus	auto per-model (no global setting)	none exposed (vmlx defaults)	automatic	wrappers don’t route it (above)	Settings default + per request
oMLX	cap in the harness; memory guard via `--memory-guard-gb`	none exposed	persistent two-tier — RAM hot + SSD cold (`--paged-ssd-cache-dir`), survives restart	`config.json` only	admin panel per-model + per request
`ds4-server`	`--ctx N`; output cap via the API	fixed by the model variant, not settable	in-memory reuse, durable via `--kv-disk-dir`	n/a (single model)	per request

Flash attention has no column because it has stopped being a per-server decision: llama.cpp defaults -fa to auto and enables it wherever Metal supports it, Ollama switches it on per-architecture for the families here (Qwen3.x, Nemotron, Gemma, gpt-oss), and the MLX servers and ds4 always run a fused attention kernel. The one place it still needs a hand is Ollama’s KV-cache quant, which only takes effect with OLLAMA_FLASH_ATTENTION=1 set alongside it.

Sensible headless starting points:

# 64K context (below): a useful bound well under the trained max — raise or lower for your RAM
# llama.cpp — flash attention is automatic; quantize the cache, choose a context
llama-server -m model.gguf -c 65536 --cache-type-k q8_0 --cache-type-v q8_0

# Ollama — env vars; cache quant needs flash attention enabled
OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 OLLAMA_CONTEXT_LENGTH=65536 ollama serve

# mlx-lm — no cache-quant/context flag; set the model's sampling values, cap context in the harness
mlx_lm.server --model mlx-community/<repo> --temp 0.6 --top-p 0.95 --top-k 20

# oMLX — point it at the shared MLX tree; turn on the SSD cold tier and a memory ceiling
omlx serve --model-dir ~/MLXModels --paged-ssd-cache-dir ~/.omlx/cache --memory-guard-gb 96

The --paged-ssd-cache-dir on that last line persists across restarts and is rewritten every session, so it is the one path here that wants keeping out of backups and Spotlight.

Osaurus is tuned in its Settings pane rather than on a command line, and ds4’s full flag set is in its own section.

4 Programmatic access via `transformers`

When we want to do things to a model — embed text, fine-tune, run interpretability tools, sample from internal layers, anything that touches the model internals — we drop down to Hugging Face transformers in our own Python process; no server process, no HTTP API, just direct access to the calculations.

Embeddings for search are why I currently do this. For this I want sentence-transformers (uv pip install sentence-transformers — pulls torch with it), a thin wrapper around transformers that exposes the embedding API:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
embs = model.encode(texts, normalize_embeddings=True, convert_to_numpy=True)

The first call downloads the weights, the tokenizer config (tokenizer.json), and the model config from huggingface.co into ~/.cache/huggingface/hub/. Inference runs in our process via PyTorch. Tokenization runs in the same process, via HF’s Rust tokenizers library reading the same tokenizer.json the model was published with. One process, one library, one set of files.

On Apple Silicon we get a 15× speedup over fp32 by switching to float16 on MPS, with indistinguishable quality:

import torch
gpu = torch.cuda.is_available() or torch.backends.mps.is_available()
kwargs = {"model_kwargs": {"dtype": torch.float16}} if gpu else {}
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1", **kwargs)

For text generation (rather than embedding) the equivalent is AutoModelForCausalLM.from_pretrained(...). PyTorch is rarely the fastest path on Apple Silicon — llama.cpp and MLX usually win on tokens-per-second — but it is the path that lets us see what the model is doing without arsing around. Activations, attention patterns, hidden states, custom sampling etc. are all possible from the Python prompt.

5 Fun models

Once we have a stack running, the next question is what to pull through it. A non-exhaustive list of picks I have been playing with are below.

5.1 For mathematical reasoning

Two kinds here, split by type from the theory page. Generalists reason about maths alongside chat and tool-use; solvers trade that away for raw maths and lean on specialised harnesses — tool-integrated reasoning, maj@k voting, cloud fan-out.

Model	Type	Size	Sampling	Run via	Why
DeepSeek-R1-0528-Qwen3-8B	generalist	~5 GB	temp 0.6 / top-p 0.95, ≥64K out	MLX / Ollama GGUF	Small-model maths champion — AIME-2024 86%, matching Qwen3-235B-thinking on that benchmark. The one to beat.
Phi-4-Reasoning-Plus-14B	generalist	~8 GB	temp 0.8 / top-p 0.95 / top-k 50; wants a ChatML system prompt	MLX / GGUF	A different reasoning-trace style for triangulating DeepSeek — not a stronger model.
Nemotron-Cascade-2-30B-A3B	generalist	~24 GB	temp 1.0 / top-p 0.95	Ollama	NVIDIA’s IMO-2025-gold model — a Mamba (SSM) + MoE hybrid for linear context scaling. Doesn’t work in Osaurus.
OpenMath-Nemotron-14B	solver	~8 GB	temp 0.6 / top-p 0.95, sample	MLX / GGUF	The sweet-spot solver — ~the AIMO-2-winning 32B’s score at half the RAM; tool mode wants NeMo-Skills.
Qwen2.5-Math-72B-Instruct	solver	~40 GB	greedy (`do_sample=false`)	MLX 8-bit / GGUF	Push-button tool mode — Qwen-Agent drives its code loop, no extra infra.
Skywork-OR1-Math-7B	solver	7B	temp 0.6 / top-p 1.0, 32K out	GGUF	Best small pure-reasoning solver — AIME-2024 69.8 at 7B, DeepSeek-R1-based.
VibeThinker-3B	solver	~3 GB 8-bit / ~6 GB bf16	temp 1.0 / top-p 0.95, 64K out (→100K hard)	MLX 8-bit (fan-out default) or bf16	Weibo’s 3B verifiable-reasoning solver (MIT) — strong olympiad scores but solver-only, thin world knowledge.

These pure-CoT solvers carry no tools, so the tidy way to fold one into a larger workflow is as a solve() oracle a general agent dispatches to, not a bespoke chat window: serve it on a batching endpoint like oMLX that keeps it resident beside the agentic driver, and let the driver call it when it meets a sub-problem it should not attempt itself.

5.2 For agentic flows

These are the weights a coding harness needs:

Model	Role	Size	Sampling	Run via	Why
Hermes-4-14B (Nous)	function-calling	~8 GB	temp 0.6 / top-p 0.95 / top-k 20	MLX / GGUF	Nous’s small tier (Qwen3-14B post-train); function-calling is the design centre.
Hermes 4.3-Seed-36B	daily-driver	~20 GB	temp 0.6 / top-p 0.95 / top-k 20; Llama template	Ollama / MLX	Matches the 70B at half the size; steerable, 32k ctx.
Hermes-4-70B	heavyweight	~40 GB	temp 0.6 / top-p 0.95 / top-k 20	MLX / GGUF	Marginally stronger than the 36B at twice the memory — run it only to check the size earns its cost.
Qwen3.6-35B-A3B	daily-driver	~22 GB	thinking temp 1.0 / top-p 0.95 / top-k 20 (0.6 for coding); non-thinking 0.7 / top-p 0.8; never greedy	Osaurus one-click	3B-active MoE, 256K ctx (1M via YaRN), vision; MTP for faster decode (new on Mac — disable if output loops).
Nemotron-3-Nano-Omni-30B-A3B	multimodal	~20 GB	temp 0.6 / top-p 0.95	Osaurus (JANGTQ4), or any MLX runtime (MXFP4)	NVIDIA’s omni all-rounder — hybrid Mamba+MoE, native text/image/audio/video, 256K ctx; tools + maths in a compact bundle..
Qwen3.5-35B-A3B	fallback	~23 GB	same recipe as 3.6; never greedy	Osaurus / Ollama	Same MoE shape, more battle-tested, no MTP.

5.3 Diffusion models

Everything above is autoregressive. Diffusion LLMs are a thing though. Do any run locally? I know of one: DiffusionGemma — Google’s experimental Gemma 4 variant, which denoises a whole 256-token canvas in parallel instead of emitting tokens left to right. OsaurusAI/diffusiongemma-26B-A4B-it-MXFP8 runs natively in Osaurus through its vmlx-swift block-diffusion engine, ~26 GB on disk and ~24 GB resident.

Manage expectations on speed. Docs report 28–42 tok/s at 48 denoising steps: Osaurus defaults to 16, roughly twice as fast as the bundle default and still coherent, and the output falls apart below 12. Quality trails plain Gemma 4 as well. Vision, tool-calling, and a reasoning channel all work in this checkpoint; audio and video do not.

So it is not the agentic daily-driver. I am interested in the interaction model though: bi-directional attention over the canvas makes it potentially useful for infilling and structure-preserving rewrite, which is a different way to drive a coding tool than streaming tokens into a chat box.

6 The JANG ecosystem

Osaurus wraps osaurus-ai/vmlx-swift-lm, a Swift MLX inference engine. That engine is a Swift port of jjang-ai/vmlx, a Python engine. Both load weights from the JANGQ-AI Hugging Face org, a zoo of mixed-precision quantised models in a custom format called JANG — converted with JANG Studio, a native macOS wizard, with the newer codebook variant branded JANGTQ (“JANG TurboQuant”). The same person wrote each of those — Jinho “Eric” Jang (Irvine, California; also Osaurus’s lead/only engineer). There is a parallel desktop app, MLX Studio, by the same author, running the Python engine and surfacing more experimental features (image generation, agentic tool calling, in-app model conversion). The Jang family of enterprises is a tightly integrated stack: runtime, quant format, model zoo, two GUIs — one developer. That vertical integration buys fast iteration and a coherent feature set across the chain. The downside is that if Jang loses interest, switches jobs, or gets hit by a bus, the lot — JANG quants and JANG-format model files included — becomes abandonware. There is some community wariness about this; see the r/LocalLLaMA “Is MLX Studio legit?” thread. It is all open source, so in principle we could maintain it ourselves if he walks away.

Jang is, on the visible evidence, a talented coder. His public GitHub profile records ~4,000 contributions in the last year, which is a lot even for AI-assisted coding, assuming it actually works (which it does, mostly). His JANG repo lists a steady stream of model-architecture support landing within days of each new release. On Apple Silicon at the high end he is doing things nobody else is doing.

6.1 Tech stack

JANG (“Jang Adaptive N-bit Grading”) is mixed-precision quantisation for MLX. Standard MLX quantisation compresses every tensor to the same bit width. JANG classifies tensors by sensitivity — attention and MoE router layers (small share of params, large share of model behaviour) get 6–8 bits; expert MLPs get 2–4. The hybrid network is a mildly extended version of the standard MLX safetensors format with a per-tensor bit-width manifest. At the same total size, accuracy improves, notionally. The pitch is “GGUF for MLX”, which … sounds good? I’m not really competent to judge. Apparently llama.cpp’s K-quants do something similar.

The other part of the GGUF quality story is the calibration data fed in at quantisation time — which is why a bartowski/…-GGUF repo (like the Nemotron one) is a slightly different, usually better thing than a bare K-quant of the same weights. I am unsure if JANG does this.

jangq.ai claims impressive performance, regularly beating models with a larger footprint. At least one third-party benchmarker is impressed.

Osaurus is JANG native. Pull a model from JANGQ-AI and it loads — usually. The Swift engine’s coverage tracks the JANGTQ path; a plain JANG_* quant of an exotic architecture can fail at weight-load (notably Cascade-2 doesn’t work — the Python jang-tools stack handles those, the Swift engine does not yet). Elsewhere, support is partial: MLX Studio, vMLX, and oMLX all load JANG natively; LM Studio / Ollama / Jan not yet. From Python: uv pip install “jang[mlx]”, then jang_tools.loader.load_jang_model(...).

6.2 MLX Studio

MLX Studio is the JANG/Osaurus author’s other Mac desktop app — Electron + Python rather than Swift, broader feature surface (image generation via Flux and Z-Image, ~26 built-in agentic tools, in-app GGUF→MLX and MLX→JANG conversion, an Anthropic-compatible API). Install via the signed DMG on the releases page, or engine-only with uv tool install vmlx and vmlx serve mlx-community/<repo> (OpenAI-compatible on localhost:8000).

7 Antirez and DwarfStar

There is another weird Mac-only stack of interest to me: Salvatore Sanfilippo — antirez, the author of Redis — wrote some custom Apple Silicon inference code to run DeepSeek V4 Flash on a 128 GB MacBook, and a whole tiny supergroup of famed developers has grown up around it.

The approximate trajectory is as follows. April 2026: apparently moments after the DeepSeek V4 release, antirez drops antirez/llama.cpp-deepseek-v4-flash, a fork of llama.cpp with 2-bit quantisation, plus the matching GGUF at antirez/deepseek-v4-gguf.

A month later, he drops a from-scratch native Metal inference engine, ds4 (DwarfStar 4 to its friends) narrowly targeting DeepSeek V4 Flash and, I guess, a narrow family of derivatives. Targets M3 Max, M3 Ultra, and M5 Max specifically. Reported numbers are pretty snappy — ~14–15 tok/s decode at 62K context on an M3 Max 128 GB, ~450 tok/s prompt-processing on an M5 Max for a 10k-token codebase.

Like JANG, this is a small, specialised stack run by one person; except there is an influential community.

7.1 Running DwarfStar via the `pi` stack

The default harness for ds4 seems to be: pi, an MIT-licensed agent harness by Mario Zechner (badlogic, of libGDX fame) — itself a strong offline coding agent once a model is behind it. There is an easy install via the pi extension by Armin Ronacher (mitsuhiko, of Flask): mitsuhiko/pi-ds4. It handles process management for ds4-server — per-PID leases, watchdog shutdown, OpenAI-compatible local endpoint on 127.0.0.1:8000:

pi install https://github.com/mitsuhiko/pi-ds4

First-time install clones antirez/ds4, builds it, downloads the GGUF (~87 GB), and registers a ds4/deepseek-v4-flash model with pi. Subsequent runs spawn the server on demand and shut it down when no client process holds a lease. OpenClaw embeds pi, so the same extension can in principle load there.

pi from the terminal opens a TUI (“textual user interface” — I think that’s what it means, i.e. it lives in the terminal).

Audrey Tang maintains audreyt/pi-ds4, a fork that swaps in cyberneurova’s abliterated IQ2XXS quants and turns on uncertainty-mode directional steering by default — an activation-space edit that puts the model into “this is a contested question” mode on CCP-sensitive topics (Taiwan, Crimea, Kashmir, Western Sahara).

7.2 Manual setup for non-pi harnesses

Outside the pi ecosystem, the manual setup is four commands plus a config edit.

# antirez/ds4 for upstream; also audreyt/ds4 looks cool
# optimisations + steering-vector work — pick one
git clone https://github.com/audreyt/ds4
cd ds4
make
tmutil addexclusion -p (realpath ./gguf)
./download_model.sh                     # ~87 GB into ./gguf/
./ds4-server                            # listens on 127.0.0.1:8000

For lifecycle, we could wrap ./ds4-server in a launchd plist with KeepAlive: true; this is probably not what we want on a typical laptop where we do other things than inference, like, you know, use it as a laptop. I think pi is more automatic in that regard.

ds4-server’s context window is set at launch via --ctx <tokens> (max accepted per conversation); output length is a per-request API field, not a launch flag. --kv-disk-dir <path> (with --kv-disk-space-mb <n>) persists the KV cache to disk, so a prefix survives restarts and session switches rather than being reprocessed — durable prefix storage, not a long-context spill. Thinking mode is on by default, toggled per request, running DeepSeek’s reasoning mode. DeepSeek V4 Flash nominally supports 1M tokens, but ds4 is RAM-bound: the 2-bit IQ2XXS weights are ~81 GB, and a full 1M-token KV/index sits around 26 GB on top. Rough budget on unified memory:

64 GB: 50k–150k --ctx with headroom.
96 GB: 150k–250k works but is tight; quit Slack.
128 GB: 200k–300k is comfortable; >300k starts risking OOM.
1M: only with very generous memory and nothing else running.

If a client (Hermes, OpenClaw, OpenCode, anything OpenAI-compatible) advertises a context larger than --ctx, requests will get cut off — match the client’s contextWindow / limit.context to the server’s --ctx. DeepSeek’s sparse attention means raising --ctx doesn’t blow up compute the way dense attention would, but RAM is still a constraint. For most interactive coding, 32k–100k plus a retrieval layer beats brute-forcing the whole history into the prompt. See antirez/ds4’s README and the OpenClaw ds4 provider docs for the full flag list and client-side config.

Reasonable defaults:

./ds4-server --ctx 200000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192

To bake in audreyt’s directional-steering defaults:

DS4_DIR_STEERING_FFN=-0.75 \
DS4_DIR_STEERING_ATTN=0 \
DS4_REPRODUCIBLE=1 \
./ds4-server

To use with Hermes, add an OpenAI-compatible provider entry to the Hermes config (sketch — confirm the exact schema with hermes config):

# ~/.hermes/config.yaml
custom_providers:
- name: ds4
  base_url: http://127.0.0.1:8000/v1
  model: deepseek-v4-flash
  models:
    deepseek-v4-flash:
      context_length: 200000

From inside Hermes, /model ds4/deepseek-v4-flash — matching the provider name in the YAML. Done.

Anyway, this gets us a generic token endpoint, so we’re free to plug in whatever on the front end.

The protagonists of this play have a lot of clout — antirez (Redis), mitsuhiko (Flask), badlogic (libGDX), and audreyt (Taiwan’s former Digital Minister, Pugs / Perl 6). Some kind of critical mass seems feasible for a certain type of nerd.

8 Memory management

We need to think about how much memory our machine has overall, and how much of that it will let us use for MLX workloads.

On the first point, TIL that macOS’s “Memory Used” indicator does not measure how much RAM is committed in the way I assumed. It counts caching usage in some unproductive way. “Memory” — green / yellow / red in Activity Monitor, or Pages purgeable and Pages compressed from vm_stat — measures available RAM. macOS aggressively fills RAM with discardable file-cache pages. mactop is a handy resource monitor that doesn’t itself use too much memory.

On the second point, there are limits on how much any given process is allowed to take up of the precious system memory. macOS sets a hard limit on how much RAM Metal — and therefore MLX — is allowed to wire (lock into physically resident, GPU-accessible memory). The default is ~67% on Macs ≤36 GB and ~75% on larger ones. On a 128 GB Mac that means MLX refuses to allocate past ~96 GB, regardless of how much actually-free memory there is. Raise it at runtime:

# Cap MLX at 112 GB — leaves ~16 GB for the OS and other apps
sudo sysctl iogpu.wired_limit_mb=114688

# Confirm
sudo sysctl iogpu.wired_limit_mb

# Reset to default
sudo sysctl iogpu.wired_limit_mb=0

This does not persist across reboots — we would need to wrap it in a LaunchDaemon or /etc/sysctl.conf entry to make it sticky.

Setting it to the full 128 GB is not wise. If MLX wires more than the OS can spare, the machine kernel-panics.

The runtimes also manage this themselves, each in its own way: mlx-lm wires the memory occupied by model and cache when a model is large relative to RAM (macOS 15+), and the Swift stack under Osaurus exposes wired-memory policies and tickets that raise the process limit around active inference rather than pinning one fixed number.

But also, before launching a big run:

sudo purge flushes the file cache so the OS has clean room to allocate. Available RAM jumps; subsequent file I/O is slower until the cache refills.
Quit Electron apps. Slack, Discord, Cursor, VS Code, Chrome will routinely pin 4–8 GB each.
MLX_LM_CACHE_LIMIT=0 (env var) prevents MLX’s internal allocation cache from growing unboundedly during long sessions — useful for sustained embedding or agent workloads.

The weights are a fixed cost. The longer the session, though, the more memory we need to hold that too. Every token in the current context lives in the KV cache, which grows as the conversation does.

How much the session costs depends on the architecture. A classic dense transformer keeps a key and value vector per layer for every token, so the cache scales with context × layers × width at a couple of bytes apiece — gigabytes for a long context, with a 256K agentic loop running to gigabytes on its own. kipply’s inference-arithmetic post has the per-token formula; the apxml VRAM calculator looks it up per model, Apple Silicon included. Grouped-query attention already shrinks this, and sparse-attention or SSM-hybrid models (Nemotron-Cascade, DeepSeek V4) shrink it much further, so on those a long context costs far less RAM than dense attention does.²

When the cache is the part that will not fit, we may be able to quantize it, cap the context, or move to one of the cheaper architectures above.³

9 Feeding PDFs in

See PDF ingestion.

10 Let’s break things

I came to understand the transformers/llama.cpp split by breaking it.

The Hugging Face and Ollama versions of mxbai-embed-large are nominally the same model — same upstream weights — but each stack implemented its own tokenizer. On plain prose the two mostly agree, I think; on markdown they can disagree by a few percent on how many tokens a chunk takes. Best not to mix and match. For embeddings on this blog I went all-transformers. For chat through a live server — where tokenisation stays internal to one stack and we eyeball the output — Ollama is fine.

11 Excluding model dirs from backups and indexing

The model weights are enormous and waste space in backups. There’s no point

backing up a quantised .gguf we can pull again in two commands, nor
indexing .safetensors files for Spotlight — they are opaque binary blobs and Spotlight will spin happily for hours grinding nothing useful out of them.

oMLX’s SSD KV cache belongs on the list too — same opaque-blob logic, but it churns: blocks are written and evicted every session, so leaving it in Time Machine re-snapshots gigabytes on every hourly pass rather than once. Exclude ~/.omlx/cache specifically, not all of ~/.omlx, so the small settings.json next to it stays backed up.

Solution!

# One list, two background services to opt out of
model_dirs=(
  ~/.cache/huggingface                 # transformers, sentence-transformers, and mlx-lm/mlx_lm.server all cache here
  ~/.cache/modelscope                  # ModelScope cache (Alibaba’s HF; override: MODELSCOPE_CACHE)
  ~/.cache/uv
  ~/.ollama/models
  ~/.lmstudio
  "$HOME/Library/Application Support/Jan/data/llamacpp/models"
  "$HOME/Library/Application Support/Jan/data/mlx/models"
  ~/MLXModels                          # shared MLX served-models tree: Osaurus default (OSU_MODELS_DIR) + oMLX --model-dir
  ~/.mlxstudio/models                  # MLX Studio default
  ~/.omlx/cache                        # oMLX SSD KV cache — regenerable + high-churn; exclude this, not all of ~/.omlx
)

# Time Machine — sticky exclusion keyed to the path string
for d in "${model_dirs[@]}"; do
  [ -d "$d" ] && sudo tmutil addexclusion -p "$d"
done

# Spotlight — drop the Apple-documented marker file in each directory
for d in "${model_dirs[@]}"; do
  [ -d "$d" ] && touch "$d/.metadata_never_index"
done

# Confirm a few
tmutil isexcluded ~/.cache/huggingface
ls -la ~/MLXModels/.metadata_never_index

.metadata_never_index is the Apple-supported marker file that tells mds_stores to skip the directory and everything under it; the file is empty and the marker is the filename.

If we ever want to re-index a directory (a model dir promoted to “actual content”), rm .metadata_never_index and mdimport -r <dir> puts it back.

12 Incoming

antirez’s “DeepSeek-V4-Flash on a MacBook M5 Max” — the demo video for the DwarfStar section.
Vicki Boykis — Running local models is good now

Footnotes

The documentation claims it is ln -sf “/Applications/Osaurus.app/Contents/MacOS/osaurus” “$(brew --prefix)/bin/osaurus”; but I think this is a typo — that launches the app, not the CLI helper.↩︎
MoE does not help here — it trims the active weights, not the cache.↩︎
On llama-server, --cache-type-k q8_0 --cache-type-v q8_0 -fa roughly halves it; on Ollama, OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0. mlx_lm.server will not cap context itself, so the cap goes in the harness via limit.context / contextWindow.↩︎