Running LLMs locally on a Mac

Osaurus, Ollama, and the transformers path

2026-05-23 — 2026-05-23

In Which Osaurus Is Recommended as the macOS GUI for Local Language Models, Ollama as Its Embedding-Model Companion, and Hugging Face Transformers as the Programmatic Path, with a Cautionary Tokenizer Mismatch Between the Latter Two.

computers are awful
machine learning
neural nets
NLP
UI
Figure 1

A twin post to front-end clients for AI image models, but for text. The local-LLM ecosystem on Macs is pretty luxurious, with a profusion of GUI options and Linux-y infra, and some specialised tooling that lags the community frontier but is not bad. Also, during the 2026 ramageddon, Macs suddenly look like remarkably good deals for high-RAM parallel-compute machines. I accidentally started going unreasonably deep and technical on this in the SOV repo. That repo really targets my coding assistant. Here is a human-facing version. I made a different one for the purely recreational use of image generation.

1 The stack

Useful vocabulary for the rest of this page, from the weights upward:

Model
the weights themselves — Qwen, DeepSeek, mxbai-embed, etc. Distributed as .safetensors from Hugging Face.
Quantization format
how the weights are stored on disk — GGUF for llama.cpp, MLX safetensors for MLX, JANG for Jinho Jang’s stack. Smaller files, slightly less faithful inference vs the original.
Runtime / inference engine
the code that runs the matmuls — llama.cpp, MLX, mlx-lm, vmlx-swift-lm, antirez’s ds4. Where the GPU work happens.
Server / daemon
a long-lived process that wraps the runtime in an OpenAI-compatible HTTP endpoint — ollama serve, llama-server, mlx_lm.server, Osaurus, ds4-server.
Harness / agent loop
the orchestration layer over the server — manages conversation state, tool calls, system prompts, multi-turn agent loops. Osaurus has one built in; pi is a popular standalone one; Aider, OpenCode, and Claude Desktop sit in the same role.
Frontend / chat client
the human-facing surface — Osaurus’s chat window, Jan, LM Studio, MLX Studio, various web UIs.

Most apps on this page are vertical bundles — Osaurus is frontend + harness + server + runtime in one, Ollama is server + runtime — but it pays to know which layer we’re looking at when an abstraction leaks (the tokenizer divergence below is a good example).

2 Just chat with a model

The fastest path from zero to local-LLM is a desktop app that bundles a model browser, a chat window, and an inference engine. There seem to be several main contenders. The Mac-native Osaurus is IMO the best, but it is a risky choice.

2.1 Osaurus

Osaurus (MIT, brew install --cask osaurus, osaurus-ai/osaurus) is Swift-native, no Electron, no Python, and behaves like a proper Mac app.

I’m YOLOing all in on this because it seems efficient and easy. It locks us in to the Mac ecosystem so might not be for everyone. Also, it’s run by one person, so the bus factor is 1, which is risky. But you guys — it’s so good!

The window has a model picker, a chat pane, a status indicator; the inference engine underneath is MLX (Apple’s own array framework for the M-series), so it gets fast tokens-per-second, fully leveraging the hardware.

Osaurus is also a server — it exposes OpenAI-, Anthropic-, and Ollama-compatible HTTP endpoints on localhost:1337, seemingly all three at once, which is some deep magic. Once we have it installed, anything else we want to point at a local model (code editor, embedding pipeline, Claude-Code-style tool) can talk to this local endpoint. There is also an agent harness with MCP server-and-client; think Claude Desktop, maybe slightly rougher but with some extra features.

Osaurus is also the intended runtime for a custom mixed-precision MLX quantization format called JANG.

2.2 Jan

Jan is a FOSS cross-platform option. It looks nice. The UI is built on the mildly cursed Electron, but the plus side is that it runs on Linux, Windows, and macOS. It supports both llama.cpp (via Cortex) and MLX backends. For the cross-platform stuff it looks pretty good. For brute-force running non-Apple-optimised models, it seems to go.

2.3 LM Studio

LM Studio is closed-source. It seems relatively slick and turnkey. It is not free for commercial use. Still runs llama.cpp underneath. I’m mildly skeptical of it simply because so many Cool LLM Technologies have special bug fixes or alternate install paths for LM Studio. It might be that this is just sampling bias, and more people file LM Studio bug reports because more people have LM Studio. But I suspect it has a slightly non-standard stack. [TODO clarify]

3 Serving a model headless

Once we want a model serving as a daemon (“token fountain” we say at my work) rather than a chat window — a code editor, an embedding pipeline, a script that calls out to a local model — we want a long-lived daemon with an OpenAI-compatible API.

If we already have Osaurus running we are mostly done; it is already that daemon, listening on localhost:1337 with three flavours of compatible API. But also, for reasons of not installing an idiosyncratic stack, or superior customization, we might want to install the standard Linux server stack.

3.1 Ollama

Ollama (brew install ollama) is a llama.cpp wrapper with its own model registry — fast enough, wide model coverage, and notably good for embedding models :

brew services start ollama
ollama pull qwen3  # LLM/chats etc
ollama pull mxbai-embed-large    # also handles embeddings

Anything OpenAI-API-compatible can now point at http://localhost:11434/v1. It’s smart to make it evict idle models, or it will load up many and tank the machine. OLLAMA_KEEP_ALIVE=15m does that.

Gotchas:

3.2 (even-more-)Power-user options

llama.cpp itself ships a server: llama-server -m model.gguf exposes the same OpenAI-compatible endpoint with no daemon, no registry, no opinions. brew install llama.cpp. I’m not really across when this would seem like a good idea? Ultrahobbyists? Trying to get a job at Anthropic? If I were going this deep I’d probably run it via transformers in my own Python process.

Apple Silicon optimised: mlx-lm ships its own server too: uv tool install mlx-lm then mlx_lm.server --model mlx-community/<repo>. One model per process; restart to switch.

4 Programmatic access via transformers

When we want to do things to a model — embed text, fine-tune, run interpretability tools, sample from internal layers, anything that touches the model internals — we drop down to Hugging Face transformers in our own Python process. Also transformers is just a process I can run; it doesn’t need or want to be a server.

Embeddings for search are why I currently do this. For this I want sentence-transformers, a thin wrapper around transformers that exposes the embedding API:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
embs = model.encode(texts, normalize_embeddings=True, convert_to_numpy=True)

The first call downloads the weights, the tokenizer config (tokenizer.json), and the model config from huggingface.co into ~/.cache/huggingface/hub/. Inference runs in our process via PyTorch. Tokenization runs in the same process, via HF’s Rust tokenizers library reading the same tokenizer.json the model was published with. One process, one library, one set of files.

On Apple Silicon we get a 15× speedup over fp32 by switching to float16 on MPS, with indistinguishable quality:

import torch
gpu = torch.cuda.is_available() or torch.backends.mps.is_available()
kwargs = {"model_kwargs": {"dtype": torch.float16}} if gpu else {}
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1", **kwargs)

The fp16 path matters on GPU or Apple Silicon, otherwise stay in fp32.

For text generation (rather than embedding) the equivalent is AutoModelForCausalLM.from_pretrained(...). PyTorch is rarely the fastest path on Apple Silicon — llama.cpp and MLX usually win on tokens-per-second — but it is the path that lets us see what the model is doing without arsing around. Activations, attention patterns, hidden states, custom sampling etc. are all possible from the Python prompt.

5 The JANG ecosystem

Osaurus wraps osaurus-ai/vmlx-swift-lm, a Swift MLX inference engine. That engine is a Swift port of jjang-ai/vmlx, a Python engine. Both load weights from the JANGQ-AI Hugging Face org, a zoo of mixed-precision quantised models in a custom format called JANG, defined in jjang-ai/jangq. The same person wrote each of those — Jinho “Eric” Jang (Irvine, California; also Osaurus’s lead/only engineer). There is a parallel desktop app, MLX Studio, by the same author, running the Python engine and surfacing more experimental features (image generation, agentic tool calling, in-app model conversion). The Jang family of enterprises is a tightly integrated stack: runtime, quant format, model zoo, two GUIs — one developer.

Jang is, on the visible evidence, a gun coder. His public GitHub profile records ~4,000 contributions in the last year, which is a lot even for AI-assisted coding, assuming it actually works (which it does). His JANG repo lists a steady stream of model-architecture support landing on the order of days after each new release. On Apple Silicon at the high end he is doing things nobody else is doing.

5.1 Tech stack

JANG (“Jang Adaptive N-bit Grading”) is mixed-precision quantisation for MLX. Standard MLX quantisation compresses every tensor to the same bit width. JANG classifies tensors by sensitivity — attention and MoE router layers (small share of params, large share of model behaviour) get 6–8 bits; expert MLPs get 2–4. The hybrid network is a mildly extended version of the standard MLX safetensors format with a per-tensor bit-width manifest. At the same total size, accuracy improves, notionally. The pitch is “GGUF for MLX”, which … sounds fine? I’m not really competent to judge. Apparently llama.cpp’s K-quants do something similar.

jangq.ai claims impressive performance: at time of writing, JANG_2L at 82.5 GB scoring 74% MMLU against MLX 4-bit at 119.8 GB scoring 26.5% on MiniMax-M2.5; 397B-parameter models fitting on 128 GB Macs. The page also notes it has been “filtered to decisive smaller wins only” — close comparisons and unfavourable cases are not shown — so calibrate accordingly. At least one third-party benchmark is pretty impressed.

Osauraus is JANG native. Pull a model from JANGQ-AI and it loads. Elsewhere, support is partial: MLX Studio and vMLX natively, LM Studio / Ollama / Jan not yet. From Python: uv pip install “jang[mlx]”, then jang_tools.loader.load_jang_model(...).

5.2 Also it is one lone genius

Jang is a guy who owns every layer — quant format, both runtimes, both desktop apps, the model zoo on Hugging Face. Tight vertical integration buys fast iteration and a coherent feature set across the chain. Concretely: if Jang loses interest, switches jobs, gets hit by a bus etc., JANG quants and JANG-format model files become abandonware. There is some community wariness about this — see the r/LocalLLaMA “Is MLX Studio legit?” thread. That said, it is all open source, so we can potentially maintain it if he bounces.

5.3 MLX Studio, briefly

MLX Studio is the same author’s other Mac desktop app — Electron + Python rather than Swift, broader feature surface (image generation via Flux and Z-Image, ~26 built-in agentic tools, in-app GGUF→MLX and MLX→JANG conversion, an Anthropic-compatible API). It doesn’t quite land for me, at least not in comparison to Osaurus.

6 Antirez and DwarfStar

Filed under the same “one developer, narrow Mac-only stack” archetype as JANG, but at the opposite end of the breadth axis: Salvatore Sanfilippo — antirez, the author of Redis — has been writing custom Apple Silicon inference code to run DeepSeek V4 Flash on a 128 GB MacBook.

The story has two acts.

First (April 2026): a fork of llama.cpp at antirez/llama.cpp-deepseek-v4-flash, with 2-bit quantisation targeting 128 GB Macs, plus the matching GGUF at antirez/deepseek-v4-gguf. At that point DeepSeek V4 was so new that upstream llama.cpp had no support, MLX had none, Ollama listed it only as a cloud model, and LM Studio threw Unsupported safetensors format on the weights.

Second (May 2026): a from-scratch native Metal inference engine called ds4 (DwarfStar 4). The README is explicit about the scope — “a small native inference engine specific for DeepSeek V4 Flash. It is intentionally narrow: not a generic GGUF runner.” Sidesteps both mlx-lm and llama.cpp; targets M3 Max, M3 Ultra, and M5 Max specifically. Reported numbers from independent benchmarks: ~14–15 tok/s decode at 62K context on an M3 Max 128 GB, ~450 tok/s prompt-processing on an M5 Max for a 10k-token codebase.

The same solo-developer caveats from the JANG section apply, more acutely — this is a single-model runtime by one person, and the model in question (DeepSeek V4 Flash) has been out for a few weeks. But it is a useful counter-example to the assumption that Apple Silicon LLM infra has to be a broad ecosystem play. When the model matters enough and the developer is sharp enough, very narrow vertical engines can land first — frontier-class inference on a MacBook, no Python, no PyTorch, no MLX, weeks before any of the broader stacks catch up.

6.1 Running DwarfStar via the pi stack

The way DwarfStar gets from git clone antirez/ds4 and make to “I can use this from my editor” is via a harness: pi, an MIT-licensed agent harness by Mario Zechner (badlogic, of libGDX fame). pi sits at the harness layer in the stack vocabulary above — it runs tool-call loops, manages conversation state, and talks to whichever OpenAI-compatible endpoint we point it at. That happens to be DwarfStar here, but pi is a general-purpose tool worth knowing about in its own right — npm-installed coding-agent CLI, a unified multi-provider LLM API (OpenAI, Anthropic, Google, local), an agent runtime with tool calling, a TUI library, the lot.

Two layers to keep straight: ds4-server is the server (a generic OpenAI-compatible endpoint on 127.0.0.1:8000); pi is the harness and its own frontend (the pi TUI is the chat surface). The two are independent. Anything that speaks the OpenAI API can use ds4-server as a backend without pi at all — Osaurus pointed at the port via its custom-provider setting, Cursor or Continue configured with a local OpenAI provider, a curl one-liner from a bash script. pi is one good client choice on top of that endpoint, the one we would pick for coding-agent workflows; if we want a chat window instead, point Osaurus at the same port; if we want both at once, run both.

The DwarfStar wiring is a pi extension by Armin Ronacher (mitsuhiko, of Flask): mitsuhiko/pi-ds4. It handles process management for ds4-server — per-PID leases, watchdog shutdown, OpenAI-compatible local endpoint on 127.0.0.1:8000:

pi install https://github.com/mitsuhiko/pi-ds4

First-time install clones antirez/ds4, builds it, downloads the GGUF (~87 GB), and registers a ds4/deepseek-v4-flash model with pi. Subsequent runs spawn the server on demand and shut it down when no client process holds a lease.

Audrey Tang maintains audreyt/pi-ds4, a fork that swaps in cyberneurova’s abliterated IQ2XXS quants and turns on uncertainty-mode directional steering by default — an activation-space edit that puts the model into “this is a contested question” register on prompts the unsteered model would emit a memorised closed-form answer to (Taiwan, Crimea, Kashmir, Western Sahara). Per the fork’s README, a hedge-style system prompt alone does not flip the closed-form completion; the steering vector does, and the system prompt then supplies the specific positions for the model to draw from. The README has the full discussion, a worked example, and the env-var knobs to turn it off.

So in this one corner of the Apple-Silicon LLM world we have antirez (Redis), mitsuhiko (Flask), badlogic (libGDX), and audreyt (Taiwan’s former Digital Minister, Pugs / Perl 6) all converging on one model and one MacBook chip family. That’s unusual talent density for a single-purpose runtime.

7 Memory tricks

TIL macOS’s “Memory Used” indicator does not measure how much RAM is committed in the way I assumed. It counts caching usage in some way. “Memory” — green / yellow / red in Activity Monitor, or Pages purgeable and Pages compressed from vm_stat measures available RAM. macOS aggressively fills RAM with discardable file-cache pages.

There is a separate, hard cap that bites specifically for our LLMs.

7.1 The wired-memory ceiling

Apple sets a hard limit on how much RAM Metal — and therefore MLX — is allowed to wire (lock into physically resident, GPU-accessible memory). The default is ~67% on Macs ≤36 GB and ~75% on larger ones. On a 128 GB Mac that means MLX refuses to allocate past ~96 GB, regardless of how much actually-free memory there is.

Raise it at runtime (Sonoma 14.x+):

# Cap MLX at 112 GB — leaves ~16 GB for the OS and other apps
sudo sysctl iogpu.wired_limit_mb=114688

# Confirm
sudo sysctl iogpu.wired_limit_mb

# Reset to default
sudo sysctl iogpu.wired_limit_mb=0

This does not persist across reboots — we would need to wrap it in a LaunchDaemon or /etc/sysctl.conf entry to make it sticky.

Setting it to the full 128 GB in that case would be bad. If MLX wires more than the OS can spare, the machine kernel-panics.

mlx_lm.server and (via its inheritance) vmlx-swift-lm — so Osaurus — call mx.set_wired_limit() on startup. The Swift side documents wired-memory policies and defaults to 112 GB on 128 GB machines, which seems like a good default for general use. Before launching a big run:

  • sudo purge flushes the file cache so the OS has clean room to allocate. Available RAM jumps; subsequent file I/O is slower until the cache refills.
  • Quit Electron apps. Slack, Discord, Cursor, VS Code, Chrome will routinely pin 4–8 GB each.
  • MLX_LM_CACHE_LIMIT=0 (env var) prevents MLX’s internal allocation cache from growing unboundedly during long sessions — useful for sustained embedding or agent workloads.
  • mactop is a good resource monitor that doesn’t itself use too much memory.

8 Let’s break things

I came to understand the transformers/llama.cpp split by breaking it.

The Hugging Face and Ollama versions of mxbai-embed-large are nominally the same model — same upstream weights — but each stack carries its own re-implementation of the tokenizer. On plain prose the two agree; on markdown they can disagree by a few percent on how many tokens a chunk takes.

I tripped over this by chopping documents to fit Ollama’s 512-token input limit using HF’s tokenizer to measure them — and watching Ollama reject chunks HF had measured as safely under the limit. Ollama has no (AFAICT) way to ask “how many tokens is this?”, so there’s no way to reconcile the two from outside.

Lesson, IMO: do not mix and match. For embeddings on this blog I went all-transformers — one process, one tokenizer, no IPC, no version skew. For sloppy stuff like chat, Ollama is fine.

9 Excluding model dirs from background services

Model weights are the bulk of what these tools write to disk, all of it re-downloadable. There is no point backing up a quantized .gguf we can pull again in two commands, and no point indexing .safetensors files for Spotlight — they are opaque binary blobs and Spotlight will spin happily for hours grinding nothing useful out of them. The same path list serves both jobs.

# One list, two background services to opt out of
model_dirs=(
  ~/.cache/huggingface
  ~/.cache/uv
  ~/.ollama/models
  ~/.lmstudio
  "$HOME/Library/Application Support/Jan/data/llamacpp/models"
  "$HOME/Library/Application Support/Jan/data/mlx/models"
  ~/MLXModels                          # Osaurus + MLX Studio default
)

# Time Machine — sticky exclusion keyed to the path string
for d in "${model_dirs[@]}"; do
  [ -d "$d" ] && sudo tmutil addexclusion -p "$d"
done

# Spotlight — drop the Apple-documented marker file in each directory
for d in "${model_dirs[@]}"; do
  [ -d "$d" ] && touch "$d/.metadata_never_index"
done

# Confirm a few
tmutil isexcluded ~/.cache/huggingface
ls -la ~/MLXModels/.metadata_never_index

Both tools require the directory to exist before they will accept it — hence the [ -d "$d" ] guard. tmutil -p records a sticky exclusion keyed to the path string, which is what we want for fixed locations. .metadata_never_index is the Apple-supported marker file that tells mds_stores to skip the directory and everything under it; the file is empty and the marker is the filename.

If we ever want to re-index a directory (model dir promoted to “actual content”), rm .metadata_never_index and mdimport -r <dir> puts it back.

10 See also