Front-end clients for AI image models
ComfyUI, InvokeAI, Draw Things, ChaiNNer and friends
2022-09-16 — 2026-05-22
In Which the Choice of Runtime on Apple Silicon — PyTorch, MLX, CoreML, or Custom Metal — Is Shown to Determine Which Models Fit in RAM and Which LoRAs May Be Stacked.
Front-end software for running AI image models on our own hardware — generation, editing, inpainting, upscaling, and chaining the lot together.
To assemble a whole stack we might use:
- generative image models for the text-to-image generation side (Stable Diffusion, Flux, SDXL and friends);
- editing images with machine learning for the editing side (FLUX.1 Kontext, Qwen-Image-Edit, Nano Banana, single-task tools);
- AI democratization for the community and its spicy back-stories (Stability AI → Black Forest Labs, CivitAI, the open-vs-corporate fault line).
For non-ML editing — vips, ImageMagick, GIMP scripting — see editing images using code and editing images using a GUI.
I am principally interested in clients that
- work on macOS
- run on my local machine (i.e. use my local GPU)
Most of the options here do something like that. Some of them only work on Very Serious Infrastructure Which I Cannot Afford To Own But Might Rent In The Cloud, which is a secondary concern for me personally, but there are some notes on that too.
1 How models actually run on a Mac
Four runtimes coexist on Apple Silicon and none has won. Picking a client mostly means picking which runtime it uses.
- PyTorch + MPS (Metal Performance Shaders) — what most Python ML code uses on Apple Silicon. Coverage of new model architectures is fastest because PyTorch is the lingua franca; raw speed is decent but not optimal. ComfyUI, InvokeAI and the A1111-family forks pick this path.
- Apple MLX — Apple’s own array framework. Faster than MPS where it has coverage; coverage lags PyTorch by months. On image clients it shows up as add-on node packs (ComfyUI-MLX, mflux-ComfyUI, mflux standalone). Trades coverage for speed.
- Apple Core ML — lowest RAM footprint, runs on the Neural Engine, but every model needs offline conversion via `coremltools`. Mochi Diffusion is the only mainstream consumer; this is why its model list lags everything else.
- Custom Swift + Metal kernels — Draw Things sidesteps all of the above with its own implementation. Faster than MPS, broader coverage than CoreML, no install hoops; the price is closed source.
For Flux- and Qwen-class models on a 16 GB Mac, GGUF quants are the difference between “runs” and “doesn’t run”. GGUF is a compact file format for quantized model weights, borrowed from the llama.cpp world; quantizing stores the weights at lower numerical precision so the model takes up less memory, at some cost to quality. ComfyUI handles them via ComfyUI-GGUF, Draw Things handles them natively, others have partial support.
1.1 Choosing a runtime
The choice depends on a few axes: how much RAM we have, the model’s size, whether a quantized or MLX build has actually been published for it, and whether we are stacking CivitAI LoRAs. Older models are more likely to have these builds available. Extra speed is never a downside, so it is always worth checking for an MLX build. The table and decision rules that follow summarize the options and the trade-offs of each. Draw Things sidesteps the whole thing (it picks precision and manages memory itself), so this is really the ComfyUI-level decision.
Rough sizing, to know which RAM column I am in: parameters × 2 bytes for fp16, plus the text encoder and VAE and a few GB of working overhead. FLUX.1 dev ≈ 24 GB + ~9 GB for the T5; SDXL ≈ 6.5 GB; FLUX.2 dev ≈ 64 GB.
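The sizing rule is easy to turn into a back-of-envelope calculation. A minimal sketch, assuming 12B parameters for FLUX.1 dev (implied by the ≈ 24 GB figure) and a 4 GB working-overhead allowance that is a guess rather than a measurement; `fp16_gb` and `total_gb` are illustrative helper names, not part of any tool:

```shell
# fp16 stores each parameter in 2 bytes, so 1 billion parameters is about 2 GB.
fp16_gb() {
    # $1 = parameter count in billions
    awk -v p="$1" 'BEGIN { printf "%.1f\n", p * 2 }'
}

# Resident footprint: weights + text encoder/VAE + working overhead, all in GB.
total_gb() {
    # $1 = weights, $2 = text encoder and VAE, $3 = overhead (assumed)
    awk -v w="$1" -v t="$2" -v o="$3" 'BEGIN { printf "%.1f\n", w + t + o }'
}

flux1=$(fp16_gb 12)    # FLUX.1 dev, 12B parameters
echo "FLUX.1 dev weights:  ${flux1} GB"
echo "FLUX.1 dev resident: $(total_gb "$flux1" 9 4) GB"   # ~9 GB T5, 4 GB assumed overhead
echo "FLUX.2 dev weights:  $(fp16_gb 32) GB"              # 32B parameters
```

Comparing the resident figure against installed RAM says which row of the table is even in play.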
| Runtime | Cuts memory to fit? | Speed vs MPS | CivitAI LoRA stacking | Weights it requires |
|---|---|---|---|---|
| PyTorch + MPS, full precision | No | Baseline | Clean | any .safetensors |
| PyTorch + MPS + GGUF | Yes — its whole job | ≈ baseline (bandwidth, not compute) | Clean | a published .gguf quant |
| MLX (mflux / ComfyUI-MLX) | Optional — can quantize, but not the point | Fastest — wins on compute (~30–70% off) | Friction — own format, and LoRA-loading and quantization are mutually exclusive | an mlx-community build |
| CoreML (Mochi Diffusion) | Yes — lowest footprint of all | ANE; not top speed on large models | None | a converted bundle |
| Draw Things | Automatic | Fast (own Metal stack) | Clean (auto-converts) | its own (handled for us) |
In full sentences:
- Does a build for this model exist in that path at all? The “Weights it requires” column says what to look for — check the model directly on Hugging Face or Civitai. As a rough prior on whether the search pays off: PyTorch + MPS works almost immediately on model release because it takes raw `.safetensors`; popular DiT models get GGUF quants within days; MLX and CoreML ports lag by months and are often absent.
- Does it fit in RAM? If full precision overflows (the sizing above), the full-precision row is out and we need a memory-cutting path — GGUF, or CoreML on the smallest machines.
- Are we stacking CivitAI LoRAs? If so, AFAICT MLX and CoreML are out; we want PyTorch + MPS, full or GGUF.
If more than one path is open: prefer MLX where a build exists, because it is the fastest; use full bf16 for top quality with no fuss; use GGUF when a large model has to fit, taking the largest quant that fits (Q5_K_M over Q4_K_S).
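The decision rules can be written down as a tiny function. This is a deliberately simplified sketch of the three questions, not a complete policy: `choose_runtime` and its yes/no arguments are hypothetical, and it ignores CoreML and Draw Things, which sit outside the ComfyUI-level choice.

```shell
# Arguments are all "yes" or "no":
#   $1 = full precision fits in RAM
#   $2 = an MLX build of the model exists
#   $3 = a GGUF quant of the model exists
#   $4 = we are stacking CivitAI LoRAs
choose_runtime() {
    fits_full=$1
    has_mlx=$2
    has_gguf=$3
    stacking=$4
    if [ "$fits_full" = yes ]; then
        # MLX is the fast path, but LoRA stacking rules it out.
        if [ "$has_mlx" = yes ] && [ "$stacking" = no ]; then
            echo "mlx"
        else
            echo "mps-full"
        fi
    elif [ "$has_gguf" = yes ]; then
        echo "mps-gguf"     # memory-cutting path
    else
        echo "none"         # nothing fits; look at another client or model
    fi
}

choose_runtime yes yes yes no
choose_runtime no yes yes yes
```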
1.2 Client-level content filters
The runtimes above determine how a model runs; a separate question is whether the client adds a content filter on top. The two problems are independent — the same model weights can behave differently across clients depending on what the client loads alongside them.
| Client | Filter added? | Mechanism | How to remove |
|---|---|---|---|
| ComfyUI | No | The graph executes what is wired; nothing added by the application | N/A |
| InvokeAI | Yes | NSFW blur on generated output | --no-nsfw_checker at launch |
| A1111 / Forge Neo / SDNext | Yes | safety_checker CLIP classifier; blacks out on detection | --disable-nsfw-filter flag |
| Draw Things | No | Defers to loaded model’s behaviour | Load an unfiltered base |
| Mochi Diffusion | No | Defers to the CoreML bundle | Converted model must be permissive |
The safety_checker used by the diffusers-based clients is a separate model — a small CLIP classifier not baked into the generation model’s weights. Removing it does not modify the generation model; it removes the post-generation output filter only.
For background on which base models and fine-tunes carry different alignment profiles in their weights, see NN models’ Creative latitude summary.
1.3 Memory budget by hardware class
The available memory determines which models and quantization levels are practical. Rough observed characteristics across the range, from the cheapest Apple Silicon to a maxed-out Studio:
16 GB Apple Silicon
At this tier quantization stops being optional — it is the price of running anything Flux-class at all. FLUX schnell via GGUF Q5_K_M (8.3 GB) or Q6_K (9.8 GB) fits comfortably, and Q4_K_S (6.8 GB) trades some quality for headroom. SDXL, its fine-tunes and Pony V6 XL run at fp16 (16-bit floating point — the usual unquantized precision) at roughly 6.5 GB of weights, LoRAs adding negligible overhead; SD 1.5 needs ~2 GB; Z-Image Turbo and FLUX.2 [klein] 4B (fp8 ~4 GB) also fit, with room for the text encoder and VAE. FLUX dev does not fit at full precision: GGUF Q4_K_S is the floor. The squeeze specific to Flux is the T5 text encoder — a separate ~9 GB component — so on 16 GB we run a quantized T5 alongside the quantized DiT, or neither fits.
Corollaries:
- FLUX.2 dev (32B) is out of reach at any quantization on this tier.
- One model resident at a time; switching backbones is a reload, not a fast swap.
- Local video (Wan, Hunyuan, LTX) is impractical — expect to run out of memory (OOM).
- Output resolution is itself part of the budget, so the 4 MP / 2 MP ceilings the larger models reach shrink here; plan to generate near 1 MP and upscale.
- Close other applications before inference. Draw Things manages the pressure automatically at some throughput cost; ComfyUI will OOM if overcommitted.
Below 16 GB — the 8 GB base Macs, the cheapest Apple Silicon — is SD 1.5 territory at 512 px, and the tier where Mochi Diffusion’s slim CoreML path makes sense: its ~150 MB working set leaves more of the budget for the model itself. SDXL runs but swaps painfully, and Flux-class is effectively off the table. People do generate on these machines using a combination of small models, small images, and patience.
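Taking “the largest quant that fits” is mechanical once we know how much memory is left for the DiT weights. A sketch using the three quant sizes quoted above; the 3 GB reserved for macOS and other apps and the 4 GB for a quantized T5 plus VAE are assumptions for illustration, and `pick_quant` is a hypothetical helper:

```shell
# Print the largest listed GGUF quant whose file size fits the given budget.
pick_quant() {
    # $1 = GB available for the DiT weights alone
    awk -v budget="$1" 'BEGIN {
        n = split("Q6_K:9.8 Q5_K_M:8.3 Q4_K_S:6.8", q, " ")   # largest first
        for (i = 1; i <= n; i++) {
            split(q[i], kv, ":")
            if (kv[2] + 0 <= budget + 0) { print kv[1]; exit }
        }
        print "none"
    }'
}

ram=16        # installed RAM in GB
reserved=3    # assumed: macOS and whatever else is open
encoder=4     # assumed: quantized T5 plus VAE
budget=$(awk -v r="$ram" -v s="$reserved" -v e="$encoder" 'BEGIN { print r - s - e }')
echo "DiT budget: ${budget} GB -> $(pick_quant "$budget")"
```

Changing the reserved figures moves the answer quickly, which is why closing other applications matters at this tier.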
128 GB Apple Silicon
Full-precision FLUX dev or schnell without quantization runs without drama, as do full ControlNet stacks resident in memory simultaneously. Multiple models can be loaded at once and switching between them is fast. Video models (Wan 2.2, Hunyuan) run via ComfyUI without special handling. Qwen-Image-Edit at full precision and SDXL with multiple simultaneous LoRA stacks are both straightforward at this scale. With memory no longer the constraint at this tier, the remaining free parameter is throughput — which is where MLX earns its keep, via mflux or ComfyUI-MLX for Flux-class work. FLUX.2 [dev] (32B) at full fp16 sits at around 64 GB and fits within the available headroom; FLUX.2 [klein] 9B is trivially small at this scale.
2 Choosing a client
Once the model lineages have narrowed down which model to run, what remains is the “which app” decision (Mac, 2026). There is a trade-off between flexibility and usability.
| Tool | Mac install | Indications |
|---|---|---|
| Draw Things | App Store, signed | Mac-native sweet spot |
| ComfyUI (+MLX) | Python+MPS or +MLX add-ons | Bleeding-edge, max compatibility, node-graph |
| InvokeAI | Official installer (MPS) | Canvas studio with inpaint brushes |
| Mochi Diffusion | Drag-drop, CoreML conversion required | Low-RAM CoreML |
| Forge Neo / SDNext | Python+MPS | A1111-style WebUI, current models |
| ChaiNNer | Electron + Python+MPS | Upscale/restore pipelines (no diffusion) |
Backbone support (May 2026); bold marks the default pick for each backbone:
| Tool | SD/SDXL | Flux dev/schnell | FLUX.1 Kontext | Qwen-Image-Edit | Video (Wan/Hunyuan/LTX) |
|---|---|---|---|---|---|
| Draw Things | ✓ | ✓ | ✓ | ✓ (2509/2511/Layered) | Wan |
| ComfyUI | ✓ | ✓ (incl. Flux 2) | ✓ | ✓ | Wan / Hunyuan / LTX / + |
| InvokeAI | ✓ (incl. SD3.5) | ✓ (+Krea / Redux / Fill, Flux 2 Klein) | ✓ | ✓ | – |
| Mochi Diffusion | ✓ | Flux 2 Klein only | – | – | – |
| Forge Neo | ✓ | ✓ | – | ✓ | Wan 2.2 |
| ChaiNNer | – | – | – | – | – |
Draw Things is a lazy default for most backbones — App Store install, no runtime fuss, and its own Metal stack is already fast. We might switch to ComfyUI on Flux or Kontext for node-level control, the newest architectures, or end-to-end pipelines that post-process what they generate — say generate an emoji, matte out the background, and export to a target format in one graph — and for any serious local video; InvokeAI is the pick when the job is canvas-and-mask editing rather than pure generation. Upscale-only and hosted work fall to ChaiNNer / Spandrel and Runway, outside this grid.
3 Generation + editing GUIs
These run text-to-image generation and image-to-image / inpainting on top of the same backbone (Stable Diffusion, SDXL, Flux, and so on). When the editing notebook talks about “mask + prompt” inpainting workflows, this is what it means.
Two terms recur in the walkthroughs below. A base model (or backbone) is the full generative model we load — text encoder, denoiser, and VAE, anywhere from a few to dozens of GB. A LoRA is a small add-on (often under 300 MB) trained against one backbone to push its style or subject; it does nothing on its own and only loads against a compatible base. The model lineages cover which backbones exist; the niche-LoRA menu covers the add-ons worth auditioning.
3.1 Draw Things
The most actively-maintained Mac-native client in the field; the closest thing to “ComfyUI’s model coverage with DiffusionBee’s install ease”. If we don’t already have a reason to prefer something else, this is the sensible default.
3.1.1 Walkthrough — adding a CivitAI LoRA on top of Flux
Suppose we want to try YFG Patents, the patent-figure Flux LoRA from the niche-LoRA menu. The full path from a clean install:
- Install the app — one click from the Mac App Store. The binary is signed, sandboxed, and has a download menu for the models it supports.
- Pull a base model — in the Settings tab, click Manage to open the Models panel and pick `FLUX.1 [dev]` from the built-in list; it downloads on selection. The weights arrive pre-converted to the app’s own internal format (a few GB), with no `safetensors → diffusers → CoreML` dance like Mochi Diffusion demands.
- Pull the LoRA — in that same Models panel, click Import Model → Enter URL… and paste the CivitAI link, setting the type to LoRA. The URL route sometimes fails; if it does, download the `.safetensors` from CivitAI and use Select from Files instead. Either way the app converts it to its internal format and tags it as a LoRA against the matching backbone (Flux dev in this case). LoRAs from other backbones don’t appear when a Flux model is loaded — fewer ways to wire it up wrong.
- Generate — the LoRA appears in the LoRA panel with a strength slider. Set it around 0.7–1.0 to start, write a prompt (`an exploded view of a kettle, technical diagram, numbered callouts`), hit generate.
The same pattern works for Z-Image Turbo LoRAs like Perfect Ink Drawing, with the added payoff that Z-Image Turbo sounds like it might be faster.
Hugging Face offers a slicker route for anything hosted there: a compatible model page has a Use this model → Draw Things button (part of HF’s local-apps integration) that deeplinks the weights straight into the app, skipping the manual import above. CivitAI has no such hook, so its models and LoRAs go through Import Model by hand.
Backbones: SD 1.5, SDXL, SD3, Flux dev/schnell, FLUX.1 Kontext for instruction edits, Qwen-Image-Edit 2509/2511 and Qwen-Image-Layered, Wan video (multiple variants), HiDream, Z-Image Turbo.
Affordances: LoRA loading and local LoRA training, ControlNet, inpainting, outpainting, infinite canvas. Memory pressure is managed automatically — Draw Things falls back to a smaller working set rather than OOM, at some cost to throughput. 5–7 GB of working memory typical; runs comfortably on a 16 GB Mac.
Failure mode: lags ComfyUI by weeks, not months, on genuinely novel architectures. We pay for the polish with closed source.
3.2 ComfyUI
A visual node-graph workflow system, first to support every new architecture (usually within days of release), and the power-user pick when we want to save, parameterize, re-run, or debug a workflow at the node level. Steepest learning curve in this list.
3.2.1 Two installation paths
There are two ways onto the platform: a desktop app and a Python source install. They run the same engine and we can share models between them; the difference is in how much of the Python plumbing we touch.
The ComfyUI Desktop DMG is an Electron wrapper around the upstream project. On first launch it asks where to put the install directory, then sets up a self-contained Python environment inside it, downloads PyTorch with MPS support, and brings up the node editor. It is not a sealed bundle — there is an in-app terminal, and we can drop into that environment later to add packages if we ever need to. Models sit wherever we pointed the installer; the install wizard explicitly offers to import an existing source install’s models/, workflows/, and settings/ so a desktop install and a source install can sit side by side without duplicating multi-GB checkpoints. An extra_models_config.yaml lets us point either install at additional model directories anywhere on disk, so several ComfyUI instances can share one model tree and never duplicate a multi-GB checkpoint. That sharing stops at the ComfyUI boundary, though: Draw Things and Mochi Diffusion convert models to their own internal formats on import and keep the converted copies inside their app containers, so anything used in both Draw Things and ComfyUI is stored on disk twice — there is no shared directory bridging a safetensors-native client and a converting one.
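A shared model tree might be declared along these lines. This is a sketch that assumes the key layout of the example config shipped with upstream ComfyUI (a named section holding `base_path` plus one entry per model folder); the section name `shared`, the directory `/Volumes/models`, and the helper `write_extra_paths` are hypothetical, and the real file belongs wherever the install in question expects it:

```shell
# Write a minimal extra-model-paths config pointing at one shared tree.
write_extra_paths() {
    # $1 = root of the shared model tree, $2 = config file to write
    cat > "$2" <<EOF
shared:
    base_path: $1
    checkpoints: checkpoints/
    loras: loras/
    vae: vae/
    diffusion_models: diffusion_models/
EOF
}

cfg=$(mktemp)
write_extra_paths /Volumes/models "$cfg"
cat "$cfg"    # inspect, then copy into place for each install
```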
The Python source install suits us when we want hands-on control of the Python environment — to install experimental custom nodes from git, pin a particular torch build, or untangle a dependency clash by hand. The desktop app keeps that environment hidden and managed for us; the source install hands over the keys, at the price of having to manage it ourselves. If the words virtual environment and dependency resolver already sound like a threat, the Python packaging notebook is the place to start; the short version is that uv makes this about as painless as it gets on a fresh Mac:
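A minimal sketch of such a source install, assuming the upstream repository layout (`requirements.txt` and `main.py` at the root) and Python 3.12 as the pinned version; both are assumptions to check against the current README. The `run` wrapper is a hypothetical convenience that only prints the plan unless `DRY_RUN=0` is set:

```shell
# Print each command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = 0 ]; then "$@"; fi
}

install_comfyui() {
    run git clone https://github.com/comfyanonymous/ComfyUI.git &&
    run cd ComfyUI &&
    run uv venv --python 3.12 &&                 # creates .venv in the checkout
    run uv pip install -r requirements.txt &&    # uv targets ./.venv automatically
    run .venv/bin/python main.py                 # starts the server and node editor
}

install_comfyui
```

Running it as is shows the five steps; `DRY_RUN=0 sh install.sh` (with the block saved under that hypothetical name) performs them.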
If a custom-node pack refuses to build, dropping to the previous Python minor version usually clears it; stable PyTorch has solid Apple Silicon support now, so there’s no need to chase nightly builds.
Both paths support ComfyUI Manager for installing custom node packs from the UI. Install it once, restart, and the Manager button appears in the node editor.
3.2.2 Example one — Flux schnell via MLX, with a CivitAI LoRA
Say we want YFG Patents running on top of Flux schnell, accelerated through Apple MLX rather than PyTorch+MPS.
- Add the MLX node pack. Open Manager → search “ComfyUI-MLX” (thoddnn/ComfyUI-MLX) → install → restart. This pack uses Apple’s DiffusionKit underneath and tends to give ~30–70% wall-clock speedups on Flux compared to PyTorch+MPS. Mflux-ComfyUI is the older alternative; it still works, and is friendlier for people who’d rather not touch the terminal, but ComfyUI-MLX has more momentum as of 2026.
- Fetch the base model. MLX-flavoured Flux weights are hosted under `mlx-community` on HuggingFace, separately quantized for the framework — these are not the same files as the regular `safetensors` Flux releases. Drop them in `ComfyUI/models/diffusion_models/` (the MLX loader nodes know where to look).
- Fetch the LoRA. Download the YFG Patents `.safetensors` from CivitAI into `ComfyUI/models/loras/`. A LoRA file is small (often <300 MB) because it only stores low-rank deltas against the backbone’s weights.
- Wire the workflow. In the editor: `MLX Model Loader` → `MLX LoRA Loader` (point at the YFG file, strength ≈ 0.8) → `MLX Sampler` → `VAE Decode` → `Save Image`. Prompt the patent-figure idiom (`exploded view, numbered callouts, dashed hidden lines, technical patent figure`) and queue.
One quirk: with mflux, LoRA loading and on-the-fly weight quantization are mutually exclusive — if we want a quantized MLX model and a LoRA, we have to bake the LoRA into the weights ahead of time (see the Mflux-ComfyUI readme for the recipe). ComfyUI-MLX handles this more gracefully but is younger.
3.2.3 Example two — Flux dev via GGUF, on a 16 GB Mac
Flux dev at full precision is 23.8 GB in fp16, which does not fit a 16 GB machine. GGUF quantization is what makes it fit at all. DiT-backbone models (the diffusion-transformer architecture: Flux, SD3, Qwen-Image) compress well under GGUF; the older UNet-backbone models (SD 1.5, SDXL) do not benefit as much.
Add the GGUF node pack via Manager (search “ComfyUI-GGUF”, city96/ComfyUI-GGUF). The manual route works too:
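A sketch of that manual route, assuming a source install whose `.venv` was created by uv in the ComfyUI root and that the node pack still ships a `requirements.txt`; the default path `$HOME/ComfyUI` and the `run` dry-run wrapper are hypothetical:

```shell
# Print each command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = 0 ]; then "$@"; fi
}

install_gguf_nodes() {
    # $1 = ComfyUI checkout
    run cd "$1" &&
    run git clone https://github.com/city96/ComfyUI-GGUF custom_nodes/ComfyUI-GGUF &&
    run uv pip install -r custom_nodes/ComfyUI-GGUF/requirements.txt
}

install_gguf_nodes "${COMFYUI_DIR:-$HOME/ComfyUI}"
```

Restart ComfyUI afterwards so the new loader nodes register.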
Download a quantized checkpoint from city96’s HuggingFace repo. The trade-off is roughly: `Q4_K_S` (6.8 GB, noticeably softer outputs), `Q5_K_M` (8.3 GB, the usual sweet spot), `Q6_K` (9.8 GB, near-fp16 quality). Drop the file in `ComfyUI/models/unet/`. The T5 text encoder is a separate ~9 GB component and can be quantized independently with the matching CLIP loader nodes — we’ll often want a Q5 T5 alongside a Q5 DiT.
Wire the workflow. `Unet Loader (GGUF)` (category `bootleg`) → `DualCLIPLoader (GGUF)` → standard `KSampler` and `VAE Decode`. A regular CivitAI LoRA (e.g., YFG Patents again) loads on top through the normal `LoraLoader` node — no bake-in dance needed here, because we’re back in PyTorch+MPS land.
On a 32 GB+ machine GGUF is optional — full fp16 fits, and GGUF buys nothing there (its win is memory, not speed). Older PyTorch builds had an MPS buffer bug that broke GGUF on Apple Silicon; current stable resolves it, so no version pin is needed.
The first two examples contrast deliberately: MLX gives speed on Apple Silicon but limits which weights we can load and how LoRAs compose; GGUF gives memory headroom for bigger models on smaller machines at the cost of staying in the PyTorch lane. Sometimes we want both, sometimes neither.
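Both examples depend on each download landing in the folder its loader node scans. A small helper can encode the three destinations named in the steps above; `model_dir` and `place_model` are hypothetical names, and the kinds `mlx`, `gguf` and `lora` are labels chosen for this sketch:

```shell
# Map a download kind to its folder, relative to the ComfyUI root.
model_dir() {
    case "$1" in
        mlx)  echo "models/diffusion_models" ;;   # MLX-flavoured base weights
        gguf) echo "models/unet" ;;               # GGUF-quantized DiT
        lora) echo "models/loras" ;;              # CivitAI LoRA .safetensors
        *)    echo "unknown kind: $1" >&2; return 1 ;;
    esac
}

# Move a downloaded file (or directory) into the matching folder.
place_model() {
    # $1 = kind, $2 = downloaded path, $3 = ComfyUI root
    dest="$3/$(model_dir "$1")" || return 1
    mkdir -p "$dest" && mv "$2" "$dest/"
}

model_dir gguf
# Example: place_model lora ~/Downloads/yfg-patents.safetensors ~/ComfyUI
```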
3.2.4 Example three — emoji assets: generate, matte, and export
The niche-LoRA menu notes that no current model emits a clean cut-out icon with transparency: we generate on a flat background and remove it as a separate step. ComfyUI does the whole chain in one graph, which is what makes it the right tool for minting a set of emoji rather than a one-off.
- Add a background-removal node pack via Manager — 1038lab/ComfyUI-RMBG bundles RMBG-2.0, BiRefNet and SAM, and its GroundingDINO option can cut out a subject named by text.
- Generate on a flat field. `Load Checkpoint` (SDXL) → `LoraLoader` (fofr/sdxl-emoji, strength ≈ 0.8) → `KSampler` → `VAE Decode`. Prompt the subject, not a person: `an emoji of a teapot, centered, plain white background`.
- Matte. Feed the decoded image into the RMBG node; it returns the subject on transparency. Emoji cut out cleanly because the edges are crisp and the background is flat — the hard cases that defeat background removal do not arise here.
- Resize and export. `Image Resize` to the target (128 px for Slack or Discord custom emoji) → `Save Image`, which preserves the alpha channel as an RGBA PNG. Queue the prompt over a list of subjects to mint a whole pack at once.
For exotic targets — multi-size icon sets, WebP, SVG, .icns — hand the matted PNGs to vips or ImageMagick as a final non-diffusion step.
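As one instance of that final step, here is a sketch that turns a matted PNG into a macOS `.iconset` with ImageMagick, ready for `iconutil`. The file names follow Apple's iconset naming convention; `iconset_plan` and `make_iconset` are hypothetical helpers, and the resize call assumes ImageMagick 7's `magick` command is on the PATH:

```shell
# List "pixels filename" pairs for a macOS .iconset (1x and @2x variants).
iconset_plan() {
    for base in 16 32 128 256 512; do
        echo "$base icon_${base}x${base}.png"
        echo "$((base * 2)) icon_${base}x${base}@2x.png"
    done
}

# Resize one RGBA PNG into every size the plan lists.
make_iconset() {
    # $1 = matted source PNG, $2 = output directory ending in .iconset
    mkdir -p "$2" || return 1
    iconset_plan | while read -r px name; do
        magick "$1" -resize "${px}x${px}" "$2/$name" </dev/null || exit 1
    done
}

iconset_plan
# Example: make_iconset teapot.png teapot.iconset && iconutil -c icns teapot.iconset
```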
Backbones: SD 1.5, SDXL, SD3/3.5, Flux dev/schnell/2, FLUX.1 Kontext, Qwen-Image, Qwen-Image-Edit, Wan 2.1/2.2, HiDream E1.1, LTX-Video, Hunyuan Video/3D, Omnigen 2, ACE Step audio. LoRA, ControlNet, inpainting, img2img, regional prompting all native.
Affordances: training is out of scope (we might use kohya for that?).
Failure mode for unknown CivitAI checkpoints: usually “find the right node pack in the manager”, and they almost always load eventually.
3.3 InvokeAI
A polished web UI with a canvas-and-layers studio feel — closer to a Photoshop workflow than to a programming environment. The canvas-first pick when we want layers and inpainting brushes rather than node spaghetti. PyTorch+MPS underneath; no MLX path here.
3.3.1 Walkthrough — emoji-style edits on a portrait
A workflow that exercises what InvokeAI is for: pull a portrait onto the canvas, mask a region, and apply fofr/sdxl-emoji (from the niche-LoRA menu) to that region only. This is the kind of selective re-style that is awkward in Draw Things and tedious to wire by hand in ComfyUI.
- Install via the official launcher. The launcher is a small Electron app that manages the underlying Python install for us — it provisions a venv, pulls the right PyTorch wheel for Apple Silicon (MPS), and tracks updates separately from the app itself. Source install works too (`git clone` then `uv pip install -e .`), but the launcher is what the team supports and where the model importer is.
- Pull a base model through the launcher’s built-in Model Manager. SDXL base is the sensible match for the fofr LoRA, since the LoRA was trained on SDXL. An SDXL LoRA cannot be stacked on a Flux base; the rank-decomposed weights only make sense against the backbone they were trained on. The Model Manager downloads HuggingFace `diffusers`-format weights directly; CivitAI `safetensors` import works through URL paste.
- Pull the LoRA. Either drop fofr’s `pytorch_lora_weights.safetensors` into `~/invokeai/models/sdxl/lora/`, or paste the HuggingFace URL into the Model Manager. InvokeAI identifies the architecture from the file’s metadata and rejects it cleanly if it doesn’t match the loaded base.
- Use the canvas. Open the Unified Canvas tab, drag a portrait onto it, draw a mask around the face, and add the LoRA to the prompt stack (strength ~0.8 to start). The inpaint sampler respects the mask boundary while letting the LoRA drive style inside it — the canvas affordance Draw Things lacks.
The same pattern handles outpainting (extend canvas, mask the new area, generate), Kontext-based instruction edits (make this a watercolour), and SAM/SAM2-driven auto-segmentation (Segment Anything, Meta’s click-to-select segmentation model) when we can’t be bothered to draw the mask ourselves.
Backbones: SD 1.5/2.0, SDXL, SD 3.5 Medium/Large, Flux dev/schnell/Kontext/Krea/Redux/Fill, Flux 2 Klein 4B/9B, Qwen Image, Qwen Image Edit, CogView 4, Z-Image. GGUF, ckpt, and diffusers formats all loadable.
Affordances: LoRA + embeddings, SAM/SAM2 segmentation, unified canvas inpainting/outpainting. No video, no training.
Failure mode: the model picker rejects an unknown architecture cleanly rather than crashing — we know immediately whether a CivitAI download will work, which is more than ComfyUI offers when a custom node pack is missing.
3.4 Mochi Diffusion
A native SwiftUI app on Apple’s CoreML framework, running on the Neural Engine with the lowest RAM footprint on this list (~150 MB working set). The pick when RAM is at a premium and we are willing to live with a small set of pre-converted models.
3.4.1 Walkthrough — running a pre-converted SDXL bundle
The supported path is to take someone else’s CoreML conversion off the shelf. Self-conversion is its own adventure, covered in CoreML conversion in practice below; we won’t need it for this walkthrough.
- Install the app by downloading the latest `.dmg` from the GitHub releases page and dragging it to Applications. The binary is unsigned outside the App Store route, so the first launch requires a right-click → Open to get past Gatekeeper.
- Grab a pre-converted bundle. The CoreML format is not interchangeable with `safetensors`; a CivitAI checkpoint simply won’t load. The community curates pre-converted bundles at `coreml-community` on HuggingFace — ~145 of them covering SD 1.5, SD 2.1, SDXL and its art-style forks (Animagine XL, AnythingXL, CounterfeitXL), plus ControlNets in both attention modes. For an official SDXL base, `apple/coreml-stable-diffusion-xl-base` is the canonical source.
- Pick the right attention mode. Each bundle ships in one of two flavours: `ORIGINAL` targets CPU+GPU (Metal) and is the right choice for Macs; `SPLIT_EINSUM` targets the Neural Engine and only makes sense on iPhone/iPad or for very small models. For SDXL on Mac, always pick `ORIGINAL` — the SDXL UNet is too large for the Neural Engine (ANE) to hold the weights anyway.
- Install the bundle by copying the bundle directory (the one containing `Resources/`) into Mochi’s models folder — by default `~/Documents/MochiDiffusion/models/`.
- Generate. The bundle now appears in Mochi’s model dropdown. Inference is fast and the RAM footprint stays tiny; we can run it alongside a heavier ComfyUI workflow without pushing the machine into swap.
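The copy step is worth guarding, since pointing Mochi at the wrong directory level is an easy mistake. A sketch, assuming only what the steps above state (a bundle is a directory containing `Resources/`, and the default models folder is `~/Documents/MochiDiffusion/models/`); `install_bundle` is a hypothetical helper:

```shell
# Copy a Core ML bundle into Mochi's models folder after a sanity check.
install_bundle() {
    # $1 = bundle directory, $2 = Mochi models folder
    src=${1%/}    # drop a trailing slash so cp copies the directory itself
    if [ ! -d "$src/Resources" ]; then
        echo "not a Core ML bundle (no Resources/): $src" >&2
        return 1
    fi
    mkdir -p "$2" && cp -R "$src" "$2/"
}

# Example:
# install_bundle ~/Downloads/sdxl_original ~/Documents/MochiDiffusion/models
```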
The reasons not to pick this client for general work are stacked up against it: no LoRA loading, no native inpainting, no Flux dev/schnell, no Kontext, no Qwen, no SD3, no video, no direct CivitAI import. We come here when RAM headroom matters more than any of those. Self-conversion of fresh CivitAI checkpoints is doable for SDXL fine-tunes but rough; for anything Flux-class or newer, the converter doesn’t know the architecture and we should use Draw Things instead.
Backbones: SD 1.5, SD 2.x, SDXL, plus Flux 2 Klein (4B/9B distilled, pre-built bundles available).
Affordances: ControlNet, RealESRGAN upscaling, EXIF metadata preservation in generated PNGs.
4 CoreML conversion in practice
A practical aside, since Mochi Diffusion is the one client on this page that depends on a separate offline model-conversion step.
Don’t convert if you do not have to. Grab a pre-converted bundle from coreml-community (~145 community bundles covering SD 1.5, SD 2.1, SDXL forks like Animagine XL / AnythingXL / CounterfeitXL, ControlNets in both attention modes) or apple/coreml-stable-diffusion-xl-base for the official SDXL. That covers most things we want. Self-conversion is a chore, worth it only when we have a fresh CivitAI checkpoint nobody has packaged yet and want to earn some internet points for being the one to do the work.
The conversion pipeline is in apple/ml-stable-diffusion. The repo is in maintenance, not development — last tagged release 1.1.1 was May 2024, last Apple commit late 2024, with one community PR landing July 2025. Supported architectures: SD 1.4/1.5, SD 2.0/2.1, SDXL (+refiner), SD3-medium, ControlNet (SD1.x only). The post-SDXL generation — Flux, SD 3.5, Wan, Qwen-Image, Lumina, Sana — has no Apple-blessed converter path and no community fork that covers it. For those, use Draw Things (its custom Swift+Metal stack handles them natively) or any PyTorch+MPS client.
4.1 The pipeline
The Mochi Diffusion repo vendors a wrapper around Apple’s converter. That’s the path the wiki documents and the only one that “just works”:
```shell
git clone https://github.com/MochiDiffusion/MochiDiffusion.git
cd MochiDiffusion/conversion
uv venv && source .venv/bin/activate
./download-script.sh
export MODEL_NAME=mySDXL
mv ~/Downloads/${MODEL_NAME}.safetensors .
# Phase 1: safetensors → diffusers (~1 min)
uv run python convert_original_stable_diffusion_to_diffusers.py \
--checkpoint_path ${MODEL_NAME}.safetensors --from_safetensors \
--pipeline_class_name StableDiffusionXLPipeline \
--device cpu --extract_ema --dump_path ${MODEL_NAME}_diffusers
# Phase 2: diffusers → Core ML (~20–40 min, ~16 GB RAM)
uv run python -m python_coreml_stable_diffusion.torch2coreml \
--xl-version --compute-unit CPU_AND_GPU \
--convert-vae-decoder --convert-vae-encoder \
--convert-unet --convert-text-encoder \
--model-version ${MODEL_NAME}_diffusers \
--bundle-resources-for-swift-cli \
--attention-implementation ORIGINAL \
-o ${MODEL_NAME}_original
# Mochi loads from: ${MODEL_NAME}_original/Resources/
```
4.2 Attention-mode gotcha
- `ORIGINAL` targets CPU + GPU (Metal). Pick this for Macs — best speed/quality on the M-series GPU. Pair with `--compute-unit CPU_AND_GPU`.
- `SPLIT_EINSUM` targets the Neural Engine. Faster on iPhone/iPad and on entry M-series for SD 1.5; for SDXL on a Mac it’s slower because the model is too big for the ANE to hold the weights anyway.
- `SPLIT_EINSUM_V2` is a 10–30% mobile speedup; Apple’s own README warns SDXL compile times are “prohibitively long”. Skip on Mac.
Convert both modes only if we’re shipping the model to iPad too.
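Which modes to convert follows directly from the list of devices we ship to. A simplified sketch of that rule: `mode_for` and `modes_for` are hypothetical helpers, and the mapping ignores the entry-M-series SD 1.5 case and `SPLIT_EINSUM_V2`:

```shell
# Attention mode for a single target device.
mode_for() {
    case "$1" in
        mac)         echo ORIGINAL ;;
        iphone|ipad) echo SPLIT_EINSUM ;;
        *)           echo "unknown target: $1" >&2; return 1 ;;
    esac
}

# Distinct modes needed to cover every target given as an argument.
modes_for() {
    for target in "$@"; do
        mode_for "$target"
    done | sort -u
}

modes_for mac
modes_for mac ipad
```

Each mode printed is one full run of the conversion's second phase, so a Mac-only target halves the work.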
4.3 What breaks on a fresh CivitAI checkpoint
- SDXL fine-tune — works. Plan a quiet hour; expect one OOM retry on 16 GB; expect to add `--half` if it fails on shard size. (diffusers ≥ 0.29 default-shards the SDXL UNet at 10 GB, which breaks the next stage — bump `max_shard_size` to 15 GB on line ~188 of the script. See Mochi issue #261.)
- Flux / SD 3.5 / Wan / Qwen / anything else recent — don’t try. The converter doesn’t know those architectures. Use Draw Things instead.
5 The AUTOMATIC1111 family
The original AUTOMATIC1111 WebUI — feature-rich Gradio interface — was the canonical Stable Diffusion UI for years. Its main branch is effectively abandoned: last release v1.10.1 from Feb 2025, no support for Flux, SD3, Kontext, Qwen, or anything more recent than SDXL. Active development moved to a thicket of forks:
- Forge (lllyasviel) — the original fork. Sporadic.
- reForge (Panchovix) — stability-focused fork of Forge.
- Forge Neo / Forge Classic (Haoming02) — current best A1111-style pick: supports Flux, Qwen, Wan 2.2.
- SDNext (vladmandic) — actively maintained heavy refactor.
If we want the WebUI feel with current models, maybe try Forge Neo or SDNext? Let me know how it goes if you do. Vanilla A1111 is now of historical interest only. All of these run on PyTorch+MPS via the well-trodden install path, though the official MPS notes only cover the abandoned main branch.
If we use the Hugging Face tooling, building a local UI is easy; it integrates easily with gradio. See also nitrosocke/diffusers-webui.
DiffusionBee belongs in the same graveyard — the original drag-and-drop Mac client, native Apple Silicon, zero terminal. Last GitHub release was v2.5.3 in August 2024; superseded on every axis by Draw Things (same install ease, broader model coverage, actively maintained). Worth recognising in old blog posts and Reddit threads; not worth installing today.
6 Pipeline-chaining clients
6.1 ChaiNNer
A node-based GUI specifically for non-diffusion image-processing pipelines: Real-ESRGAN → GFPGAN → format conversion → batch over a folder. Electron app + Python sidecar with PyTorch+MPS; works on Apple Silicon. Maintained.
Pareto dominates when: we want repeatable upscale/restore pipelines without writing a script. Complementary to ComfyUI rather than competitive — if the task is diffusion, this is the wrong tool.
Affordances: Spandrel-supported PyTorch arches (super-resolution, face restore, denoise, JPEG-deartefact, dehaze, low-light, colourisation), plus NCNN/ONNX/TensorRT model files. Loads .pth, .pt, .safetensors, some .ckpt.
Source: chaiNNer-org/chaiNNer.
6.2 Spandrel
ChaiNNer’s Python core, also usable on its own. Spandrel on PyPI. Pick when we want ChaiNNer’s arch coverage from a script rather than a node graph.
Catalogue: ESRGAN family, SwinIR, HAT, Real-ESRGAN, AuraSR plus ~20 other super-resolution arches; GFPGAN/CodeFormer/RestoreFormer face restorers; LaMa/MAT inpainting; NAFNet/SCUNet/Restormer denoisers; DDColor; DeJPEG. No diffusion, no Flux, no Qwen.
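For the script route, here is a minimal sketch of a batch upscale over a folder. It assumes `spandrel`, `torch`, `numpy` and `Pillow` are installed and that the model file is a single-image architecture such as an upscaler. `pick_device` and `upscale_folder` are names I made up for illustration; `ModelLoader().load_from_file` and `ImageModelDescriptor` are the Spandrel entry points as I understand them from its README.

```python
from pathlib import Path


def pick_device(mps_available: bool, cuda_available: bool) -> str:
    """Prefer Apple's Metal backend, then CUDA, then plain CPU."""
    if mps_available:
        return "mps"
    if cuda_available:
        return "cuda"
    return "cpu"


def upscale_folder(model_path: str, src_dir: str, dst_dir: str) -> int:
    """Run one Spandrel-loaded model over every PNG in src_dir; return the count."""
    # Third-party imports live here so pick_device stays usable without them.
    import numpy as np
    import torch
    from PIL import Image
    from spandrel import ImageModelDescriptor, ModelLoader

    device = pick_device(torch.backends.mps.is_available(), torch.cuda.is_available())
    model = ModelLoader().load_from_file(model_path)
    if not isinstance(model, ImageModelDescriptor):
        raise TypeError("expected a single-image model such as an upscaler")
    model.to(device).eval()

    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    for png in sorted(Path(src_dir).glob("*.png")):
        rgb = np.asarray(Image.open(png).convert("RGB"), dtype=np.float32) / 255.0
        # HWC floats in [0, 1] -> BCHW tensor on the chosen device
        x = torch.from_numpy(rgb).permute(2, 0, 1).unsqueeze(0).to(device)
        with torch.no_grad():
            y = model(x)
        # BCHW -> HWC bytes for saving
        arr = (y.squeeze(0).permute(1, 2, 0).clamp(0, 1).cpu().numpy() * 255).round()
        Image.fromarray(arr.astype(np.uint8)).save(out / png.name)
        count += 1
    return count
```

Something like `upscale_folder("4x-model.safetensors", "in", "out")` would then stand in for a one-node ChaiNNer graph; the file name is a placeholder. Large images may need tiling on a 16 GB machine, which this sketch does not attempt.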
7 Hosted-only services
When we don’t want to run anything locally — trade convenience for privacy and control.
7.1 Runway
Runway pivoted to video around 2024 and the image-generation product line is no longer a marketed offering. Current focus: Gen-4.5 video, GWM-1 world models, real-time video agents (Runway Characters). The Photoshop and Blender plugins still exist.
Pareto dominates when: we are doing video and world-model work; for image-only workflows there are cheaper and more focused options elsewhere.
7.2 Midjourney
Midjourney — currently spanning V7 and the V8 generation (V8 shipped 2026: claimed ~5× faster than V7, native 2K output, image and video). Browser/Discord interface; AFAICT still no public API. $10–$120/month subscription.
Pareto dominates when: we want the prompt-craft aesthetic frontier. It is addictive in that we can get better at it, which feels like mastering a real skill. Unsuitable for programmatic pipelines because there’s no API.
8 Where models sit on disk
Model weights are the bulk of what these clients write to disk — tens to hundreds of GB — and all of it is re-downloadable. These directories matter for two chores: reclaiming space, and excluding them from Time Machine, since there is no point backing up a checkpoint that can be pulled again from Hugging Face.
Default locations on macOS:
| Client | Default model directory |
|---|---|
| Draw Things | ~/Library/Containers/com.liuliu.draw-things/Data/Documents/Models (overridable via DRAWTHINGS_MODELS_DIR) |
| InvokeAI | ~/invokeai/models |
| Mochi Diffusion | ~/Documents/MochiDiffusion/models |
| ComfyUI (source) | <clone>/models — wherever it was cloned |
| ComfyUI Desktop | a basePath chosen at install — read it from ~/Library/Application Support/ComfyUI/config.json |
| Forge Neo / SDNext / reForge | <clone>/models |
| ChaiNNer / Spandrel | no central directory; they reference model files in place |
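
ComfyUI Desktop is the one row we have to look up rather than guess. A small sketch of reading it, under two assumptions of mine: that `basePath` is a top-level key in that config.json, and that model files sit in a `models` folder beneath it, as they do in a source checkout. The function name is hypothetical.

```python
import json
from pathlib import Path
from typing import Optional

DEFAULT_CONFIG = Path.home() / "Library/Application Support/ComfyUI/config.json"


def comfyui_desktop_models_dir(config_path: Path = DEFAULT_CONFIG) -> Optional[Path]:
    """Return <basePath>/models from the Desktop config, or None if unavailable."""
    try:
        config = json.loads(Path(config_path).read_text())
    except (OSError, json.JSONDecodeError):
        return None  # Desktop not installed, or config unreadable
    if not isinstance(config, dict):
        return None
    base = config.get("basePath")  # assumption: top-level key
    return Path(base) / "models" if base else None
```

If this returns None, the app is probably not installed or the config has a different shape than assumed; open the file by hand in that case.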
The directory that hides the most is not a client at all: the shared Hugging Face cache at ~/.cache/huggingface, where diffusers and huggingface-cli downloads accumulate across every tool that uses them. PyTorch’s own cache (~/.cache/torch) holds auxiliary models such as upscalers.
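To see which of these directories is actually eating the disk, a stdlib-only sketch follows. The directory list is copied from the defaults above and will need editing for custom locations; the function names are mine. Symlinks are skipped on purpose, on the understanding that the Hugging Face cache links snapshot files to shared blobs and would otherwise be counted twice.

```python
import os
from pathlib import Path

# Default locations from the text above; edit to match a custom setup.
CANDIDATE_DIRS = [
    "~/.cache/huggingface",
    "~/.cache/torch",
    "~/invokeai/models",
    "~/Documents/MochiDiffusion/models",
    "~/Library/Containers/com.liuliu.draw-things/Data/Documents/Models",
]


def dir_size_bytes(root: Path) -> int:
    """Sum apparent file sizes under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            if path.is_symlink():
                continue
            try:
                total += path.stat().st_size
            except OSError:
                pass  # file vanished or is unreadable; ignore it
    return total


def usage_report(dirs=CANDIDATE_DIRS) -> list[tuple[str, float]]:
    """Return (directory, size in GB) for each existing directory, largest first."""
    rows = []
    for entry in dirs:
        root = Path(entry).expanduser()
        if root.is_dir():
            rows.append((str(root), dir_size_bytes(root) / 1e9))
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Printing `usage_report()` gives a quick ranking; directories that do not exist are silently left out.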
To exclude a directory from Time Machine:
sudo tmutil addexclusion -p ~/.cache/huggingface
sudo tmutil addexclusion -p ~/invokeai/models
sudo tmutil addexclusion -p ~/Documents/MochiDiffusion/models
# Draw Things default container (skip if you set a custom dir and excluded that instead):
sudo tmutil addexclusion -p ~/Library/Containers/com.liuliu.draw-things/Data/Documents/Models
# confirm an exclusion took:
tmutil isexcluded ~/.cache/huggingface
The path must exist before addexclusion will accept it. With -p, tmutil records a fixed-path exclusion keyed to the path string (right for fixed locations); the form without -p instead sets what tmutil calls a sticky exclusion, tagging the folder itself so the exclusion rides along if it’s later moved.
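Because addexclusion wants the path to exist, it can be handy to filter the list first. A minimal sketch that builds the same `tmutil addexclusion -p` commands only for directories that are present, and prints them unless told to run them. The helper names are hypothetical.

```python
import subprocess
from pathlib import Path


def exclusion_commands(paths) -> list[list[str]]:
    """Build `sudo tmutil addexclusion -p <dir>` argv lists for existing dirs."""
    commands = []
    for entry in paths:
        path = Path(entry).expanduser()
        if path.is_dir():  # skip clients that are not installed
            commands.append(["sudo", "tmutil", "addexclusion", "-p", str(path)])
    return commands


def apply_exclusions(paths, dry_run: bool = True) -> None:
    """Print the commands, or run them when dry_run is False (macOS only)."""
    for command in exclusion_commands(paths):
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)
```

Calling `apply_exclusions(["~/.cache/huggingface", "~/invokeai/models"])` prints what would run; pass `dry_run=False` once the list looks right. Paths containing spaces print unquoted in the dry run, so copy-pasting those lines needs care.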
