Editing images with AI
2018-10-16 — 2026-05-05
Wherein a Distinction Is Drawn Between Instruction-Following Editors and Single-Task ML Tools, with Attention to Specialised One-Trick Startups Now Rendered Redundant by General-Purpose Models.
This page is about editing an existing image with ML — as opposed to generating one from a prompt. For text-to-image generation, see generative art with AI models; for the pre-diffusion neural-art lineage (DeepDream, GANs, CPPNs and friends), see the historical record. For the front-end software that runs these editing models locally — ComfyUI, InvokeAI, Draw Things, ChaiNNer and friends — see front-end clients for AI image models. For the drama behind FLUX.1 Kontext (the Stability AI → Black Forest Labs splinter), see AI democratization. For non-ML editing, see GUIs and image editing automation.
This list will rot. The state of the art in image-editing models moves in months, sometimes weeks. Hours if you are on the right Discord servers. For something current, the trending tab on Hugging Face, the news threads at r/StableDiffusion, and CivitAI are better than this notebook.
There was, around 2019–2023, a brief efflorescence of small ML companies doing one trick each — sharpen, upscale, remove a background, restore a face. Most have been bought up by Adobe, abandoned, or made redundant by general-purpose models. We will not eulogise them here. The survivors are below.
1 Instruction-following editors
One interesting category is image plus a text instruction in → edited image out. “Remove the lamppost.” “Make this a watercolour.” “Extend the canvas to 16:9 and fill the new space with sky.” The same pipeline that runs text-to-image now runs sideways, conditioned on the input image.
There are three flavours I care about:
- Open-weights edit models that we can run on our own GPU or via API. Most flexible, slowest to set up.
- Closed hosted edit endpoints from the big labs. Fast, cheap-per-call, opaque. (Call shape sketched after this list.)
- Frontier multimodal LLMs with image editing baked in. The chatbot we use for text now also edits images, often well, with the conversation history as implicit context.
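The second flavour is the easiest to show. A minimal sketch against OpenAI’s /images/edits endpoint, assuming the openai Python SDK and the gpt-image-1 model name (model names churn; and as noted below, I would not build a pipeline on this endpoint right now):

```python
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# One image in, one instruction, one image out.
result = client.images.edit(
    model="gpt-image-1",            # current as of this writing; check before relying on it
    image=open("photo.png", "rb"),
    prompt="Remove the lamppost.",
)

# gpt-image-1 returns base64 rather than a URL.
with open("edited.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```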
Durable lineages:
- FLUX.1 Kontext (Black Forest Labs, the post-Stability splinter that built the Flux family). Open-weights “dev” version (non-commercial), paid “pro” / “max” via API. The first open-weights edit model that doesn’t feel like a toy (Labs et al. 2025) — runs in ComfyUI / InvokeAI alongside generation pipelines. Mac: 12B parameters, so on a 16GB machine we want a Q4 GGUF (~7GB); 32GB Macs handle Q8 (~13GB); FP16 (~24GB) is tight even on 32GB. The native MLX port mflux is fast. The pick when scriptability or licensing matters more than convenience (see the sketch after this list); for one-off edits the closed APIs are less faff.
- Qwen-Image-Edit / Qwen-Image-Edit-2509 (Alibaba). Open-weights, strong on text-in-image edits and on Chinese-language prompts. The 2509 update is the one to grab as of this writing (Wu et al. 2025). Mac: ~20B-class; MPS works via qwen-image-mps, and a 4-step Lightning/Rapid variant exists for slow hardware. Realistically wants 32GB unified memory; on 16GB it swaps. The pick over Kontext when we specifically need text rendering or non-Latin scripts.
- Gemini 2.5 Flash Image (“Nano Banana”) and its 2026 successor Gemini 3.1 Flash Image (“Nano Banana 2”, available via Vertex AI). Cheap, fast, notionally conservative about identity preservation — faces don’t drift much across edits, which is great for not terrifying our social brains. Nano Banana 2 (4K out, better text rendering) is a different beast and is probably the one to beat.
- GPT-Image-1 and its successor GPT-Image-2 (OpenAI). Edit-mode endpoint of the same model that powers ChatGPT image generation. The chat-native pick when we already live in ChatGPT; the /images/edits API has been flaky in 2026 for programmatic use, so I would not build a pipeline around it.
- Adobe Generative Fill inside Photoshop, on Adobe’s Firefly model. Most-deployed by a wide margin, because Photoshop. As of 2026, output bumped to 2K, Firefly Image 5 and Fill & Expand replace the retired Image 3, and the new partner-model picker puts FLUX and Gemini in the same dropdown as Firefly.
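The open-weights route, by contrast, is a local script. A minimal sketch using the FluxKontextPipeline in recent diffusers with the gated dev checkpoint, assuming we have accepted the licence on Hugging Face and have the memory for bf16 (on a 16GB Mac, reach for the Q4 GGUF in ComfyUI, or mflux, instead):

```python
import torch
from diffusers import FluxKontextPipeline
from diffusers.utils import load_image

# FLUX.1 Kontext [dev]: gated on Hugging Face, non-commercial licence.
pipe = FluxKontextPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Kontext-dev", torch_dtype=torch.bfloat16
)
pipe.to("mps")  # or "cuda"; the bf16 weights alone are ~24GB, so this wants a big machine

image = load_image("photo.png")
edited = pipe(
    image=image,
    prompt="Remove the lamppost.",
    guidance_scale=2.5,  # the value the diffusers example uses for Kontext edits
).images[0]
edited.save("edited.png")
```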
2 Single-task tools
Even with instruction-following editors, a specialised tool is often cheaper, faster, or more predictable.
2.1 Background removal
- remove.bg — hosted, fast, the brand-name option. No longer the cheapest — PhotoRoom and API4AI undercut it for bulk work.
- Clipping Magic — hosted, marginally more configurable, with a nice user interface. The pick when we have a hard cutout (hair, fur, glass) and want manual scalpel/refinement tools.
- BiRefNet — open-weights, scriptable, near-SOTA quality (Zheng et al. 2024). PyTorch + MPS works with no extra setup; ~3.5GB at 1024², trivial on a 16GB Mac. See the sketch after this list.
- RMBG-2.0 (Bria) — a BiRefNet derivative with proprietary training; current SOTA on difficult backgrounds per Bria’s own benchmark. Same Mac install footprint as BiRefNet. The licence is Bria RAIL-M — research and personal use only. The HF download page makes it look open; it isn’t.
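BiRefNet’s scriptability is the point, so here is the shape of it. A minimal sketch following the Hugging Face model card (the model id and the 1024×1024 ImageNet-normalised preprocessing are as documented there; the MPS fallback is my addition):

```python
import torch
from PIL import Image
from torchvision import transforms
from transformers import AutoModelForImageSegmentation

model = AutoModelForImageSegmentation.from_pretrained(
    "ZhengPeng7/BiRefNet", trust_remote_code=True
)
device = "mps" if torch.backends.mps.is_available() else "cpu"
model.to(device).eval()

# Standard ImageNet normalisation at the model's 1024x1024 working resolution.
preprocess = transforms.Compose([
    transforms.Resize((1024, 1024)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

img = Image.open("photo.jpg").convert("RGB")
x = preprocess(img).unsqueeze(0).to(device)
with torch.no_grad():
    # The model returns a list of maps; the last one is the final matte.
    mask = model(x)[-1].sigmoid().cpu()[0].squeeze()

alpha = transforms.ToPILImage()(mask).resize(img.size)
img.putalpha(alpha)
img.save("cutout.png")  # RGBA, background removed
```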
2.2 Upscaling and super-resolution
- Topaz Gigapixel — paid, the polished commercial option for photographic upscaling, especially faces and skin. Native Apple Silicon, just works.
- Upscayl — free, open-source, drag-and-drop GUI wrapping the Real-ESRGAN family (Wang, Xie, et al. 2021) and community variants like 4x-UltraSharp; the broader catalogue lives at OpenModelDB. Real-ESRGAN’s upstream repo has been dormant for ~2 years, but the weights live on through Upscayl, ChaiNNer and ComfyUI; that’s how most people now use them. Upscayl runs Vulkan via MoltenVK on Apple Silicon — fast enough for occasional use, less efficient than a Metal-native tool. For scripted use of the same weights, see the sketch below.
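If we want those Real-ESRGAN-family weights in a script rather than a GUI, spandrel (the model-loading library extracted from ChaiNNer) will load most checkpoints from OpenModelDB. A sketch, assuming a 4x checkpoint already on disk:

```python
import torch
from PIL import Image
from torchvision.transforms.functional import to_tensor, to_pil_image
from spandrel import ImageModelDescriptor, ModelLoader

# Any Real-ESRGAN-family checkpoint, e.g. 4x-UltraSharp.pth from OpenModelDB.
model = ModelLoader().load_from_file("4x-UltraSharp.pth")
assert isinstance(model, ImageModelDescriptor)

device = "mps" if torch.backends.mps.is_available() else "cpu"
model.to(device).eval()

# spandrel models take and return NCHW float tensors in [0, 1].
x = to_tensor(Image.open("small.png").convert("RGB")).unsqueeze(0).to(device)
with torch.no_grad():
    y = model(x).clamp(0, 1)  # 4x the input resolution for a 4x model
to_pil_image(y.squeeze(0).cpu()).save("big.png")
```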
2.3 Object removal and inpainting
- cleanup.pictures — quick browser tool for removing people, text, and small defects from a single image. Increasingly redundant against frontier multimodal editors and against Photoshop, but handy for one-offs. Local/scriptable counterpart: IOPaint (formerly lama-cleaner), which wraps the same LaMa backbone and runs on MPS.
- For heavier inpainting workflows — mask + prompt, control over what fills the hole, regional generation — the local diffusion clients (ComfyUI, InvokeAI, Draw Things) all do this on top of Stable Diffusion / Flux / SDXL backbones.
2.4 Face restoration
GFPGAN (Wang, Li, et al. 2021) and CodeFormer (Zhou et al. 2022) are the standard open-weights face restorers, both in long-term maintenance and available as ChaiNNer nodes. For most cases now we let Kontext or Nano Banana fix the face as a side effect of any other edit.
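When we do need to script it, GFPGAN’s Python entry point is small. A sketch following the repo’s own inference script (the weights path is whatever we downloaded from the releases page; the maintenance-mode basicsr dependency can need version pinning):

```python
import cv2
from gfpgan import GFPGANer

restorer = GFPGANer(
    model_path="GFPGANv1.4.pth",  # from the GFPGAN releases page
    upscale=2,
    arch="clean",
    channel_multiplier=2,
    bg_upsampler=None,  # optionally a Real-ESRGAN upsampler for the non-face pixels
)

img = cv2.imread("old_photo.jpg", cv2.IMREAD_COLOR)
# Returns cropped faces, restored faces, and the full pasted-back image.
_, _, restored = restorer.enhance(
    img, has_aligned=False, only_center_face=False, paste_back=True
)
cv2.imwrite("restored.jpg", restored)
```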
2.5 Document cleanup, old-skool
ScanTailor Advanced does content-aware cropping, dewarping, and background removal of scanned documents. Strictly classical computer vision — no generative model — but the use case is alive and nothing generative replaces it for batch book-scan work. The actively-maintained ARM64 fork is the vigri branch (Qt6, 3D dewarp); the older yb85/scantailor-advanced-osx bundle still works for the legacy lineage.

