Editing images with AI
2018-10-16 — 2026-05-05
Wherein a Distinction Is Drawn Between Instruction-Following Editors and Single-Task ML Tools, with Attention to Specialised One-Trick Startups Now Rendered Redundant by General-Purpose Models.
This page is about editing an existing image with ML — as opposed to generating one from a prompt. For text-to-image generation, see generative art with AI models; for the pre-diffusion neural-art lineage (DeepDream, GANs, CPPNs and friends), see the historical record. For the front-end software that runs these editing models locally — ComfyUI, InvokeAI, Draw Things, ChaiNNer and friends — see front-end clients for AI image models. For the drama behind FLUX.1 Kontext (the Stability AI → Black Forest Labs splinter), see AI democratization. For non-ML editing, see GUIs and image editing automation.
This list will rot. The state of the art in image-editing models moves in months, sometimes weeks. Hours if you are on the right Discord servers. For something current, the trending tab on Hugging Face, the news threads at r/StableDiffusion, and CivitAI are better than this notebook.
There was, around 2019–2023, a brief efflorescence of small ML companies doing one trick each — sharpen, upscale, remove a background, restore a face. Most have been bought up by Adobe, abandoned, or made redundant by general-purpose models. We will not eulogise them here. The survivors are below.
1 Instruction-following editors
One interesting category is image plus a text instruction in → edited image out. “Remove the lamppost.” “Make this a watercolour.” “Extend the canvas to 16:9 and fill the new space with sky.” The same pipeline that runs text-to-image now runs sideways, conditioned on the input image.
There are three flavours I care about:
- Open-weights edit models that we can run on our own GPU or via API. Most flexible, slowest to set up.
- Closed hosted edit endpoints from the big labs. Fast, cheap-per-call, opaque. (Call shape sketched after this list.)
- Frontier multimodal LLMs with image editing baked in. The chatbot we use for text now also edits images, often well, with the conversation history as implicit context.
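The second flavour is the easiest to show. A minimal sketch against OpenAI’s /images/edits endpoint, assuming the openai Python SDK and the gpt-image-1 model name (model names churn; and as noted below, I would not build a pipeline on this endpoint right now):

```python
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# One image in, one instruction, one image out.
result = client.images.edit(
    model="gpt-image-1",            # current as of this writing; check before relying on it
    image=open("photo.png", "rb"),
    prompt="Remove the lamppost.",
)

# gpt-image-1 returns base64 rather than a URL.
with open("edited.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```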
Durable lineages:
- FLUX.1 Kontext (Black Forest Labs, the post-Stability splinter that built the Flux family). Open-weights “dev” version (non-commercial), paid “pro” / “max” via API. The first open-weights edit model that doesn’t feel like a toy (Labs et al. 2025) — runs in ComfyUI / InvokeAI alongside generation pipelines. Mac: 12B parameters, so on a 16GB machine we want a Q4 GGUF (~7GB); 32GB Macs handle Q8 (~13GB); FP16 (~24GB) is tight even on 32GB. The native MLX port mflux is fast. The pick when scriptability or licensing matters more than convenience (see the sketch after this list); for one-off edits the closed APIs are less faff.
- Qwen-Image-Edit / Qwen-Image-Edit-2509 (Alibaba). Open-weights, strong on text-in-image edits and on Chinese-language prompts. The 2509 update is the one to grab as of this writing (Wu et al. 2025). Mac: ~20B-class; MPS works via qwen-image-mps, and a 4-step Lightning/Rapid variant exists for slow hardware. Realistically wants 32GB unified memory; on 16GB it swaps. The pick over Kontext when we specifically need text rendering or non-Latin scripts.
- Gemini 2.5 Flash Image (“Nano Banana”) and its 2026 successor Gemini 3.1 Flash Image (“Nano Banana 2”, available via Vertex AI). Cheap, fast, notionally conservative about identity preservation — faces don’t drift much across edits, which is great for not terrifying our social brains. Nano Banana 2 (4K out, better text rendering) is a different beast and is probably the one to beat.
- GPT-Image-1 and its successor GPT-Image-2 (OpenAI). Edit-mode endpoint of the same model that powers ChatGPT image generation. The chat-native pick when we already live in ChatGPT; the /images/edits API has been flaky in 2026 for programmatic use, so I would not build a pipeline around it.
- Adobe Generative Fill inside Photoshop, on Adobe’s Firefly model. Most-deployed by a wide margin, because Photoshop. As of 2026, output bumped to 2K, Firefly Image 5 and Fill & Expand replace the retired Image 3, and the new partner-model picker puts FLUX and Gemini in the same dropdown as Firefly.
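The open-weights route, by contrast, is a local script. A minimal sketch using the FluxKontextPipeline in recent diffusers with the gated dev checkpoint, assuming we have accepted the licence on Hugging Face and have the memory for bf16 (on a 16GB Mac, reach for the Q4 GGUF in ComfyUI, or mflux, instead):

```python
import torch
from diffusers import FluxKontextPipeline
from diffusers.utils import load_image

# FLUX.1 Kontext [dev]: gated on Hugging Face, non-commercial licence.
pipe = FluxKontextPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Kontext-dev", torch_dtype=torch.bfloat16
)
pipe.to("mps")  # or "cuda"; the bf16 weights alone are ~24GB, so this wants a big machine

image = load_image("photo.png")
edited = pipe(
    image=image,
    prompt="Remove the lamppost.",
    guidance_scale=2.5,  # the value the diffusers example uses for Kontext edits
).images[0]
edited.save("edited.png")
```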
2 Single-task tools
Even with instruction-following editors, a specialised tool is often cheaper, faster, or more predictable.
2.1 Background removal
- remove.bg — hosted, fast, the brand-name option. No longer the cheapest — PhotoRoom and API4AI undercut it for bulk work.
- Clipping Magic — hosted, marginally more configurable, with a nice user interface. The pick when we have a hard cutout (hair, fur, glass) and want manual scalpel/refinement tools.
- BiRefNet — open-weights, scriptable, near-SOTA quality (Zheng et al. 2024). PyTorch + MPS works with no extra setup; ~3.5GB at 1024², trivial on a 16GB Mac. See the sketch after this list.
- RMBG-2.0 (Bria) — a BiRefNet derivative with proprietary training; current SOTA on difficult backgrounds per Bria’s own benchmark. Same Mac install footprint as BiRefNet. The licence is Bria RAIL-M — research and personal use only. The HF download page makes it look open; it isn’t.
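BiRefNet’s scriptability is the point, so here is the shape of it. A minimal sketch following the Hugging Face model card (the model id and the 1024×1024 ImageNet-normalised preprocessing are as documented there; the MPS fallback is my addition):

```python
import torch
from PIL import Image
from torchvision import transforms
from transformers import AutoModelForImageSegmentation

model = AutoModelForImageSegmentation.from_pretrained(
    "ZhengPeng7/BiRefNet", trust_remote_code=True
)
device = "mps" if torch.backends.mps.is_available() else "cpu"
model.to(device).eval()

# Standard ImageNet normalisation at the model's 1024x1024 working resolution.
preprocess = transforms.Compose([
    transforms.Resize((1024, 1024)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

img = Image.open("photo.jpg").convert("RGB")
x = preprocess(img).unsqueeze(0).to(device)
with torch.no_grad():
    # The model returns a list of maps; the last one is the final matte.
    mask = model(x)[-1].sigmoid().cpu()[0].squeeze()

alpha = transforms.ToPILImage()(mask).resize(img.size)
img.putalpha(alpha)
img.save("cutout.png")  # RGBA, background removed
```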
2.2 Upscaling and super-resolution
- Topaz Gigapixel — paid, the polished commercial option for photographic upscaling, especially faces and skin. Native Apple Silicon, just works.
- Upscayl — free, open-source, drag-and-drop GUI wrapping the Real-ESRGAN family (Wang, Xie, et al. 2021) and community variants like 4x-UltraSharp; the broader catalogue lives at OpenModelDB. Real-ESRGAN’s upstream repo has been dormant for ~2 years, but the weights live on through Upscayl, ChaiNNer and ComfyUI; that’s how most people now use them. Upscayl runs Vulkan via MoltenVK on Apple Silicon — fast enough for occasional use, less efficient than a Metal-native tool. For scripted use of the same weights, see the sketch below.
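If we want those Real-ESRGAN-family weights in a script rather than a GUI, spandrel (the model-loading library extracted from ChaiNNer) will load most checkpoints from OpenModelDB. A sketch, assuming a 4x checkpoint already on disk:

```python
import torch
from PIL import Image
from torchvision.transforms.functional import to_tensor, to_pil_image
from spandrel import ImageModelDescriptor, ModelLoader

# Any Real-ESRGAN-family checkpoint, e.g. 4x-UltraSharp.pth from OpenModelDB.
model = ModelLoader().load_from_file("4x-UltraSharp.pth")
assert isinstance(model, ImageModelDescriptor)

device = "mps" if torch.backends.mps.is_available() else "cpu"
model.to(device).eval()

# spandrel models take and return NCHW float tensors in [0, 1].
x = to_tensor(Image.open("small.png").convert("RGB")).unsqueeze(0).to(device)
with torch.no_grad():
    y = model(x).clamp(0, 1)  # 4x the input resolution for a 4x model
to_pil_image(y.squeeze(0).cpu()).save("big.png")
```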
2.3 Object removal and inpainting
- cleanup.pictures — quick browser tool for removing people, text, and small defects from a single image. Increasingly redundant against frontier multimodal editors and against Photoshop, but handy for one-offs. Local/scriptable counterpart: IOPaint (formerly lama-cleaner), which wraps the same LaMa backbone and runs on MPS.
- For heavier inpainting workflows — mask + prompt, control over what fills the hole, regional generation — the local diffusion clients (ComfyUI, InvokeAI, Draw Things) all do this on top of Stable Diffusion / Flux / SDXL backbones.
2.4 Face restoration
GFPGAN (Wang, Li, et al. 2021) and CodeFormer (Zhou et al. 2022) are the standard open-weights face restorers, both in long-term maintenance and available as ChaiNNer nodes. For most cases now we let Kontext or Nano Banana fix the face as a side effect of any other edit.
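When we do need to script it, GFPGAN’s Python entry point is small. A sketch following the repo’s own inference script (the weights path is whatever we downloaded from the releases page; the maintenance-mode basicsr dependency can need version pinning):

```python
import cv2
from gfpgan import GFPGANer

restorer = GFPGANer(
    model_path="GFPGANv1.4.pth",  # from the GFPGAN releases page
    upscale=2,
    arch="clean",
    channel_multiplier=2,
    bg_upsampler=None,  # optionally a Real-ESRGAN upsampler for the non-face pixels
)

img = cv2.imread("old_photo.jpg", cv2.IMREAD_COLOR)
# Returns cropped faces, restored faces, and the full pasted-back image.
_, _, restored = restorer.enhance(
    img, has_aligned=False, only_center_face=False, paste_back=True
)
cv2.imwrite("restored.jpg", restored)
```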
2.5 Document cleanup, old-skool
ScanTailor Advanced does content-aware cropping, dewarping, and background removal of scanned documents. Strictly classical computer vision — no generative model — but the use case is alive and nothing generative replaces it for batch book-scan work. The actively-maintained ARM64 fork is the vigri branch (Qt6, 3D dewarp); the older yb85/scantailor-advanced-osx bundle still works for the legacy lineage.

