Generative art with language+diffusion models

also some autoregressive models

2022-09-16 — 2026-06-13

Wherein the Model Landscape Is Surveyed by Lineage, Licence and Output Character, and the Legal Treatment of AI-generated Images Is Tabulated Across Four Jurisdictions — Japan, Singapore, the EU and Australia.

buzzword
computers are awful
generative art
machine learning
making things
music
neural nets
photon choreography
UI
Figure 1

Generative art using modern diffusion-backed image generators. The name-brand models are DALL-E 2, Stable Diffusion, Midjourney etc, which are diffusion models for image generation + transformer models for the text-to-image part.

This page is about image generation — prompt to image, with a focus on the models and the model ecosystems. For editing existing images with ML — instruction-following editors (FLUX.1 Kontext, Qwen-Image-Edit, Nano Banana, etc.), background removal, upscaling, inpainting — see editing images with machine learning. For the front-end software that runs these models locally (ComfyUI, InvokeAI, Draw Things, ChaiNNer, …) see front-end clients for AI image models. For the community back-story behind Stable Diffusion, Black Forest Labs and the open-vs-corporate fault line, see AI democratization.

I’m interested in generative image models in general. I am particularly, practically interested in models that run locally — on my own machine and GPU rather than behind a hosted API — and ideally ones I can fine-tune or train myself. I happen to work on a Mac, so the macOS- and Apple-Silicon-specific runtime detail (which client, which quantization, how much RAM) is covered in the front-end clients notebook; this page stays about the models themselves, not the box they run on. I like using the community-trained models for specialization or jailbreaking.1 As with many other parts of AI, the community is incredible.

For audio stuff, see music diffusion.

1 Theory

For the maths, see neural denoising diffusion models; the pre-neural-diffusion lineage (DeepDream, GANs, CPPNs) is mostly of historical interest now. Some pointers for image-diffusion specifically:

The transformer has been applied to vision: the Vision Transformer cuts an image into patches and runs attention over them like word tokens (Dosovitskiy et al. 2021), building on earlier spatial-attention work (Zhu et al. 2019); the idea was later extended into classification backbones (Guo, Jia, and Bai 2022) and paired with self-supervised pretraining through masked autoencoders (He et al. 2021). Those ViT models were encoder/decoder models already. Today’s diffusion backbones (SD3, Flux, Qwen-Image’s MMDiT) also employ transformers internally, swapping the old convolutional U-Net denoiser for a transformer: the DiT, or diffusion transformer. Vision models without any diffusion at all are also a thing, generating image tokens autoregressively the way a language model emits text (e.g. Lumina-mGPT-2.0). I think “diffusion” continues to be an acceptable label since it still names the training objective rather than the architecture, but, like, I dunno you guys, it’s complicated.
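To make the “patches as tokens” idea concrete, here is a minimal pure-Python sketch. It is a toy, not any library’s implementation: the names `patchify` and `token_count` are made up for illustration, and a real ViT or DiT would follow this with a learned linear projection and position embeddings.

```python
def patchify(image, patch):
    """Split a 2-D grid (list of rows) into non-overlapping patch x patch tiles.

    Returns the tiles in row-major order, each flattened to a 1-D list,
    which is the "sequence of tokens" a transformer attends over.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = [image[top + dy][left + dx]
                    for dy in range(patch) for dx in range(patch)]
            tokens.append(tile)
    return tokens


def token_count(height, width, patch):
    """Number of patch tokens for an image of the given size."""
    return (height // patch) * (width // patch)


# A toy 4x4 "image" whose pixel values are just their raster index.
toy = [[row * 4 + col for col in range(4)] for row in range(4)]
toy_tokens = patchify(toy, patch=2)   # 4 tokens of 4 values each
```

With 16-pixel patches, `token_count(224, 224, 16)` gives 196 tokens, which is why attention cost climbs quickly with resolution.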

2 Where to find generation models

Hugging Face is the heavy-hitter in neural networks generally and hosts most of the foundation generative image models. Generative art models additionally have the specialized community CivitAI, which hosts the long tail of community fine-tunes. The workhorse format is the LoRA (low-rank adaptation): a small parameter-efficient adapter — often under 300 MB — that bolts a style, character or idiom onto a base model without retraining it. CivitAI also hosts full fine-tunes — whole checkpoints retrained or merged for an aesthetic, Pony Diffusion V6 XL among them — which carry a style more thoroughly but run to several GB apiece and are far more of a chore to make and store. A LoRA is the convenient default — small, cheap, stackable — not the only way to specialize a base. For the back-story of both, see AI democratization.

| Feature | Hugging Face | CivitAI |
|---|---|---|
| Focus | Research-first platform (300,000+ models) | Community-driven artistic hub |
| Model Types | Stable Diffusion variants, ControlNet, LoRAs | Artistic models (anime, photorealistic, 3D), fine-tuned LoRAs |
| Discovery | Organized by pipeline tags and metrics | Visual browsing with instant output previews |
| Documentation | Comprehensive model cards with bias analysis | User-generated examples and prompt sharing |
| Community | Academic and ML practitioner oriented | Artist and creator focused |
| Integration | Native PyTorch/TensorFlow support, diffusers library | Simple download format for GUIs like Draw Things |
| Content Policy | Stricter content guidelines | More permissive with NSFW filters |
| Traffic | Research-focused userbase | 25M+ monthly visits, 500+ new models daily |
| Ecosystem | Central to ML research and deployment | Popular for artistic workflows and style training |

Most local clients (Draw Things, ComfyUI, Mochi Diffusion, …) import from both ecosystems.

3 Notable model lineages

Figure 2

The model landscape is fractured between corporate offerings (Flux, DALL-E 3, Midjourney — polish and ease, but advanced features behind APIs or subscriptions) and community-trained ones (SDXL, CivitAI LoRAs — customisation and local control, steeper learning curves). The full back-story of how this fracture happened is standard AI democratization drama.

Too many options? There are for me. I used the Vibe Check™ method to narrow it down. Which is to say, I asked an LLM to scan some community write-ups and CivitAI discussions. Citations needed.

FLUX.1 [dev] produces camera-accurate images — literal rather than atmospheric — and handles text legibility that was essentially impossible before this generation. Community comparisons consistently put it at or near the top for photorealism and prompt adherence among open-weight models. The cost is speed: roughly a minute per image locally without GGUF quantization. Non-commercial licence. The one I’d grab for publication-grade output.

FLUX.1 [schnell] is a distilled version of dev, Apache 2.0, and roughly 6× faster at a quality cost that most describe as noticeable but not catastrophic. Community practice has largely converged on schnell for iteration and dev (or a LoRA stack on dev) for finals. The SDXL vs Flux comparison at Stable Diffusion Art gives a sense of where schnell sits against the field.

FLUX.2 [klein] 4B generates in under a second on a current GPU and is viable for interactive brainstorming. Early community impressions put quality somewhere around Z-Image Turbo — good enough for ideation, not for portfolio. Apache 2.0; the 302.AI benchmark gives the most systematic quality breakdown currently available.

Ideogram (ex-Google Imagen team) renders text inside images: typography that art-forward models handle badly.

Z-Image Turbo (Alibaba Tongyi, 6B, Apache 2.0, November 2025) runs in 16 GB VRAM, renders English and Chinese text, and topped the open-source tier of the Artificial Analysis leaderboard — 8th overall at launch, ahead of much larger models. There is plenty of community around it in both the West and China: LoRAs, checkpoint forks and ControlNets on CivitAI, and on the Chinese platforms it has topped the ModelScope popularity charts.

Qwen-Image (Alibaba Tongyi, 20B MMDiT, Apache 2.0) is an open text-rendering champion, handling legible labels, infographics, posters and multi-line typography. The current local weight, Qwen-Image-2512 (December 2025), tops the open-source field over 10,000 blind rounds of Alibaba’s own AI Arena (a vendor benchmark, so take it with a pinch of salt). The catch is weight: 20B and a Qwen2.5-VL text encoder mean heavy quantization or a big-RAM machine. This won’t run on a phone. The family includes Qwen-Image-Edit and Qwen-Image-Layered (generation straight to separable layers). A hosted (closed) Qwen-Image-2.0 followed in February 2026 with professional typography and native 2K. Where Z-Image Turbo goes for speed, this goes for text-and-realism.

SDXL and its fine-tune ecosystem remain relevant primarily because of ecosystem depth: the LoRA catalogue for SDXL is still estimated at five to ten times the size of FLUX.1’s, and specific aesthetics — a film stock, an illustrator’s hand, an art movement — are far more likely to exist as SDXL fine-tunes. On raw photorealism Flux wins; but if a specific look already exists as an SDXL LoRA, SDXL is turnkey.

Pony Diffusion V6 XL is the model for furries! It uses Danbooru/e621 booru tagging conventions — score_9, anthro, sad_expression, dramatic_pose — that give precise control over character features and emotional content. Output skews semi-anime even on realistic prompts; the palette is vibrant and linework clean; without the score tags, output dulls. The CivitAI model page and the Stable Diffusion Art write-up describe the tagging system in detail. And hey, you guys, furrydom is totally not all about sex, but guess what? This model can be quite nasty.

Midjourney V7/V8 is still a benchmark. It is tuned to be attractive before it is flexible: even loosely-specified prompts come back polished and idealized, and it will override a requested style in favour of something better-looking. The result is a recognizable house look that this LLM vibe-check would sum up as “luxury hotel lobby”: reliably beautiful, hard to push in a specific compositional direction, and the subject of an ongoing critique about aesthetic homogenization in AI art (McCormack et al. 2024). No API — Discord/browser only — which means no programmatic pipeline.

SD 1.5 is a legacy at this point, but its ControlNet and LoRA ecosystem is deep and does not port to SDXL. If a specific look exists only as an SD 1.5 fine-tune, there is often no practical alternative; otherwise SDXL or Flux are usually better choices.

3.1 Feature matrix

| Model | Licence | Fits in 16 GB RAM? | Output character | Speed (local) | Text in image | LoRA depth |
|---|---|---|---|---|---|---|
| FLUX.1 dev | Non-commercial | GGUF Q4–Q5 only | Photorealistic, literal | Slow | Good | Growing |
| FLUX.1 schnell | Apache 2.0 | GGUF Q5 fits | Photorealistic, slightly softer | Moderate | Good | Growing |
| FLUX.2 [klein] 4B | Apache 2.0 | Yes, fp16 ~8 GB | Photorealistic, draft quality | Fast | Untested | Nascent |
| Z-Image Turbo | Apache 2.0 | Yes, ~12 GB fp16 | Photorealistic, portraits | Very fast (8-step) | Good (EN/CN) | Growing fast |
| Qwen-Image 2512 | Apache 2.0 | GGUF, tight (20B) | Photoreal, text-strong | Slow (20B) | Excellent (EN/CN) | Growing |
| SDXL | CreativeML OpenRAIL-M | Yes, ~6.5 GB | Fine-tune dependent | Moderate | Poor | Very deep |
| Pony V6 XL | CreativeML OpenRAIL-M | Yes, ~6.5 GB | Semi-anime / illustrative | Moderate | Poor | Deep, booru-tagged |
| Midjourney V7/V8 | Closed, API only | No | Opinionated, polished | Fast (API) | Moderate | |
| SD 1.5 | CreativeML OpenRAIL-M | Yes, ~2 GB | Soft, painterly | Very fast | Poor | Deepest |
| Ideogram | Closed, API only | No | Mixed | Fast (API) | Excellent | |

Speed is relative and hardware-dependent; published benchmarks usually run on a desktop NVIDIA GPU, which is faster than most laptop or Apple-Silicon setups. For live quality rankings — Elo scores from blind human votes, updated as new models ship — check the Artificial Analysis text-to-image leaderboard.
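The memory figures in the matrix are mostly back-of-envelope arithmetic: parameter count times bytes per weight. A rough sketch follows, assuming the parameter counts quoted on this page and counting only the denoiser weights. The text encoder, VAE and activations are extra, and real GGUF quantizations carry somewhat more than their nominal bit width, so treat the results as lower bounds. `weight_gigabytes` is a made-up helper name.

```python
def weight_gigabytes(params_billion, bits_per_weight):
    """Approximate size of the weights alone, in GB (10**9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Parameter counts as quoted in the text above.
for name, size_b in [("FLUX.2 [klein] 4B", 4), ("Z-Image Turbo", 6), ("Qwen-Image", 20)]:
    fp16 = weight_gigabytes(size_b, 16)
    q4 = weight_gigabytes(size_b, 4)
    print(f"{name}: ~{fp16:.0f} GB at fp16, ~{q4:.0f} GB at a nominal 4 bits")
```

This is why a 20B model is “tight” on a 16 GB machine even when quantized, while a 4B model fits at full fp16.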

3.2 Structural control

A ControlNet lets us pin down the composition of an image — where a figure stands, the pose it strikes, the outline of a scene — separately from what the prompt asks for. We hand the model a “hint” image that encodes the structure we want, and it renders a finished picture that obeys it. This is how we get a character in a chosen pose, or a building with a given silhouette: spatial details more exact than we would describe with words.

The hint comes in a few flavours. A pose skeleton is a stick figure marking where the head, shoulders, elbows and knees go — read off a reference photo by a pose-detector like OpenPose (Cao et al. 2019), or posed by hand. An edge map traces the outlines of a reference as pale lines on black, either hard-edged (Canny, after the classic edge-detection algorithm) or soft-edge, a blurrier trace that leaves the model more room to interpret. A scribble is just a rough sketch we draw ourselves. Either way the model takes the hint as scaffolding it must follow, and invents everything else from the prompt.
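To show what “pale lines on black” means in practice, here is a deliberately crude edge-hint sketch. It is not Canny: there is no smoothing, non-maximum suppression or hysteresis, only a brightness-jump threshold against the right and lower neighbours. The function name `edge_hint` and the toy reference grid are invented for illustration.

```python
def edge_hint(gray, threshold):
    """Return a white-on-black edge map for a grayscale grid (values 0-255).

    A pixel is marked as an edge when the brightness jump to its right or
    lower neighbour exceeds `threshold`. Border pixels with no such
    neighbour simply skip that comparison.
    """
    height, width = len(gray), len(gray[0])
    hint = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            right = abs(gray[y][x + 1] - gray[y][x]) if x + 1 < width else 0
            below = abs(gray[y + 1][x] - gray[y][x]) if y + 1 < height else 0
            if max(right, below) > threshold:
                hint[y][x] = 255
    return hint


# A dark square on a bright background.
reference = [
    [200, 200, 200, 200],
    [200,  30,  30, 200],
    [200,  30,  30, 200],
    [200, 200, 200, 200],
]
hint = edge_hint(reference, threshold=100)
```

The output keeps only the outline of the square and discards its fill, which is the point of an edge hint: structure without content.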

Which base models support this? The technique was invented on Stable Diffusion 1.5, so that older model still has the widest menu of control types — if we want some unusual kind of conditioning, it most likely exists there first. Newer models caught up by bundling the common controls (pose, edges, depth) into a single all-in-one model:

  • SD 1.5 — the original, and still the most varied.
  • SDXL — community all-in-one models.
  • Flux.1 dev — an official set from Black Forest Labs (Flux Tools) and a community all-in-one.
  • Qwen-Image — gained its own all-in-one in late 2025.

A ControlNet is tied to a single base model, the way a LoRA is. The clients page walks through the workflow — ComfyUI is the most thorough, Draw Things the easiest on a Mac.

4 Resolution

Think in a megapixel budget rather than fixed dimensions. Each model has a native total-pixel count it was trained around, and that budget is spent across whatever aspect ratio we choose — at ~1 MP, 1024×1024, 1216×832 and 1344×768 are the same budget in different shapes. This size-and-aspect conditioning is something SDXL introduced (Podell et al. 2023), training across multiple aspect ratios rather than a single square.
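The budget arithmetic can be sketched as follows. This assumes that both sides should snap to a multiple of 64 pixels, which is a common convention for latent-space models but not a universal rule; check what a given model expects. `dims_for_budget` is a made-up helper name.

```python
import math


def dims_for_budget(pixel_budget, aspect, multiple=64):
    """Width and height near `pixel_budget` total pixels at width/height = aspect.

    Both sides are rounded to the nearest `multiple`, so the result only
    approximates the requested budget and aspect ratio.
    """
    width = math.sqrt(pixel_budget * aspect)
    height = math.sqrt(pixel_budget / aspect)

    def snap(value):
        return max(multiple, round(value / multiple) * multiple)

    return snap(width), snap(height)


budget = 1024 * 1024                                # ~1 MP
square = dims_for_budget(budget, 1.0)               # (1024, 1024)
landscape = dims_for_budget(budget, 1216 / 832)     # (1216, 832)
widescreen = dims_for_budget(budget, 16 / 9)        # (1344, 768)
```

All three shapes land within a few percent of the same pixel count, which is the sense in which they are “the same budget”.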

Generating much past native resolution can produce artefacts — duplicated heads and limbs, repeated motifs. The SDXL paper (Podell et al. 2023) motivates its design from the resolution limits of the 512-native Stable Diffusion 1.x models.

A common route past native resolution is upscaling rather than bigger generation: render at native res, then enlarge with a tiled-diffusion pass or a dedicated ESRGAN-class upscaler.
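A tiled pass needs a plan for where the tiles go. The sketch below is one plausible way to lay them out, under assumed defaults of 1024-pixel tiles and a 128-pixel overlap; actual tiled-diffusion and upscaler nodes choose their own sizes and blending, and `tile_boxes` is a hypothetical helper, not an API from any client.

```python
def tile_boxes(width, height, tile=1024, overlap=128):
    """Return (left, top, right, bottom) boxes that cover the canvas.

    Neighbouring tiles share at least `overlap` pixels so seams can be
    blended. The last tile in each row/column is shifted back to end
    flush with the canvas edge, so it may overlap its neighbour by more.
    """
    if overlap >= tile:
        raise ValueError("overlap must be smaller than the tile size")
    stride = tile - overlap

    def starts(length):
        if length <= tile:
            return [0]
        positions = list(range(0, length - tile, stride))
        positions.append(length - tile)
        return positions

    return [(left, top, min(left + tile, width), min(top + tile, height))
            for top in starts(height) for left in starts(width)]


boxes = tile_boxes(2048, 2048)   # a 3 x 3 grid of overlapping tiles
```

Each tile stays at the model’s native size, so the denoiser never sees a canvas larger than it was trained on.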

Nonetheless there is variation in how robustly models support various resolutions natively:

  • SD 1.5 is 512-native and falls apart not far above 768.
  • SDXL and its fine-tunes (including Pony V6 XL) are 1024-native, i.e. ~1 MP (Podell et al. 2023), and comfortable to roughly 1.5 MP.
  • FLUX.1 dev / schnell are ~1 MP native and in practice stay coherent to around 2 MP.
  • FLUX.2 dev generates up to 4 MP natively — Black Forest Labs’ explicit pitch is detail at that size without an upscale step.
  • Z-Image Turbo is trained at 1024 but generates to 2048×2048 (~4 MP) given enough memory.
  • Midjourney and Ideogram are fixed-output API services — preset sizes, not a megapixel dial.

4.1 Starting points by goal

  • Photorealistic output for publication → FLUX.1 dev (or dev + LoRA)
  • Fast iteration on 16 GB, no licence constraint → Z-Image Turbo, FLUX.1 schnell GGUF, or FLUX.2 [klein] 4B
  • Specific aesthetic that probably exists as a fine-tune → check SDXL on CivitAI first
  • Anime, creature design, dramatic poses, negative-affect content → Pony V6 XL, hopefully other fine-tunes
  • Text legible inside the image → Ideogram (API) or FLUX.1 dev, or the Alibaba models (Z-Image Turbo/Qwen-Image)
  • Just want something that looks good without thinking about it → Midjourney

4.2 The rest of the menu

Open Draw Things and the model list is loooong — names like LTX, Wan 2.2, Hunyuan, ERNIE, HiDream-I1, Cosmos. Most of that length is the menu collapsing three different kinds of model into one alphabetical list. Video models (Wan 2.2, LTX-Video, Hunyuan) are a separate modality. Cosmos is NVIDIA’s world foundation model for robotics and autonomous-vehicle simulation — it generates video, but as synthetic training data for embodied AI rather than as art.

Among the image models proper, two more have ecosystems forming around them. HiDream-I1 (HiDream.ai, 17B, MIT) topped the Artificial Analysis board for a stretch of 2025 and has a modest but active LoRA following, with the lineage continuing in the O1 successor. ERNIE-Image (Baidu, 8B, Apache 2.0) is newer still — it occupies the text-rendering niche and picked up ComfyUI workflows and community quantizations within days of release.

One prominent name is absent from the community hype: Stable Diffusion 3 / 3.5. SD3’s June 2024 launch went badly — weak anatomy, and a licence so restrictive that CivitAI temporarily banned all SD3 resources — and although Stability later revised the licence and shipped the improved 3.5, the community had already decamped to Flux. In 2026 its LoRA ecosystem is still thin beside SDXL or Flux.

5 Niche fine-tunes and LoRAs

A working menu of specialist fine-tunes and LoRAs, weighted toward 1) idioms the default checkpoints render badly and 2) my interests.

5.1 Historical engravings and prints

  • YFG Albrecht Dürer Engraving Style (Flux dev) — captures some of Dürer’s burin line work (swelling/tapering strokes, cross-hatched form modelling) rather than the muddy “old print” pastiche most engraving LoRAs settle for.
  • WOOD ENGRAVING Style for FLUX (Flux dev) — clean large-block woodcut, not mezzotint sludge. More Gustave Doré than Piranesi.
  • gokaygokay/Flux-Engrave-LoRA (Flux dev) — gokaygokay is a well-known HuggingFace style-LoRA trainer with consistent technical quality.

5.2 Scientific diagrams

  • YFG Patents (Flux dev) — trained on patent figures with numbered callouts, dashed hidden lines, exploded views. Most “blueprint” LoRAs only render whole machines; this one captures the schematic call-out idiom itself.
  • Anatomica: Chalk (Flux dev) — anatomical chalk-plate aesthetic, and the creator notes the style transfers to non-anatomical subjects, so we get “Vesalius treats a toaster”.
  • Century Botanical Illustration (Flux dev) — 18th-century plate aesthetic (stipple shading, latin labels, ivory ground). Almost every “botanical” LoRA returns modern watercolour; this one does engraved plates.
  • Anatomical Surrealism (Flux dev) — Haeckel-meets-Da-Vinci hybrid of plate-style anatomy with mechanical/botanical.

Homework for the reader: scientific-diagram coverage is thinner than one might hope. There is room here for a fine-tune on Ramón y Cajal-style neural plates or period microscopy. And every option above is built on Flux.1 dev — there is still no Qwen-Image patent or diagram LoRA. This is an odd gap given that Qwen’s text rendering is exactly what a callout-laden schematic wants. The raw material is sitting in the open: DeepPatent2 is 2.7M public-domain patent drawings already labelled with object names, so a Qwen patent LoRA is waiting for whoever trains it first — a cloud-GPU job, since Draw Things’ trainer doesn’t reach Qwen.

5.3 Emoji and stickers

  • fofr/sdxl-emoji (SDXL) — the Apple-emoji LoRA. Lightweight, well-known, transparent-bg-friendly via post-matte.
  • starsfriday/Kontext-Emoji-LoRA (FLUX.1-Kontext-dev) — different value proposition: style transfer to emoji. Feed a portrait, get the emoji version of that person. Kontext-native.

There is, as far as I can tell, no great isolated-icon-with-transparency model right now. Generate on a flat colour and matte out with rembg, BiRefNet, or SAM as a separate step. Surely Qwen-Image-Layered should be perfect for this?

5.4 Other specialist idioms

6 How do I download and use that cool model I found?

  • Hugging Face: model cards specify the format; most image models ship as safetensors and drop directly into ComfyUI’s models/ tree or Draw Things’ model manager. The format conversion docs cover converting between the diffusers layout and safetensors, in both directions, for programmatic use.

  • CivitAI: most .safetensors checkpoints and LoRAs import directly into Draw Things (URL paste) or ComfyUI (drop in the right subdirectory). On a Mac, filtering by “macOS-optimized” or “CoreML” tags finds Mochi Diffusion-compatible bundles. The clients notebook has step-by-step walkthroughs for each client.
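For ComfyUI, “the right subdirectory” depends on what kind of file we downloaded. A small sketch of the mapping, covering only the handful of folder names I am confident about; newer model families use additional folders, so treat this table as partial. The helper name and the example filename are made up.

```python
from pathlib import Path

# Subdirectories of ComfyUI's models/ tree for a few common file kinds.
COMFYUI_SUBDIRS = {
    "checkpoint": "checkpoints",
    "lora": "loras",
    "controlnet": "controlnet",
    "vae": "vae",
}


def comfyui_destination(comfy_root, kind, filename):
    """Where a downloaded file of the given kind would go under models/."""
    try:
        subdir = COMFYUI_SUBDIRS[kind]
    except KeyError:
        raise ValueError(f"unknown model kind: {kind!r}") from None
    return Path(comfy_root) / "models" / subdir / filename


# Hypothetical install root and filename, for illustration only.
dest = comfyui_destination("ComfyUI", "lora", "example_style.safetensors")
```

After moving a file into place, ComfyUI typically needs a refresh or restart before the new model shows up in its loaders.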

7 Fine-tuning image models

Fine-tuning a diffusion model is much like fine-tuning any other foundation model, except the goal is to specialise rather than to realign behaviour: we start from a pretrained base that can already make images — just not quite the ones we want — and nudge it towards a particular style, character or concept on a small curated dataset of our own. The machinery is identical to the LLM case: usually a LoRA or other parameter-efficient adapter over a frozen base, occasionally a full fine-tune — retraining the whole damned thing into a fresh multi-gigabyte checkpoint — when we want the style baked in deeper. For the kind of specialisation we want here, a LoRA is almost always enough.
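For intuition about why the adapter is so small, here is a toy sketch of the low-rank update. It follows the usual LoRA formulation, where the frozen matrix W is used as W + (alpha / rank) · B·A, written out in pure Python lists rather than any training library. The names `apply_lora` and `lora_param_count` are invented for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(weight, down, up, alpha):
    """Return weight + (alpha / rank) * up @ down without touching `weight`.

    weight: d_out x d_in frozen base matrix
    down:   rank  x d_in  (often called A)
    up:     d_out x rank  (often called B)
    """
    rank = len(down)
    scale = alpha / rank
    delta = matmul(up, down)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


def lora_param_count(d_out, d_in, rank):
    """Trainable parameters the adapter adds for one adapted matrix."""
    return rank * (d_in + d_out)


base = [[1.0, 0.0, 0.0],
        [0.0, 1.0, 0.0]]            # d_out=2, d_in=3, frozen
down = [[1.0, 2.0, 3.0]]            # rank 1
up = [[0.5], [0.0]]
adapted = apply_lora(base, down, up, alpha=1.0)
```

For a 4096×4096 matrix at rank 16, `lora_param_count` gives 131,072 trainable values against 16,777,216 in the full matrix, a 128-fold reduction, and only the two small factors need to be saved.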

The recipe:

  1. Curate a small dataset — surprisingly small, commonly 10–30 images for a style, more for a character or concept.
  2. Caption each image — one text label apiece (more below).
  3. Descend a few hundred to a couple of thousand gradient steps, choosing a rank (the LoRA’s dimension) that trades capacity against file size.

The counts are low because the base already knows how to draw — we are teaching one new association, not the whole skill. Most of the craft is in dataset curation and captioning, not the hyperparameters.
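The step count in the recipe follows from dataset size and a few trainer settings. A sketch of the arithmetic, assuming the “repeats per image per epoch” convention that several trainers use; the exact knob names differ between tools, and `total_steps` is a made-up helper.

```python
import math


def total_steps(num_images, repeats, epochs, batch_size):
    """Optimizer steps for a run that sees each image `repeats` times per epoch."""
    steps_per_epoch = math.ceil(num_images * repeats / batch_size)
    return steps_per_epoch * epochs


# 20 style images, each repeated 10 times per epoch, 10 epochs, batch size 2.
style_run = total_steps(num_images=20, repeats=10, epochs=10, batch_size=2)
```

That example comes to 1,000 steps, inside the “few hundred to a couple of thousand” range above.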

7.1 Captioning

A text-to-image model was trained on (image, caption) pairs, so to teach it something new we feed it more of the same. The caption does two jobs.

First, it is the handle: the words we type later to summon what we trained, usually a rare trigger token — ohwx woman, sks dog — that means nothing to the base model yet, so the new concept has no preconceptions attached to it.

Second, it tells the trainer what to ignore. Whatever we name in the caption that the model already understands we can factor out; the residual we leave unnamed is what gets absorbed into the LoRA. This is the lever for controlling what the LoRA actually learns: for a style we caption the content and let the style be the residual; for a character we caption the pose, outfit and background so the character identity attaches to the trigger token. This is some weird multimodal voodoo.

Which dialect of caption to use depends on the base model:

  • Natural-language captions: full sentences, such as “a photograph of a woman standing in a field at dusk”, for Flux, SD3.5, Qwen-Image, or anything else with a strong text encoder.
  • Booru tags: comma-separated danbooru-style tags such as 1girl, standing, field, sunset, for the SD 1.5 / SDXL / Pony anime lineage, including the score_9, anthro, … convention Pony uses.

We rarely type these from scratch: bootstrap with a VLM (a captioner like BLIP, or a “WD14 tagger” for booru tags) and hand-correct.
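The two caption dialects, and the trigger-token trick, can be sketched as plain string handling. This assumes the common convention of one sidecar .txt file per image with the same filename stem; check what the chosen trainer expects. The function names and the sentence template are illustrative, not any tool’s API.

```python
from pathlib import Path


def make_caption(trigger, details, dialect):
    """Build one caption: the trigger token plus whatever we want factored out.

    dialect "natural" -> a sentence; dialect "booru" -> comma-separated tags.
    """
    if dialect == "natural":
        return f"a photograph of {trigger}, " + ", ".join(details)
    if dialect == "booru":
        tags = [trigger] + [d.replace(" ", "_") for d in details]
        return ", ".join(tags)
    raise ValueError(f"unknown dialect: {dialect!r}")


def write_sidecar(image_path, caption):
    """Write the caption to a .txt file next to the image, same stem."""
    sidecar = Path(image_path).with_suffix(".txt")
    sidecar.write_text(caption + "\n", encoding="utf-8")
    return sidecar


natural = make_caption("ohwx woman", ["standing in a field", "at dusk"], "natural")
booru = make_caption("sks_dog", ["sitting", "grass", "looking at viewer"], "booru")
```

Everything listed in `details` is what we are asking the trainer to factor out; whatever is left unnamed in the image is what ends up attached to the trigger.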

7.2 Training software

The mature trainers are CUDA-first command-line tools (most with a GUI wrapper). These can be run locally, but usually we rent a GPU by the hour on a cloud provider.

  • ai-toolkit (ostris) — the current default for Flux, FLUX.2 and Qwen-Image; config-file driven with an optional web UI. What most current tutorials assume.
  • kohya-ss/sd-scripts — a highly configurable veteran working with SD 1.5, SDXL and Flux; most of the older training GUIs are wrappers around it.
  • DiffSynth-Studio (ModelScope) — Qwen-Image-first, with a layer-by-layer offload path that makes it feasible to train large models on consumer devices, somehow.
  • OneTrainer — seems very approachable: a proper desktop GUI covering LoRAs, full fine-tunes and embeddings, with dataset and labelling built in, and support for a wide range of models.
  • SimpleTuner — seems optimised for large datasets and multiple GPUs, production-scale runs rather than a quick character LoRA.

7.3 Where to run it

Training is hungrier than inference. The VRAM bar is high — Flux dev LoRA training wants roughly 24 GB and is comfortable at 48 GB, while FLUX.2 dev needs an 80 GB-class card — so the options run roughly in order of how much of the stack we manage ourselves:

On a Mac the on-device option is Draw Things’ LoRA training, which covers SDXL, Flux, Kwai Kolors and SD3.5 but not Qwen-Image or Z-Image. Even a large-memory Mac usually trains in the cloud anyway, because the mature trainers want CUDA.

8 Creative latitude

Getting one of these models to make something dark — a mushroom cloud for an anti-war poster, a face that stays sad, a truly threatening creature, violence of any kind — is hard. The models “want” to stay in an eternally bland equilibrium of mild happiness, and far more so than human artists. If I want to illustrate a book about the horrors of war, I’m pretty much fucked. The models exist in a world without sex, drugs, violence, or even strong emotion — a world of cute animals, pretty landscapes, and smiling people. There are four obstacles to maximal creative liberty:

  1. The weights are post-trained towards bland-and-pretty.
  2. The client runs a filter over the finished image.
  3. The licence governs what we may do with the model and its output.
  4. The law where we work governs what we may train on and publish.

The first two are what the community means by “uncensoring”; the last two govern whether we may uncensor the models, or even use them at all.

8.1 Bypassing the blandness post-training

The blandness in layer 1 comes from post-training. Most production-facing models get an RLHF (reinforcement learning from human feedback) or DPO (direct preference optimization) (Wu et al. 2026) pass that aligns outputs to human preference ratings — Diffusion-DPO (Wallace et al. 2023) is a representative public example of the technique. Those ratings correlate with positive affect and conventional aesthetics (McCormack et al. 2024), and against content that disquieted US-market human reviewers during training, so the models drift towards cute and pretty, emotionally beige. The same base weights with different post-training produce different behaviour — which is part of why the open ecosystem exists: edgier, more expressive options are very much in demand.

The positive-affect bias shows up most on underspecified prompts. This has theoretical backing in the preference alignment literature (Wallace et al. 2023) but AFAICT has not been studied specifically for image emotional valence. Community prompt engineering guides report that explicit physical descriptors work better than emotional labels: “tears streaming, jaw clenched, hollow eyes” rather than “sad character.” Adding positive-affect terms to the negative prompt — happy, smiling, cheerful, resolved, peaceful — is a commonly-reported workaround; the mechanism would be steering the sampler away from the part of latent space reinforced by positive preference ratings.
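In common samplers the negative prompt enters through classifier-free guidance: it takes the place of the empty “unconditional” prompt, and the update extrapolates away from it. A toy sketch with two-dimensional stand-ins for the noise predictions; the axes and numbers are invented for illustration, and real predictions are large tensors produced by the denoiser at every step.

```python
def guided_noise(pred_positive, pred_negative, guidance_scale):
    """Classifier-free guidance with a negative prompt.

    Starts from the prediction conditioned on the negative prompt and
    moves `guidance_scale` times along the direction towards the
    prediction conditioned on the positive prompt.
    """
    return [neg + guidance_scale * (pos - neg)
            for pos, neg in zip(pred_positive, pred_negative)]


# Toy axes: index 0 = "what the prompt asks for", index 1 = "cheerfulness".
pos = [1.0, 0.0]     # stand-in for "tears streaming, jaw clenched, hollow eyes"
neg = [0.0, 1.0]     # stand-in for "happy, smiling, cheerful"
step = guided_noise(pos, neg, guidance_scale=3.0)
```

With a scale above 1 the result overshoots the positive prediction on the first axis and goes negative on the second, which is the “steering away” the workaround relies on.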

A fine-tune trained on raw character or creature art, without the RLHF passes of a production model, keeps a wider emotional and compositional range — negative affect, dramatic poses, non-cute creatures. All of this is content a default checkpoint would render as cheerful regardless of the prompt. Pony Diffusion V6 XL is the canonical example of a model optimized for drama (and, tbh, horniness).

8.2 Remove the safety checker

The second layer is a filter in the client pipeline — typically a CLIP-based classifier (CLIP being the image-text matching model that scores how well an image matches a text label) that fires after the image is generated and blacks out or blurs the result if it is too edgy. Unlike the weights, this is client software and comes off easily: in diffusers-based pipelines the safety checker is a separately-loaded model component that can simply not be loaded; it is not part of the generation weights.
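To see why this layer is separable from the weights, here is a toy version of such a post-hoc filter. As far as I understand it, the real checkers compare a CLIP embedding of the finished image against a set of concept embeddings with per-concept thresholds; the three-dimensional vectors, the concept name and the threshold below are all invented, and no actual CLIP model is involved.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def flagged_concepts(image_embedding, concepts, thresholds):
    """Names of concepts whose similarity to the image exceeds their threshold."""
    return [name for name, vector in concepts.items()
            if cosine(image_embedding, vector) > thresholds[name]]


def postprocess(image, image_embedding, concepts, thresholds):
    """Return the image unchanged, or a black frame if any concept fires."""
    if flagged_concepts(image_embedding, concepts, thresholds):
        return [[0] * len(row) for row in image]
    return image


concepts = {"blocked_concept": [1.0, 0.0, 0.0]}     # hypothetical embedding
thresholds = {"blocked_concept": 0.8}                # hypothetical threshold
image = [[120, 130], [140, 150]]
kept = postprocess(image, [0.1, 0.9, 0.2], concepts, thresholds)
blanked = postprocess(image, [0.9, 0.1, 0.0], concepts, thresholds)
```

Note that `postprocess` only ever sees the finished image and its embedding, never the prompt or the denoiser, which matches the description above of a filter that acts on output rather than on generation.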

The same filter is what fires on politically or historically charged prompts — mushroom clouds, battle scenes, burned buildings. It acts on the output image, not the prompt, and the model weights themselves often carry no corresponding restriction. Framing such prompts in historical or clinical registers (“nuclear test, Bikini Atoll, 1952 archival photograph, monochrome, fallout plume”) is anecdotal community knowledge — I have seen it reported to help bypass output classifiers, but have no systematic evidence for it, and how durable it is against classifier updates is unknown.

8.3 Licensing

Permissive licensing. Latitude in the licence is a different axis from latitude in the weights, and three regimes are in play:

  • Apache 2.0 (FLUX.1 schnell, FLUX.2 [klein] 4B, Z-Image Turbo) places no restriction on subject matter at all.
  • OpenRAIL — and its SD-family variant CreativeML OpenRAIL-M, which Stable Diffusion 1.5 used and Pony inherits — keeps the weights open but attaches a use-restriction clause prohibiting a listed set of harmful applications (child sexual abuse material, certain disinformation uses). It restricts what the model is used for, not who uses it.
  • Non-commercial (FLUX.1 dev, FLUX.2 [klein] 9B and FLUX.2 [dev]) restricts commercial use of the weights, and Black Forest Labs ships a separate content-filter module alongside, which is often conflated with the weights but is distinct from them.

So the most permissively-licensed open weights are the Apache-2.0 ones, while the most permissive in content are the RLHF-skipped fine-tunes — and those need not be the same model.

HuggingFace hosts relatively anodyne fine-tunes, while CivitAI hosts the edgier long tail. Most uploads inherit their backbone’s licence, so an SD-family fine-tune is usually CreativeML OpenRAIL-M whether or not its page says so.

9 Jurisdiction

The legal landscape for training on copyrighted images, running inference, and publishing outputs differs across jurisdictions. What follows is a snapshot as of mid-2026 in jurisdictions I care about; case law is thin in all four regimes and several questions are actively contested.

Useful legal term of art: a TDM (text and data mining) exception is the copyright carve-out that lets a model train on in-copyright works without first licensing them.

| | Japan | Singapore | EU | Australia |
|---|---|---|---|---|
| Training on copyrighted images | Article 30-4, Copyright Act: non-waivable safe harbour for “information analysis / pattern recognition” uses; covers commercial and non-commercial | Section 244, Copyright Act 2021: non-waivable computational data analysis exception; scope is “freely available” works | DSM Directive 2019/790 Arts. 3–4 TDM exception; Article 4 allows rightholders to opt out via machine-readable reservation | No TDM exception (government rejected Productivity Commission proposal in October 2025); s.40 Copyright Act 1968 fair dealing for research is narrow and enumerated; ML training likely reproduction infringement |
| Running inference / publishing outputs | Agency for Cultural Affairs 2024 guidance notes Art. 30-4 may not cover inference outputs that “recreate the enjoyment” of a specific creator’s style; contested; no court ruling | Fair use (Section 190); transformative purpose is one of four factors; no case law specific to AI image outputs yet | Watermarking and AI disclosure mandatory from 2 August 2026 under AI Act Art. 50; parody defence narrow and member-state-dependent | Inference outputs generally do not reproduce training data (sufficiently transformed); s.41A fair dealing for parody/satire available but narrow: copyright material must itself be satirised (Universal Music v Palmer [2021] FCA 434); no AI-specific case law |
| Copyright in AI output | Emerging interpretation: sufficiently detailed prompts may create copyrightable output; no settled case law | Human authorship requirement means purely AI-generated outputs are not automatically copyrightable | No unified EU position; most member states treat purely AI-generated output as uncopyrightable; contested at the CJEU (Court of Justice of the EU) level | No copyright without human authorship (s.32(3) Copyright Act 1968); no computer-generated works provision (unlike UK CDPA s.9(3)); government consulting on reform, no timeline |
| Content restrictions | No AI-specific content ban; general obscenity law applies; April 2026 Justice Ministry panel examining deepfakes | May 2026 Online Criminal Harms Act guidelines tighten enforcement on AI-generated NCII (non-consensual intimate imagery) of real persons; no restriction on artistic/research content not depicting real persons | AI Act Art. 50: machine-readable watermark + disclosure mandatory on all synthetic images from 2 August 2026 | Online Safety Act 2021 governs published content; Criminal Code Amendment (Deepfake Sexual Material) Act 2024 creates federal offences for non-consensual intimate deepfakes (up to 6 years); SA, NSW, QLD state laws adding further restrictions; locally-generated unpublished content generally unregulated |
| Parody / satire | Not a statutory exception; court discretion; no settled case law for AI-generated parody | Transformativeness is a fair-use factor; research/commentary purpose can qualify; no AI-specific precedent | Narrow member-state exceptions; not harmonized; courts have not addressed AI-generated parody intent | Specific s.41A fair dealing exception (added 2006); test from Universal Music v Palmer [2021]: the copyright material itself must be the target of parody; human creative intent required; untested for AI-generated parody |

Japan: The interpretive question is whether inference “for the enjoyment of the expression” falls outside Art. 30-4’s scope. The 2024 Agency for Cultural Affairs guidance is advisory, not binding case law. Publisher suits against AI companies (Asahi/Mainichi v. Perplexity, filed September 2025) may produce clearer precedent; outcomes are pending as of this writing.

Singapore: Section 244 was modelled on Japan’s Art. 30-4 and is explicitly non-waivable — copyright owners cannot eliminate it via their terms of service (ToS), which matters when using images from platforms with “no AI training” clauses. The May 2026 Online Criminal Harms Act guidelines target AI-generated NCII of real persons specifically, not artistic or scientific generation.

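EU: C2PA (Coalition for Content Provenance and Authenticity) metadata is the emerging standard for Art. 50 compliance: machine-readable provenance embedded at generation time. The c2patool CLI, from the Content Authenticity Initiative, is an open-source implementation. Non-compliance with Art. 50 from 2 August 2026 attracts fines of up to €15M or 3% of global annual turnover under AI Act Art. 99(4); the higher €35M / 7% tier is reserved for the prohibited practices in Art. 5.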

Australia: The most restrictive of these four jurisdictions for training — no TDM exception and no general fair use doctrine, only enumerated fair dealing categories. For published research outputs, the deepfake provisions in the Criminal Code and state laws are expanding rapidly (three state-level laws passed between November 2025 and early 2026); the common thread across all of them is that they target distribution of intimate imagery involving real persons, not research generation. The s.41A parody exception exists and has been tested (Universal Music v Palmer), but courts will want evidence of human satirical intent, which creates an awkward question for AI-generated outputs where the human’s role was primarily prompt engineering.

9.1 Hygiene

Several things seem to reduce legal exposure, per practitioners and lawyers in this area.

Documenting purpose before generating — a paragraph in a lab notebook or file header — creates contemporaneous evidence of intent. Fair-use and fair-dealing analysis in all four jurisdictions makes purpose and intent explicit factors.

Keeping prompt logs for published work has become a practical precaution. Courts have begun ordering preservation of AI generation records specifically: the SDNY’s May 2025 order in New York Times v. OpenAI required OpenAI to preserve “all output log data on a going forward basis”; a National Law Review analysis (Feb 2026) notes that GenAI prompts, outputs, and logs are now treated as discoverable ESI (electronically stored information) under standard rules. Retrospective reconstruction is harder to defend than contemporaneous records.
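A prompt log does not need to be elaborate. Here is a minimal sketch, assuming an append-only JSON Lines file is an acceptable record format; the function name `log_generation` and every field name are my own illustrative choices, not a legal or industry standard. Hashing the image file ties each log line to one specific output.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def log_generation(log_path, image_path, prompt, model, seed, purpose):
    """Append one JSON line describing a generated image to log_path.

    All field names are illustrative assumptions, not a standard schema.
    """
    image_bytes = Path(image_path).read_bytes()
    record = {
        # Timestamp written at generation time, in UTC.
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "image_file": Path(image_path).name,
        # The hash links this log line to exactly one output file.
        "image_sha256": hashlib.sha256(image_bytes).hexdigest(),
        "prompt": prompt,
        "model": model,
        "seed": seed,
        # Free-text statement of why the image was made.
        "purpose": purpose,
    }
    # Append mode: earlier records are never rewritten.
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record
```

Calling this right after each image is saved gives a contemporaneous record; keeping the log under version control or in write-once storage would make its timing easier to demonstrate later.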

Embedding C2PA provenance metadata at generation time — mandatory in the EU from August 2026 — is useful elsewhere for the same reason: it makes the origin of an image legible without relying on the viewer to trust a caption. Draw Things and some ComfyUI node packs support it; the c2patool CLI works on any file.
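For the c2patool route, a rough sketch of what this can look like from Python. Treat the details as assumptions to check against the c2patool documentation for your installed version: the manifest layout (a `c2pa.actions` assertion with a `c2pa.created` action), the IPTC `trainedAlgorithmicMedia` source type, and the `-m` / `-o` invocation are my reading of the tool's README, and the helper names are hypothetical. Without a signing certificate configured, c2patool signs with a built-in test certificate as far as I understand, which is fine for experiments but not for anything meant to be trusted.

```python
import json
import subprocess
from pathlib import Path

# IPTC digital source type used to mark media produced by a trained model.
TRAINED_ALGORITHMIC_MEDIA = (
    "http://cv.iptc.org/newscodes/digitalsourcetype/trainedAlgorithmicMedia"
)


def build_manifest(title, generator_name):
    """Return a manifest definition dict (assumed layout, see lead-in)."""
    return {
        "claim_generator": generator_name,
        "title": title,
        "assertions": [
            {
                "label": "c2pa.actions",
                "data": {
                    "actions": [
                        {
                            "action": "c2pa.created",
                            "digitalSourceType": TRAINED_ALGORITHMIC_MEDIA,
                        }
                    ]
                },
            }
        ],
    }


def c2patool_command(image_path, manifest_path, output_path):
    """Build the argument list; flags assumed from the c2patool README."""
    return [
        "c2patool", str(image_path),
        "-m", str(manifest_path),
        "-o", str(output_path),
    ]


def sign_image(image_path, output_path, manifest, manifest_path="manifest.json"):
    """Write the manifest to disk and run c2patool (must be on PATH)."""
    Path(manifest_path).write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    subprocess.run(
        c2patool_command(image_path, manifest_path, output_path), check=True
    )
```

Something like `sign_image("out.png", "out_signed.png", build_manifest("Untitled study", "my-notebook/0.1"))` would then write a signed copy next to the original, leaving the unsigned file untouched.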

Distributing NSFW or dramatically charged research imagery only within a research team or behind access controls is a common precaution. All four jurisdictions hinge liability primarily on distribution, not generation, which means the same image can be lower-risk as an unpublished research artefact than as a blog illustration.

Identifiable real persons in embarrassing or intimate contexts carry civil and — in some jurisdictions — criminal exposure regardless of artistic intent.

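If our work may reach multiple jurisdictions, treating the EU Art. 50 requirements as a floor (machine-readable watermark + human-readable “AI-generated” disclosure) should cover the disclosure side in all four regimes at once. It does not address the copyright or real-person issues above.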

10 References

Cao, Hidalgo, Simon, et al. 2019. “OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields.”
Chu, Tian, Wang, et al. 2021. “Twins: Revisiting the Design of Spatial Attention in Vision Transformers.” In Advances in Neural Information Processing Systems.
Dhariwal, and Nichol. 2021. “Diffusion Models Beat GANs on Image Synthesis.” arXiv:2105.05233 [Cs, Stat].
Dosovitskiy, Beyer, Kolesnikov, et al. 2021. “An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale.”
Dutordoir, Saul, Ghahramani, et al. 2022. “Neural Diffusion Processes.”
Guo, Jia, and Bai. 2022. “Transformer Based on Channel-Spatial Attention for Accurate Classification of Scenes in Remote Sensing Image.” Scientific Reports.
Han, Zheng, and Zhou. 2022. “CARD: Classification and Regression Diffusion Models.”
He, Chen, Xie, et al. 2021. “Masked Autoencoders Are Scalable Vision Learners.”
Ho, Jain, and Abbeel. 2020. “Denoising Diffusion Probabilistic Models.” In Proceedings of the 34th International Conference on Neural Information Processing Systems. NIPS ’20.
Hoogeboom, Gritsenko, Bastings, et al. 2021. “Autoregressive Diffusion Models.” arXiv:2110.02037 [Cs, Stat].
McCormack, Llano, Krol, et al. 2024. “No Longer Trending on Artstation: Prompt Analysis of Generative AI Art.”
Nichol, and Dhariwal. 2021. “Improved Denoising Diffusion Probabilistic Models.” In Proceedings of the 38th International Conference on Machine Learning.
Podell, English, Lacey, et al. 2023. “SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis.”
Sohl-Dickstein, Weiss, Maheswaranathan, et al. 2015. “Deep Unsupervised Learning Using Nonequilibrium Thermodynamics.”
Song, Yang, and Ermon. 2020a. “Generative Modeling by Estimating Gradients of the Data Distribution.” In Advances In Neural Information Processing Systems.
———. 2020b. “Improved Techniques for Training Score-Based Generative Models.” In Advances In Neural Information Processing Systems.
Song, Jiaming, Meng, and Ermon. 2021. “Denoising Diffusion Implicit Models.” arXiv:2010.02502 [Cs].
von Platen, Patil, Lozhkov, et al. 2022. “Diffusers: State-of-the-Art Diffusion Models.”
Wallace, Dang, Rafailov, et al. 2023. “Diffusion Model Alignment Using Direct Preference Optimization.”
Wu, Si, Xing, et al. 2026. “Preference Alignment on Diffusion Models: A Comprehensive Survey for Image Generation and Editing.” Computer Science Review.
Yang, Zhang, Song, et al. 2023. “Diffusion Models: A Comprehensive Survey of Methods and Applications.” ACM Computing Surveys.
Zhu, Cheng, Zhang, et al. 2019. “An Empirical Study of Spatial Attention Mechanisms in Deep Networks.” In.

Footnotes

  1. My on-ramp was Adventures in Finetuning Stable Diffusion (Pokémon fine-tuning, 2022).