Neural denoising diffusion models of language
2025-03-12 — 2026-06-15
Wherein Bidirectional Attention Is Shown to Enable Native Text Infilling, and DiffusionGemma Is Presented as a Concrete Instantiation of Discrete Diffusion Over a Jointly Denoised Token Canvas.
Neural diffusion models, but for generating words instead of pictures; a special case of discrete diffusion.
1 DiffusionGemma
DiffusionGemma is Google’s open-weights diffusion language model, built on the Gemma 4 mixture-of-experts. Where an autoregressive model emits one token at a time, DiffusionGemma fixes a block of tokens (a “canvas”) and refines that block jointly over its denoising steps. Strictly, that holds only at the decoder stage: an autoregressive encoder reads and caches the prompt, and the diffusion decoder then denoises the generation canvas under bi-directional attention.
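To make the "refine the canvas jointly" idea concrete, here is a minimal sketch of an iterative unmasking loop. Everything in it is an assumption for illustration: `toy_denoiser` is a hypothetical stand-in that guesses random tokens, and the "commit the most confident guesses first" schedule is one common choice for masked decoders, not necessarily what DiffusionGemma does.

```python
import random

MASK = "<mask>"


def toy_denoiser(canvas, rng):
    """Hypothetical stand-in for a trained network (an assumption).

    It receives the whole canvas, so a real model could use both the left
    and the right context of every position. Here it just returns a random
    (token, confidence) guess for each position that is still masked.
    """
    vocab = ["the", "cat", "sat", "on", "mat"]
    return {
        i: (rng.choice(vocab), rng.random())
        for i, tok in enumerate(canvas)
        if tok == MASK
    }


def denoise(canvas_len, num_steps, seed=0):
    """Start from a fully masked canvas and fill it in over num_steps passes."""
    rng = random.Random(seed)
    canvas = [MASK] * canvas_len
    for step in range(num_steps):
        guesses = toy_denoiser(canvas, rng)
        if not guesses:
            break  # nothing left to fill in
        # Commit an even share of the remaining masked positions each step,
        # most confident first; the final step commits whatever is left.
        steps_left = num_steps - step
        k = -(-len(guesses) // steps_left)  # ceiling division
        best = sorted(guesses, key=lambda i: guesses[i][1], reverse=True)[:k]
        for i in best:
            canvas[i] = guesses[i][0]
    return canvas


if __name__ == "__main__":
    print(denoise(canvas_len=8, num_steps=4))
```

The contrast with autoregressive decoding is that each pass scores all open positions at once, so the number of network calls is set by `num_steps` rather than by the canvas length.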
The bi-directional attention is kinda cool; AFAICT it is non-causal attention, since the whole canvas is generated at once. An autoregressive model factorises \(p(x) = \prod_i p(x_i \mid x_{<i})\) over positions \(i\) under a causal mask, so each token attends only to its past. Discrete diffusion has no such constraint (at least, not within a single canvas). A forward process corrupts a clean sequence \(x_0\) into an increasingly noised (stochastically masked) sequence \(x_t\), where \(t\) now indexes the noise level rather than the position, and the model learns a reverse denoiser \(p_\theta(x_0 \mid x_t)\) that predicts the missing positions jointly. A token in the middle conditions on both its left and its right context. That makes infilling and edits in the middle of a document a native capability.
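A small sketch of the two ingredients above: the attention masks, and a masking-style forward process. The function names are hypothetical, and masking each token independently with probability `t` is an assumed (simple, absorbing-state) corruption schedule rather than the one any particular model uses.

```python
import random

MASK = "<mask>"


def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position may attend to every other position in the canvas."""
    return [[True] * n for _ in range(n)]


def corrupt(x0, t, rng):
    """Forward-process sketch: mask each token independently with probability t.

    t = 0 leaves x0 untouched; t = 1 masks every token.
    """
    return [MASK if rng.random() < t else tok for tok in x0]


if __name__ == "__main__":
    x0 = "the cat sat on the mat".split()
    rng = random.Random(0)
    for t in (0.25, 0.5, 0.75):
        print(t, corrupt(x0, t, rng))
    # Row 1 of each mask: what the second token is allowed to see.
    print(causal_mask(3)[1], bidirectional_mask(3)[1])
```

Under the causal mask the second token sees only positions 0 and 1; under the bi-directional mask it also sees position 2, which is what lets a denoiser fill a gap using the text on both sides of it.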
I am somewhat interested in what this suggests for style transfer, where we hold a passage’s structure and meaning fixed and change only its register. Lyu et al. (2023) train a diffusion model for fine-grained text style transfer on StylePTB, and Zhang et al. (2025) apply discrete diffusion to flexible-length infilling. DiffusionGemma seems tunable, with LoRA recipes via Unsloth, Google’s Hackable Diffusion JAX toolbox, and NVIDIA NeMo. It also runs on a Mac.
NB: diffusion models look like an awkward fit for coding interfaces.
2 Incoming
- Diffusion Beats Autoregressive in Data-Constrained Settings (Prabhudesai et al. 2025)
- Inception Labs’ Mercury is the commercial cousin: a diffusion LLM family (Labs et al. 2025) now shipping as Mercury 2 and served on Azure AI Foundry.
- dvruette/gidd on GitHub: code accompanying the paper “Generalized Interpolating Discrete Diffusion”
