Neural denoising diffusion models of language

2025-03-12 — 2026-06-15

Wherein Bidirectional Attention Is Shown to Enable Native Text Infilling, and DiffusionGemma Is Presented as a Concrete Instantiation of Discrete Diffusion Over a Jointly Denoised Token Canvas.

approximation

Bayes

generative

Monte Carlo

neural nets

optimization

probabilistic algorithms

probability

score function

statistics

Neural diffusion models, but for generating words instead of pictures. A special kind of discrete diffusion.

1 DiffusionGemma

DiffusionGemma is Google’s open-weights diffusion language model, built on the Gemma 4 mixture-of-experts model. Where an autoregressive model emits one token at a time, DiffusionGemma fixes a block of tokens, a “canvas”, and refines that block jointly over its denoising steps. That holds only at the decoder stage, though: an autoregressive encoder reads and caches the prompt, and the diffusion decoder then denoises the generation canvas under bi-directional attention.
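To make "refines that block jointly" concrete, here is a toy sketch of one common way to run such a loop: start from a fully masked canvas and, at each step, commit the most confident proposals, in the spirit of Mask-Predict (Ghazvininejad et al. 2019). This is an illustration under stated assumptions, not DiffusionGemma's actual sampler. The names `toy_denoiser` and `refine`, the `<mask>` token, and the equal-share unmasking schedule are all hypothetical, and the "denoiser" cheats by peeking at a known target so that the loop can run without a trained network.

```python
import random

MASK = "<mask>"  # hypothetical mask token, for illustration only


def toy_denoiser(canvas, target, rng):
    """Stand-in for a trained network (assumption, not a real model).

    Proposes a (token, confidence) pair for every masked position.
    It simply reads the answer from `target` and attaches a random
    confidence, purely to exercise the refinement loop.
    """
    return {i: (target[i], rng.random())
            for i, tok in enumerate(canvas) if tok == MASK}


def refine(target, steps=4, seed=0):
    """Fill a fully masked canvas over at most `steps` parallel steps."""
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    history = [list(canvas)]
    for step in range(steps):
        proposals = toy_denoiser(canvas, target, rng)
        if not proposals:
            break
        # Commit an equal share of the remaining masks at each step,
        # most confident first; the final step commits whatever is left.
        remaining_steps = steps - step
        k = -(-len(proposals) // remaining_steps)  # ceiling division
        best = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)[:k]
        for i in best:
            canvas[i] = proposals[i][0]
        history.append(list(canvas))
    return canvas, history


if __name__ == "__main__":
    final, trace = refine("the cat sat on the mat".split(), steps=4)
    for state in trace:
        print(" ".join(state))
```

The point of the sketch is the shape of the computation: every step looks at the whole canvas, and positions are filled in order of confidence rather than left to right.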

The bi-directional attention is kinda cool; AFAICT it is plain non-causal attention, which is possible because the whole canvas is generated at once. An autoregressive model factorises \(p(x) = \prod_i p(x^i \mid x^{<i})\) under a causal mask, where \(x^i\) is the token at position \(i\), so each token attends only to its past. Discrete diffusion has no such constraint (at least, not within a single canvas). A forward process corrupts a clean sequence \(x_0\) into an increasingly noised/stochastically masked \(x_t\), where \(t\) now indexes the noise level rather than the position, and the model learns a reverse denoiser \(p_\theta(x_0 \mid x_t)\) that predicts the missing positions jointly. A token in the middle conditions on both its left and its right context. That makes infilling and edits in the middle of a document a native capability.
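A small sketch may help to see the difference between the two attention patterns and what the forward process does. It assumes the simplest masking corruption, in which each token is independently replaced by a mask token with probability \(t\); the source does not say which noise schedule DiffusionGemma uses, so treat that as an assumption. The helper names (`causal_mask`, `bidirectional_mask`, `corrupt`, `visible_context`) are made up for this example.

```python
import random

MASK = "<mask>"  # hypothetical mask token, for illustration only


def causal_mask(n):
    """allowed[i][j] is True when position i may attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Non-causal attention: every position may attend to every position."""
    return [[True] * n for _ in range(n)]


def corrupt(tokens, t, rng):
    """Assumed forward process: mask each token independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def visible_context(tokens, allowed, i):
    """Tokens other than position i that position i is allowed to attend to."""
    return [tok for j, tok in enumerate(tokens) if allowed[i][j] and j != i]


if __name__ == "__main__":
    tokens = "the cat sat on the mat".split()
    n = len(tokens)

    # Position 2 ("sat") sees only its past under a causal mask...
    print(visible_context(tokens, causal_mask(n), 2))         # ['the', 'cat']
    # ...and both sides under bi-directional attention.
    print(visible_context(tokens, bidirectional_mask(n), 2))  # ['the', 'cat', 'on', 'the', 'mat']

    # A partially masked x_t at noise level t = 0.5.
    print(corrupt(tokens, 0.5, random.Random(0)))
```

Infilling falls out of the same picture: pin the known tokens, mask the gap, and let the denoiser fill the masked positions using context on both sides.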

I am somewhat interested in what this suggests for style transfer, where we hold a passage’s structure and meaning fixed and change only its register. (Lyu et al. 2023) train a diffusion model for fine-grained text style transfer on StylePTB, and (Zhang et al. 2025) apply discrete diffusion to flexible-length infilling. DiffusionGemma seems tunable, with LoRA recipes via Unsloth, Google’s Hackable Diffusion JAX toolbox, and NVIDIA NeMo. It also runs on a Mac.

NB: diffusion models look like an awkward fit for coding interfaces.

2 Incoming

Diffusion Beats Autoregressive in Data-Constrained Settings (Prabhudesai et al. 2025)
Inception Labs’ Mercury is the commercial cousin: a diffusion LLM family (Labs et al. 2025) now shipping as Mercury 2 and served on Azure AI Foundry.
dvruette/gidd on GitHub: code accompanying “Generalized Interpolating Discrete Diffusion” (Rütte et al. 2025)

3 References

Ghazvininejad, Levy, Liu, et al. 2019. “Mask-Predict: Parallel Decoding of Conditional Masked Language Models.”

Labs, Khanna, Kharbanda, et al. 2025. “Mercury: Ultra-Fast Language Models Based on Diffusion.”

Li, Thickstun, Gulrajani, et al. 2022. “Diffusion-LM Improves Controllable Text Generation.”

Lyu, Luo, Shi, et al. 2023. “Fine-Grained Text Style Transfer with Diffusion-Based Language Models.” In Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023).

Prabhudesai, Wu, Zadeh, et al. 2025. “Diffusion Beats Autoregressive in Data-Constrained Settings.”

Rütte, Fluri, Ding, et al. 2025. “Generalized Interpolating Discrete Diffusion.”

Savinov, Chung, Binkowski, et al. 2022. “Step-Unrolled Denoising Autoencoders for Text Generation.”

Strudel, Tallec, Altché, et al. 2022. “Self-Conditioned Embedding Diffusion for Text Generation.”

Ye, Gong, Chen, et al. 2024. “Diffusion of Thoughts: Chain-of-Thought Reasoning in Diffusion Language Models.”

Zhang, Sivakumar, Tang, et al. 2025. “Flexible-Length Text Infilling for Discrete Diffusion Models.”

Zheng, Yuan, Yu, et al. 2024. “A Reparameterized Discrete Diffusion Model for Text Generation.”

Zou, Kim, and Kang. 2023. “A Survey of Diffusion Models in Natural Language Processing.”