Neural denoising diffusion models of language

2025-03-12 — 2026-06-15

Wherein Bidirectional Attention Is Shown to Enable Native Text Infilling, and DiffusionGemma Is Presented as a Concrete Instantiation of Discrete Diffusion Over a Jointly Denoised Token Canvas.

approximation
Bayes
generative
Monte Carlo
neural nets
optimization
probabilistic algorithms
probability
score function
statistics
Figure 1

Neural diffusion models, but for generating words instead of pictures. A special kind of discrete diffusion.

1 DiffusionGemma

DiffusionGemma is Google’s open-weights diffusion language model, built on the Gemma 4 mixture-of-experts model. Where an autoregressive model emits one token at a time, DiffusionGemma fixes a block of tokens (a “canvas”) and refines that block jointly over its denoising steps. That holds at the decoder stage, at least: an autoregressive encoder reads and caches the prompt, and the diffusion decoder then denoises the generation canvas under bidirectional attention.
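To make "refines that block jointly" concrete, here is a toy sketch of iterative canvas refinement in the mask-predict style. It is not DiffusionGemma's actual sampler. `toy_denoiser` is a hypothetical stand-in that cheats by peeking at a known target, and the confidence rule and the commit schedule are my assumptions, chosen only to show the shape of the loop: start from a fully masked canvas, score every masked position at once, commit the most confident ones, repeat.

```python
import random

MASK = "<mask>"


def toy_denoiser(canvas, target, rng):
    """Hypothetical stand-in for a trained denoiser (an assumption, not a real API).

    For every masked position it proposes a token and a confidence. It cheats
    by reading the proposal from `target`. Confidence grows with the number of
    already-revealed neighbours, a crude imitation of how context on either
    side makes a prediction easier.
    """
    proposals = {}
    for i, tok in enumerate(canvas):
        if tok != MASK:
            continue
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(canvas)]
        revealed = sum(canvas[j] != MASK for j in neighbours)
        confidence = 0.3 + 0.3 * revealed + 0.1 * rng.random()
        proposals[i] = (target[i], confidence)
    return proposals


def refine(target, num_steps, seed=0):
    """Fill a fully masked canvas over at most `num_steps` parallel steps."""
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    history = [list(canvas)]
    for step in range(num_steps):
        proposals = toy_denoiser(canvas, target, rng)
        if not proposals:
            break  # nothing left to denoise
        remaining_steps = num_steps - step
        # Commit ceil(masked / remaining_steps) positions so the canvas is
        # complete by the final step.
        k = -(-len(proposals) // remaining_steps)
        best = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)[:k]
        for i in best:
            canvas[i] = proposals[i][0]
        history.append(list(canvas))
    return canvas, history


if __name__ == "__main__":
    target = "the cat sat on the warm mat today".split()
    final, history = refine(target, num_steps=4)
    for snapshot in history:
        print(" ".join(snapshot))
```

Every step scores all masked positions in one pass and several tokens land per step, in whatever order the confidences dictate rather than left to right. A real sampler would replace the oracle with model logits and might also re-mask low-confidence tokens it had already committed.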

The bidirectional attention is kinda cool; AFAICT it is just non-causal attention, which is possible because the whole canvas is generated at once. An autoregressive model factorises \(p(x) = \prod_i p(x^i \mid x^{<i})\) over token positions \(i\) under a causal mask, so each token attends only to its past. Discrete diffusion has no such constraint (at least, not within a single canvas). A forward process corrupts a clean sequence \(x_0\) into an increasingly noised (stochastically masked) \(x_t\), where the subscript now indexes the noise level rather than the token position, and the model learns a reverse denoiser \(p_\theta(x_0 \mid x_t)\) that predicts the missing positions jointly. A token in the middle conditions on both its left and its right. That makes infilling and edits in the middle of a document a native capability.
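A minimal sketch of the two ingredients in that paragraph, in plain Python. The forward process below is the simplest absorbing-state corruption, where each token is independently replaced by a mask with probability \(t\). That linear schedule is my assumption for illustration, not something I know about DiffusionGemma's training. `known_context` is a hypothetical helper that lists which clean tokens a prediction at position `i` may condition on under a causal or a bidirectional regime.

```python
import random

MASK = "<mask>"


def forward_mask(x0, t, rng):
    """Corrupt clean sequence x0 into x_t.

    Assumed schedule: each token independently becomes MASK with
    probability t, so t=0 leaves x0 intact and t=1 masks everything.
    """
    return [MASK if rng.random() < t else tok for tok in x0]


def known_context(canvas, i, causal):
    """Unmasked tokens that can inform the prediction at position i.

    causal=True  : only positions strictly before i (autoregressive view).
    causal=False : every other position (bidirectional view).
    """
    if causal:
        positions = range(i)
    else:
        positions = (j for j in range(len(canvas)) if j != i)
    return [canvas[j] for j in positions if canvas[j] != MASK]


if __name__ == "__main__":
    rng = random.Random(0)
    x0 = ["the", "cat", "sat", "on", "the", "mat"]
    for t in (0.0, 0.5, 1.0):
        print(t, forward_mask(x0, t, rng))

    # Infilling: the hole at position 2 has clean text on both sides.
    canvas = ["the", "cat", MASK, "on", "the", "mat"]
    print("causal       :", known_context(canvas, 2, causal=True))
    print("bidirectional:", known_context(canvas, 2, causal=False))
```

For the hole in the middle, the causal view can use only "the cat", while the bidirectional view also sees "on the mat". That is the sense in which infilling is native here rather than bolted on.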

I am somewhat interested in what this suggests for style transfer, where we hold a passage’s structure and meaning fixed and change only its register. Lyu et al. (2023) train a diffusion model for fine-grained text style transfer on StylePTB, and Zhang et al. (2025) use discrete diffusion for flexible-length infilling. DiffusionGemma seems tunable, with LoRA recipes via Unsloth, Google’s Hackable Diffusion JAX toolbox, and NVIDIA NeMo. It also runs on a Mac.

NB: diffusion models look like an awkward fit for coding interfaces.

2 Incoming

3 References

Ghazvininejad, Levy, Liu, et al. 2019. “Mask-Predict: Parallel Decoding of Conditional Masked Language Models.”
Labs, Khanna, Kharbanda, et al. 2025. “Mercury: Ultra-Fast Language Models Based on Diffusion.”
Li, Thickstun, Gulrajani, et al. 2022. “Diffusion-LM Improves Controllable Text Generation.”
Lyu, Luo, Shi, et al. 2023. “Fine-Grained Text Style Transfer with Diffusion-Based Language Models.” In Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023).
Prabhudesai, Wu, Zadeh, et al. 2025. “Diffusion Beats Autoregressive in Data-Constrained Settings.”
Rütte, Fluri, Ding, et al. 2025. “Generalized Interpolating Discrete Diffusion.”
Savinov, Chung, Binkowski, et al. 2022. “Step-Unrolled Denoising Autoencoders for Text Generation.”
Strudel, Tallec, Altché, et al. 2022. “Self-Conditioned Embedding Diffusion for Text Generation.”
Ye, Gong, Chen, et al. 2024. “Diffusion of Thoughts: Chain-of-Thought Reasoning in Diffusion Language Models.”
Zhang, Sivakumar, Tang, et al. 2025. “Flexible-Length Text Infilling for Discrete Diffusion Models.”
Zheng, Yuan, Yu, et al. 2024. “A Reparameterized Discrete Diffusion Model for Text Generation.”
Zou, Kim, and Kang. 2023. “A Survey of Diffusion Models in Natural Language Processing.”