Neural generative audio

2016-01-15 — 2026-05-04

Wherein the lineage of audio-generation approaches is traced from sample-level autoregression through to latent diffusion, and the DAW plugin as a delivery mechanism for open-weights models is noted.

generative art
machine learning
machine listening
making things
music
neural nets
signal processing

Neural networks generating audio: music, sound effects, the lot. This is a quick orientation page rather than a thorough survey; I have not been following closely. For the symbolic / MIDI side we have a separate notebook; for speech see voice fakes; for non‑NN signal models see analysis/resynthesis; and the underlying machinery sits in neural diffusion.

1 Where things stand (2025–26)

The current crop of open models:

| Model | SFX | Music | Max length | Sample rate | Paper | Code | Weights |
|---|---|---|---|---|---|---|---|
| Stable Audio Open 1.0 | ✓ | ✓ | 47 s | 44.1 kHz stereo | (Evans et al. 2024) | stability-ai/stable-audio-tools | HF |
| AudioGen (AudioCraft) | ✓ | | ~10 s | 16 kHz mono | (Kreuk, Synnaeve, et al. 2022) | facebookresearch/audiocraft | HF |
| MusicGen (AudioCraft) | | ✓ | 30 s native, longer via sliding window | 32 kHz | (Copet et al. 2023) | facebookresearch/audiocraft | HF |
| AudioLDM / AudioLDM 2 | ✓ | ✓ | ~10 s native | 16 kHz (v1), 48 kHz checkpoint (v2) | (H. Liu, Tian, et al. 2023; H. Liu, Chen, et al. 2023) | haoheliu/AudioLDM2 | HF |

AudioCraft ships training scripts; stable-audio-tools ships training scaffolding too, with the released open weights being for non‑commercial use under the Stability AI Community License.

Stable Audio Open can also be served as an endpoint via vLLM‑Omni, which is convenient if we want a dedicated audio‑generation service.

1.1 In the DAW

OBSIDIAN Neural is a free AGPL‑3.0 VST3 / AU plugin (Windows / macOS / Linux) which wraps Stable Audio Open with MIDI triggering and tempo sync, plus several specialised fine‑tunes. The repository is still at innermost47/ai-dj under its old name; the homepage and README reflect the rebrand. It is, as the README puts it, “30 seconds of patience per loop.”

2 How we got here

A compressed timeline. NB I have been checked out for the last 3 years and have missed much progress.

2.1 2016–18 — Raw waveform, sample by sample

WaveNet (DeepMind) and SampleRNN (Mehri et al. 2017) modelled audio one sample at a time with autoregressive networks. NSynth / WaveNet autoencoders (Engel et al. 2017) applied the idea to musical timbre. Dadabots’ SampleRNN metal albums are the entertaining proof‑of‑concept.

Sander Dieleman’s 2020 essay on waveform‑domain synthesis is still the best orientation piece for this era.

2.2 2018–20 — Alternatives to autoregression

Magenta’s DDSP (code) embedded classical synthesis modules — oscillators, filters, reverb — as differentiable layers, so we get parameter inference instead of waveform regression. The timbre transfer demo still demos well, even though Magenta itself has stagnated.

GANSynth used GANs over spectrogram representations; WaveGAN did the same for raw waveforms. MelNet went big on conditional spectrogram modelling. OpenAI’s Jukebox was the most ambitious of the era — a hierarchical VQ‑VAE plus autoregressive transformer trained on raw music — and is mostly historical now, superseded by latent diffusion.

2.3 2020–22 — Diffusion arrives

WaveGrad (N. Chen et al. 2020) and DiffWave (Kong et al. 2021) demonstrated denoising diffusion on raw audio. SaShiMi (Goel et al. 2022) showed that structured state‑space models could model raw audio as well as anything autoregressive — see the examples and code. NU‑Wave (Lee and Han 2021) applied diffusion to audio upsampling, and Pascual et al. (2022) to full‑band general audio synthesis.

2.4 2022–23 — Text conditioning

CLAP (Wu et al. 2023) gave us a contrastive joint embedding for text and audio, which is what most of the text‑to‑audio models use under the hood. Meta’s AudioGen (Kreuk, Synnaeve, et al. 2022) and Google’s MusicLM got the text‑to‑audio and text‑to‑music ball rolling. MusicGen (Copet et al. 2023) (the AudioCraft one) collapsed the cascade into a single transformer LM over EnCodec tokens.
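The contrastive trick is simple enough to caricature in a few lines. Below is a toy numpy sketch — the two "towers" are random linear projections rather than trained encoders, and all the names and dimensions are made up — but it shows the mechanism: embed audio and text into one space, then score matched pairs against mismatched ones with a softmax over cosine similarities.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Toy stand-ins for the two towers: random projections into a shared
# embedding space. Real CLAP trains an audio encoder and a text encoder
# jointly; only the projection-into-shared-space shape matters here.
D_AUDIO, D_TEXT, D_EMB = 128, 32, 16
W_audio = rng.normal(size=(D_AUDIO, D_EMB))
W_text = rng.normal(size=(D_TEXT, D_EMB))

audio_feats = rng.normal(size=(4, D_AUDIO))  # a batch of 4 clips
text_feats = rng.normal(size=(4, D_TEXT))    # their 4 captions, in order

za = l2norm(audio_feats @ W_audio)
zt = l2norm(text_feats @ W_text)

# Contrastive objective: matched (audio, text) pairs sit on the diagonal
# of the similarity matrix; training pushes diagonal scores up and
# off-diagonal scores down.
logits = za @ zt.T / 0.07            # temperature-scaled cosine sims
labels = np.arange(4)
loss = -np.mean(np.log(np.exp(logits[labels, labels])
                       / np.exp(logits).sum(axis=1)))
print(np.isfinite(loss))  # True
```

A text‑to‑audio model then reuses the text tower's embedding as its conditioning signal, which is why one contrastive pretraining run serves so many downstream generators.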

2.5 2023–24 — Latent diffusion at scale

AudioLDM (H. Liu, Tian, et al. 2023) and AudioLDM 2 (H. Liu, Chen, et al. 2023) moved to latent diffusion, with AudioLDM 2’s “language of audio” abstraction unifying speech, music, and SFX. MusicLDM (K. Chen et al. 2023) added music‑specific tricks like tempo‑aware conditioning. Moûsai (Schneider et al. 2023) is a similar latent‑diffusion approach. Stable Audio (Evans et al. 2024) went stereo and longer (47 s).

I’m not massively into spectral‑domain synthesis because I think the stationarity assumption is a stretch (heh). Or rather, my contrarian instinct says working in the Fourier domain leaves audio quality on the table — transient attack on percussion, the way a struck string rings out, that kind of thing — even though it is true that e.g. latent‑diffusion systems work on spectral‑adjacent latents and apparently get away with it.

2.6 2024–26 — Open weights and DAW integration

Stable Audio Open released the weights publicly. OBSIDIAN Neural and friends started bringing model‑in‑plugin workflows into actual DAWs, with all the live‑performance implications that entails.

3 Methods, briefly

A short conceptual map.

3.1 Latent diffusion

Now the dominant pattern. A VAE compresses waveforms into a much lower‑rate latent space; diffusion runs in the latent space; conditioning is by text embedding (CLAP, T5) cross‑attended into the diffusion transformer or U‑Net. Stable Audio, AudioLDM, MusicLDM, Mousai are all variations on this theme. See neural diffusion for the underlying machinery.
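The pattern above can be caricatured in a few lines of numpy. This is a toy sketch, not any real model’s code: the “VAE” is a strided average, the “denoiser” is a dummy function, and the text embedding is random noise. The point is the shapes — the diffusion loop iterates at the cheap latent rate, and only the final decode touches waveform rate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "VAE": the encoder strided-averages the waveform into a low-rate
# latent; the decoder upsamples by repetition. A real model learns both,
# but the rate reduction is the point.
HOP = 64  # compression factor: 64 waveform samples per latent frame

def encode(wave):
    return wave.reshape(-1, HOP).mean(axis=1)

def decode(latent):
    return np.repeat(latent, HOP)

def denoiser(z_noisy, t, text_embedding):
    # Stand-in for the diffusion transformer / U-Net: a real model
    # predicts noise given timestep t and cross-attended text
    # conditioning. Here we just nudge toward the conditioning signal.
    return z_noisy - 0.1 * t * (z_noisy - text_embedding)

wave = np.sin(2 * np.pi * 440 * np.arange(4096) / 44100)
z = encode(wave)                           # (64,) latent vs (4096,) waveform
text_embedding = rng.normal(size=z.shape)  # e.g. a CLAP or T5 embedding

# Reverse diffusion: start from noise, iteratively denoise in latent space.
z_t = rng.normal(size=z.shape)
for t in np.linspace(1.0, 0.0, 50):
    z_t = denoiser(z_t, t, text_embedding)

audio = decode(z_t)  # only now do we pay waveform-rate cost
print(audio.shape)   # (4096,)
```

Fifty denoising steps over a 64‑element latent is trivially cheap; fifty steps over 4096 raw samples per chunk would not be, which is the whole argument for doing diffusion in latent space.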

3.2 Token‑based language models

Encode audio as discrete tokens with a neural codec (EnCodec, SoundStream); train a transformer LM to predict the tokens autoregressively; decode the predicted tokens back to audio. MusicGen and AudioGen are the canonical examples.
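The three stages — codec encode, autoregressive prediction, codec decode — fit in a toy numpy sketch. Everything here is a deliberately dumb stand‑in: scalar quantisation instead of EnCodec’s residual vector quantisation, a smoothed bigram table instead of a transformer. Only the dataflow matches the real systems.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "neural codec": scalar-quantize the waveform against a small
# codebook. Real codecs (EnCodec, SoundStream) use residual vector
# quantization over learned latents at a much lower frame rate.
codebook = np.linspace(-1.0, 1.0, 16)

def audio_to_tokens(wave):
    return np.abs(wave[:, None] - codebook[None, :]).argmin(axis=1)

def tokens_to_audio(tokens):
    return codebook[tokens]

# Toy "LM": a Laplace-smoothed bigram table over token pairs, standing
# in for a transformer trained on next-token prediction.
wave = np.sin(2 * np.pi * np.arange(512) / 64)
tokens = audio_to_tokens(wave)
counts = np.ones((16, 16))
for a, b in zip(tokens[:-1], tokens[1:]):
    counts[a, b] += 1
probs = counts / counts.sum(axis=1, keepdims=True)

# Autoregressive generation: sample token by token, then decode.
seq = [int(tokens[0])]
for _ in range(255):
    seq.append(int(rng.choice(16, p=probs[seq[-1]])))
out = tokens_to_audio(np.array(seq))
print(out.shape)  # (256,)
```

The design consequence is that all the usual language‑model machinery — sampling temperature, classifier‑free guidance over token logits, prompt continuation — transfers directly to audio.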

3.3 Differentiable DSP

DDSP threads a different needle: keep the classical synthesis topology (oscillator, filter, reverb), make the parameters differentiable, learn parameter trajectories from audio. The output is then by construction a real synthesis chain, not a regressed waveform.
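A minimal sketch of the core DDSP building block, a harmonic oscillator bank — here with fixed `f0` and amplitudes for clarity, whereas DDSP proper has a network predict them as time‑varying trajectories:

```python
import numpy as np

SR = 16000

def harmonic_synth(f0, harmonic_amps, seconds=0.5, sr=SR):
    """Bank of sinusoidal oscillators: the core DDSP building block.
    In DDSP proper f0 and the per-harmonic amplitudes are time-varying
    and predicted by a network; here they are constants."""
    t = np.arange(int(seconds * sr)) / sr
    out = np.zeros_like(t)
    for k, amp in enumerate(harmonic_amps, start=1):
        if k * f0 < sr / 2:  # drop harmonics above Nyquist
            out += amp * np.sin(2 * np.pi * k * f0 * t)
    return out

# A 220 Hz tone with 1/k harmonic rolloff -- sawtooth-ish timbre.
amps = np.array([1.0 / k for k in range(1, 9)])
note = harmonic_synth(220.0, amps)
print(note.shape)  # (8000,)
```

Because every operation here is smooth in `f0` and the amplitudes, an audio‑domain loss (e.g. a multi‑scale spectrogram loss) can be backpropagated to the synthesis parameters — which is the whole trick.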

3.4 State‑space models

SaShiMi (Goel et al. 2022) and successors. Not currently the state of the art for music generation (which is surprising to me, TBH — so many connections to traditional synthesis methods) but a tidy alternative to attention for long sequences.
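The layer underneath is just a linear state‑space recurrence. A naive numpy sketch of that map (real S4‑style models parameterise the matrices carefully — e.g. HiPPO initialisation — and evaluate it as a long convolution rather than this O(L) sequential scan):

```python
import numpy as np

# The linear state-space recurrence behind S4/SaShiMi-style layers:
#   x[k] = A x[k-1] + B u[k],   y[k] = C x[k] + D u[k]
def ssm_scan(A, B, C, D, u):
    x = np.zeros(A.shape[0])
    ys = []
    for u_k in u:
        x = A @ x + B * u_k        # state update
        ys.append(C @ x + D * u_k)  # readout
    return np.array(ys)

rng = np.random.default_rng(0)
N = 4                                             # state size
A = np.diag(np.exp(-np.linspace(0.01, 1.0, N)))   # stable decay modes
B = np.ones(N)
C = rng.normal(size=N) / N
D = 0.0
u = rng.normal(size=64)   # raw audio samples in, samples out
y = ssm_scan(A, B, C, D, u)
print(y.shape)  # (64,)
```

A diagonal `A` with decay rates is exactly a bank of leaky integrators, which is why the connection to classical filter design keeps coming up.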

4 Tooling

5 Praxis and politics

Streaming platforms are now flooded with AI slop songs cranked out at low marginal cost, royalty pools are getting diluted, and working musicians’ incomes are getting squeezed.

The counter‑current is composers and producers who treat the models as instruments rather than as cheap musician‑replacements. Examples: Jlin and Holly Herndon showed early on what an AI‑forward composition practice could look like — the model as collaborator and as instrument, the glitches as material. GAN.STYLE is in a similar vein.

6 See also

7 References

Blaauw, and Bonada. 2017. “A Neural Parametric Singing Synthesizer.” arXiv:1704.03809 [Cs].
Carr, and Zukowski. 2018. “Generating Albums with SampleRNN to Imitate Metal, Rock, and Punk Bands.” arXiv:1811.06633 [Cs, Eess].
Chen, Ke, Wu, Liu, et al. 2023. “MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies.”
Chen, Nanxin, Zhang, Zen, et al. 2020. “WaveGrad: Estimating Gradients for Waveform Generation.”
Copet, Kreuk, Gat, et al. 2023. “Simple and Controllable Music Generation.”
Dieleman, Oord, and Simonyan. 2018. “The Challenge of Realistic Music Generation: Modelling Raw Audio at Scale.” In Advances In Neural Information Processing Systems.
Du, Collins, Tenenbaum, et al. 2021. “Learning Signal-Agnostic Manifolds of Neural Fields.” In Advances in Neural Information Processing Systems.
Dupont, Kim, Eslami, et al. 2022. “From Data to Functa: Your Data Point Is a Function and You Can Treat It Like One.” In Proceedings of the 39th International Conference on Machine Learning.
Elbaz, and Zibulevsky. 2017. “Perceptual Audio Loss Function for Deep Learning.” In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR’2017), Suzhou, China.
Engel, Resnick, Roberts, et al. 2017. “Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders.” In PMLR.
Evans, Parker, Carr, et al. 2024. “Stable Audio Open.”
Goel, Gu, Donahue, et al. 2022. “It’s Raw! Audio Generation with State-Space Models.”
Grais, Ward, and Plumbley. 2018. “Raw Multi-Channel Audio Source Separation Using Multi-Resolution Convolutional Auto-Encoders.” arXiv:1803.00702 [Cs].
Hernandez-Olivan, Hernandez-Olivan, and Beltran. 2022. “A Survey on Artificial Intelligence for Music Generation: Agents, Domains and Perspectives.”
Kong, Ping, Huang, et al. 2021. “DiffWave: A Versatile Diffusion Model for Audio Synthesis.”
Kreuk, Synnaeve, Polyak, et al. 2022. “AudioGen: Textually Guided Audio Generation.”
Kreuk, Taigman, Polyak, et al. 2022. “Audio Language Modeling Using Perceptually-Guided Discrete Representations.”
Lee, and Han. 2021. “NU-Wave: A Diffusion Probabilistic Model for Neural Audio Upsampling.” In Interspeech 2021.
Levy, Di Giorgi, Weers, et al. 2023. “Controllable Music Production with Diffusion Models and Guidance Gradients.”
Liu, Haohe, Chen, Yuan, et al. 2023. “AudioLDM: Text-to-Audio Generation with Latent Diffusion Models.”
Liu, Yuzhou, Thoshkahna, Milani, et al. 2020. “Voice and Accompaniment Separation in Music Using Self-Attention Convolutional Neural Network.”
Liu, Haohe, Tian, Yuan, et al. 2023. “AudioLDM 2: Learning Holistic Audio Generation with Self-Supervised Pretraining.”
Liutkus, Badeau, and Richard. 2011. “Gaussian Processes for Underdetermined Source Separation.” IEEE Transactions on Signal Processing.
Luo, Du, Tarr, et al. 2021. “Learning Neural Acoustic Fields.” In.
Mehri, Kumar, Gulrajani, et al. 2017. “SampleRNN: An Unconditional End-to-End Neural Audio Generation Model.” In Proceedings of International Conference on Learning Representations (ICLR) 2017.
Pascual, Bhattacharya, Yeh, et al. 2022. “Full-Band General Audio Synthesis with Score-Based Diffusion.”
Sarroff, and Casey. 2014. “Musical Audio Synthesis Using Autoencoding Neural Nets.” In.
Schlüter, and Böck. 2014. “Improved Musical Onset Detection with Convolutional Neural Networks.” In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
Schneider, Kamal, Jin, et al. 2023. “Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion.”
Sprechmann, Bruna, and LeCun. 2014. “Audio Source Separation with Discriminative Scattering Networks.” arXiv:1412.7022 [Cs].
Stöter, Uhlich, Liutkus, et al. 2019. “Open-Unmix - A Reference Implementation for Music Source Separation.” Journal of Open Source Software.
Tenenbaum, and Freeman. 2000. “Separating Style and Content with Bilinear Models.” Neural Computation.
Tzinis, Wang, and Smaragdis. 2020. “Sudo Rm -Rf: Efficient Networks for Universal Audio Source Separation.” In.
Venkataramani, and Smaragdis. 2017. “End to End Source Separation with Adaptive Front-Ends.” arXiv:1705.02514 [Cs].
Venkataramani, Subakan, and Smaragdis. 2017. “Neural Network Alternatives to Convolutive Audio Models for Source Separation.” arXiv:1709.07908 [Cs, Eess].
Verma, and Smith. 2018. “Neural Style Transfer for Audio Spectograms.” In 31st Conference on Neural Information Processing Systems (NIPS 2017).
von Platen, Patil, Lozhkov, et al. 2022. “Diffusers: State-of-the-Art Diffusion Models.”
Wu, Chen, Zhang, et al. 2023. “Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation.”
Wyse. 2017. “Audio Spectrogram Representations for Processing with Convolutional Neural Networks.” In Proceedings of the First International Conference on Deep Learning and Music, Anchorage, US, May, 2017 (arXiv:1706.08675v1 [Cs.NE]).
Xu, Wang, Jiang, et al. 2022. “Signal Processing for Implicit Neural Representations.” In.