Neural generative audio
2016-01-15 — 2026-05-04
Wherein the lineage of audio-generation approaches is traced from sample-level autoregression through to latent diffusion, and the DAW plugin as a delivery mechanism for open-weights models is noted.
Neural networks generating audio: music, sound effects, the lot. This is a quick orientation page rather than a thorough survey; I have not been following closely. For the symbolic / MIDI side we have a separate notebook; for speech see voice fakes; for non‑NN signal models see analysis/resynthesis; and the underlying machinery sits in neural diffusion.
1 Where things stand (2025–26)
The current crop of open models:
| Model | SFX | Music | Max length | Sample rate | Paper | Code | Weights |
|---|---|---|---|---|---|---|---|
| Stable Audio Open 1.0 | ✅ | ✅ | 47 s | 44.1 kHz stereo | (Evans et al. 2024) | stability-ai/stable-audio-tools | HF |
| AudioGen (AudioCraft) | ✅ | ❌ | ~10 s | 16 kHz mono | (Kreuk, Synnaeve, et al. 2022) | facebookresearch/audiocraft | HF |
| MusicGen (AudioCraft) | ❌ | ✅ | 30 s native, longer via sliding window | 32 kHz | (Copet et al. 2023) | facebookresearch/audiocraft | HF |
| AudioLDM / AudioLDM 2 | ✅ | ✅ | ~10 s native | 16 kHz (v1), 48 kHz checkpoint (v2) | (H. Liu, Tian, et al. 2023; H. Liu, Chen, et al. 2023) | haoheliu/AudioLDM2 | HF |
AudioCraft ships training scripts, and stable-audio-tools ships training scaffolding too; the released Stable Audio Open weights are for non‑commercial use under the Stability AI Community License.
Stable Audio Open can also be served as an endpoint via vLLM‑Omni, which is convenient if we want a dedicated audio‑generation service.
1.1 In the DAW
OBSIDIAN Neural is a free AGPL‑3.0 VST3 / AU plugin (Windows / macOS / Linux) which wraps Stable Audio Open with MIDI triggering and tempo sync, plus several specialised fine‑tunes. The repository is still at innermost47/ai-dj under its old name; the homepage and README reflect the rebrand. It is, as the README puts it, “30 seconds of patience per loop.”
2 How we got here
A compressed timeline. NB I have been checked out for the last 3 years and have missed much progress.
2.1 2016–18 — Raw waveform, sample by sample
WaveNet (DeepMind) and SampleRNN (Mehri et al. 2017) modelled audio one sample at a time with autoregressive networks. NSynth / WaveNet autoencoders (Engel et al. 2017) applied the idea to musical timbre. Dadabots’ SampleRNN metal albums are the entertaining proof‑of‑concept.
Sander Dieleman’s 2020 essay on waveform‑domain synthesis is still the best orientation piece for this era.
2.2 2018–20 — Alternatives to autoregression
Magenta’s DDSP (code) embedded classical synthesis modules — oscillators, filters, reverb — as differentiable layers, so we get parameter inference instead of waveform regression. The timbre transfer demo still demos well, even though Magenta itself has stagnated.
GANSynth used GANs over spectrogram representations; WaveGAN did the same for raw waveforms. MelNet went big on conditional spectrogram modelling. OpenAI’s Jukebox was the most ambitious of the era — a hierarchical VQ‑VAE plus autoregressive transformer trained on raw music — and is mostly historical now, superseded by latent diffusion.
2.3 2020–22 — Diffusion arrives
WaveGrad (N. Chen et al. 2020) and DiffWave (Kong et al. 2021) demonstrated denoising diffusion on raw audio. SaShiMi (Goel et al. 2022) showed that structured state‑space models could model raw audio as well as anything autoregressive — see the examples and code. NU‑Wave (Lee and Han 2021) applied diffusion to upsampling; full‑band synthesis followed (Pascual et al. 2022).
2.4 2022–23 — Text conditioning
CLAP (Wu et al. 2023) gave us a contrastive joint embedding for text and audio, which is what most of the text‑to‑audio models use under the hood. Meta’s AudioGen (Kreuk, Synnaeve, et al. 2022) and Google’s MusicLM got the text‑to‑audio and text‑to‑music ball rolling. MusicGen (Copet et al. 2023) (the AudioCraft one) collapsed the cascade into a single transformer LM over EnCodec tokens.
2.5 2023–24 — Latent diffusion at scale
AudioLDM (H. Liu, Tian, et al. 2023) and AudioLDM 2 (H. Liu, Chen, et al. 2023) moved to latent diffusion, with AudioLDM 2’s “language of audio” abstraction unifying speech, music, and SFX. MusicLDM (K. Chen et al. 2023) added music‑specific tricks like tempo‑aware conditioning. Mousai (Schneider et al. 2023) is a similar latent‑diffusion approach. Stable Audio (Evans et al. 2024) went stereo and longer (47 s).
I’m not massively into spectral‑domain synthesis because I think the stationarity assumption is a stretch (heh). Or rather, my contrarian instinct says working in the Fourier domain leaves audio quality on the table — transient attack on percussion, the way a struck string rings out, that kind of thing — even though it is true that e.g. latent‑diffusion systems work on spectral‑adjacent latents and apparently get away with it.
2.6 2024–26 — Open weights and DAW integration
Stable Audio Open released the weights publicly. OBSIDIAN Neural and friends started bringing model‑in‑plugin workflows into actual DAWs, with all the live‑performance implications that entails.
3 Methods, briefly
A short conceptual map.
3.1 Latent diffusion
Now the dominant pattern. A VAE compresses waveforms into a much lower‑rate latent space; diffusion runs in the latent space; conditioning is by text embedding (CLAP, T5) cross‑attended into the diffusion transformer or U‑Net. Stable Audio, AudioLDM, MusicLDM, Mousai are all variations on this theme. See neural diffusion for the underlying machinery.
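For concreteness, the diffusers StableAudioPipeline wraps the whole stack, per the Hugging Face documentation (assumes a CUDA GPU and that you have accepted the gated weights on the Hub; the prompt is just an example):

```python
import soundfile as sf
import torch
from diffusers import StableAudioPipeline

# Loads the VAE, the T5 text encoder, and the diffusion transformer in one go.
pipe = StableAudioPipeline.from_pretrained(
    "stabilityai/stable-audio-open-1.0", torch_dtype=torch.float16
).to("cuda")

audio = pipe(
    "hammer hitting a wooden surface",
    negative_prompt="low quality",
    num_inference_steps=200,   # more steps: slower, cleaner
    audio_end_in_s=10.0,       # up to 47 s for this model
).audios[0]

# (channels, samples) -> (samples, channels) for soundfile
sf.write("hammer.wav", audio.T.float().cpu().numpy(), pipe.vae.sampling_rate)
```

The pipeline hides the interesting part: the prompt embedding is cross‑attended into the denoiser at every step, and the final latent is decoded by the VAE back to a waveform.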
3.2 Token‑based language models
Encode audio as discrete tokens with a neural codec (EnCodec, SoundStream); train a transformer LM to predict the tokens autoregressively; decode the predicted tokens back to audio. MusicGen and AudioGen are the canonical examples.
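A minimal generation script, following the documented audiocraft API (prompt mine):

```python
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("facebook/musicgen-small")
model.set_generation_params(duration=8)  # seconds; ~50 EnCodec token frames per second

# One waveform per prompt: sampled token-by-token, then decoded by EnCodec.
wavs = model.generate(["lo-fi house groove with dusty Rhodes chords"])
for i, wav in enumerate(wavs):
    # audio_write handles loudness normalisation and writes a wav file
    audio_write(f"musicgen_{i}", wav.cpu(), model.sample_rate, strategy="loudness")
```

Note the asymmetry with diffusion: generation time scales with output duration (autoregressive sampling) rather than with a fixed step count.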
3.3 Differentiable DSP
DDSP threads a different needle: keep the classical synthesis topology (oscillator, filter, reverb), make the parameters differentiable, learn parameter trajectories from audio. The output is then by construction a real synthesis chain, not a regressed waveform.
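To make "differentiable" concrete, here is a toy harmonic oscillator in PyTorch. This is not Magenta's DDSP implementation (which adds a filtered‑noise branch, reverb, and a multi‑scale spectral loss); it only demonstrates that once the synth is ordinary tensor ops, gradients flow from a waveform loss back into the control parameters:

```python
import torch

def harmonic_synth(f0, amps, sample_rate=16000):
    """Sum of sinusoids at integer multiples of f0, amplitude-weighted.

    f0:   (T,) fundamental frequency in Hz, one value per sample
    amps: (H, T) per-harmonic amplitude envelopes
    """
    harmonics = torch.arange(1, amps.shape[0] + 1).unsqueeze(1)   # (H, 1)
    phase = 2 * torch.pi * torch.cumsum(f0 / sample_rate, dim=0)  # integrate frequency
    return (amps * torch.sin(harmonics * phase)).sum(dim=0)       # (T,)

# Fit the amplitude envelopes to a target by gradient descent; real DDSP
# instead predicts them with a network from pitch/loudness features.
f0 = torch.full((16000,), 220.0)                 # one second of A3
amps = torch.rand(8, 16000, requires_grad=True)  # 8 learnable harmonics
target = harmonic_synth(f0, torch.ones(8, 16000) / 8)
opt = torch.optim.Adam([amps], lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss = ((harmonic_synth(f0, amps) - target) ** 2).mean()
    loss.backward()
    opt.step()
```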
3.4 State‑space models
SaShiMi (Goel et al. 2022) and successors. Not currently the state of the art for music generation (which is surprising to me, TBH, given how many connections there are to traditional synthesis methods) but a tidy alternative to attention for long sequences.
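For orientation, the skeleton these layers elaborate is a discrete linear state‑space recurrence. A toy version with a diagonal state matrix, in the spirit of the later diagonal variants; nothing like S4's actual parameterisation or its FFT‑based training:

```python
import torch

def ssm_scan(A_diag, B, C, u):
    """Toy diagonal SSM: x_{k+1} = A x_k + B u_k,  y_k = C . x_k.

    A_diag, B, C: (N,) parameters; u: (T,) input signal -> (T,) output.
    S4-style layers compute the same linear map as a long convolution
    via FFT at training time; this naive loop is O(T) sequential.
    """
    x = torch.zeros_like(A_diag)
    ys = []
    for u_k in u:
        x = A_diag * x + B * u_k   # diagonal A: elementwise state update
        ys.append(torch.dot(C, x))
    return torch.stack(ys)

N = 16
A_diag = 0.99 * torch.rand(N)   # eigenvalues in (0, 0.99): stable memory decay
B, C = torch.randn(N), torch.randn(N)
y = ssm_scan(A_diag, B, C, torch.randn(16000))  # filter one second at 16 kHz
```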
4 Tooling
- Hugging Face diffusers (von Platen et al. 2022) is the de facto standard for running these models; the intro notebook is the obvious starting point.
- facebookresearch/audiocraft — MusicGen, AudioGen, EnCodec.
- Stability-AI/stable-audio-tools — Stable Audio Open weights and training scaffolding.
- archinetai/audio-diffusion-pytorch — research codebase for audio diffusion.
- acids‑ircam diffusion notebooks — pedagogical implementations of waveform diffusion.
5 Praxis and politics
Streaming platforms are now flooded with AI slop songs cranked out at low marginal cost, royalty pools are getting diluted, and working musicians’ incomes are getting squeezed.
The counter‑current is composers and producers who treat the models as instruments rather than as cheap musician‑replacements. Examples: Jlin and Holly Herndon showed early on what an AI‑forward composition practice could look like — the model as collaborator and as instrument, the glitches as material. GAN.STYLE is in a similar vein.
6 See also
- generative music — symbolic / MIDI / score‑level generation.
- voice fakes — speech and singing synthesis.
- analysis/resynthesis — non‑NN audio signal models.
- neural diffusion — the diffusion machinery underneath everything from §2.3 onwards.
- arpeggiate by numbers — music theory, the way a programmer might want to learn it.
- machine listening — the analysis side of the same coin.
