Generative music with language+diffusion models

September 16, 2022 — December 6, 2023

computers are awful
generative art
machine learning
making things
neural nets


A special class of generative AI for music. For alternatives, see nn music.

Here we consider specifically diffusion models, much like those used in diffusion image synthesis, but for audio.

(N. Chen et al. 2020; Goel et al. 2022; Hernandez-Olivan, Hernandez-Olivan, and Beltran 2022; Kreuk, Taigman, et al. 2022; Kreuk, Synnaeve, et al. 2022; Lee and Han 2021; Pascual et al. 2022; von Platen et al. 2022)
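The core move in waveform diffusion (WaveGrad, DiffWave and friends) is the same as in image diffusion: corrupt the signal with a closed-form Gaussian noising process, then train a network to invert it. A toy sketch of the forward process on raw audio, with a made-up linear noise schedule (not any particular paper's exact schedule):

```python
# Toy sketch of the DDPM forward (noising) process on a raw waveform,
# in the style of WaveGrad / DiffWave. Schedule values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)        # linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)      # cumulative signal-retention factor

def q_sample(x0, t):
    """Sample x_t ~ q(x_t | x_0) in closed form."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * noise

x0 = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of A440 at 16 kHz
x_mid = q_sample(x0, 500)     # partially noised: tone still audible
x_end = q_sample(x0, T - 1)   # nearly pure noise; the model learns to reverse this
```

The denoiser is trained to predict the injected noise at each `t`; sampling runs the chain backwards from `x_end`-like noise to a waveform.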

1 Text-to-music

Not really my jam, but very interesting.

  • CLAP (contrastive language-audio pretraining) seems to be the dominant labelling method.

  • MusicLM

  • AudioGen: Textually Guided Audio Generation

  • AudioLDM: Text-to-Audio Generation with Latent Diffusion Models

    Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can significantly differ from those of other types. To bring us closer to a unified perspective of audio generation, this paper proposes a framework that utilizes the same learning method for speech, music, and sound effect generation. Our framework introduces a general representation of audio, called “language of audio” (LOA). Any audio can be translated into LOA based on AudioMAE, a self-supervised pre-trained representation learning model. In the generation process, we translate any modalities into LOA by using a GPT-2 model, and we perform self-supervised audio generation learning with a latent diffusion model conditioned on LOA. The proposed framework naturally brings advantages such as in-context learning abilities and reusable self-supervised pretrained AudioMAE and latent diffusion models. Experiments on the major benchmarks of text-to-audio, text-to-music, and text-to-speech demonstrate new state-of-the-art or competitive performance to previous approaches.

    • MusicLDM extends this with some interesting music-specific tricks, such as tempo-aware controls.
  • MusicGen: Simple and Controllable Music Generation

    We tackle the task of conditional music generation. We introduce MusicGen, a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens. Unlike prior work, MusicGen is comprised of a single-stage transformer LM together with efficient token interleaving patterns, which eliminates the need for cascading several models, e.g., hierarchically or upsampling. Following this approach, we demonstrate how MusicGen can generate high-quality samples, while being conditioned on textual description or melodic features, allowing better controls over the generated output. We conduct extensive empirical evaluation, considering both automatic and human studies, showing the proposed approach is superior to the evaluated baselines on a standard text-to-music benchmark. Through ablation studies, we shed light over the importance of each of the components comprising MusicGen. Music samples can be found on the supplemental materials. Code and models are available on our repo
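The "token interleaving patterns" MusicGen leans on are worth unpacking: the audio codec emits several parallel codebook streams per timestep, and MusicGen's "delay" pattern shifts codebook k right by k steps so a single-stage LM can emit one token per codebook per step. A toy reconstruction in plain Python (the token names and pad symbol are made up for readability):

```python
# Toy illustration of a MusicGen-style "delay" interleaving pattern:
# codebook k is shifted right by k steps, so each fine codebook token is
# generated after the coarser codebooks' tokens for the same timestep.
PAD = "_"  # placeholder padding token

def delay_interleave(tokens):
    """tokens: K lists (codebooks) of length T -> K lists of length T + K - 1."""
    K = len(tokens)
    out = []
    for k, stream in enumerate(tokens):
        out.append([PAD] * k + list(stream) + [PAD] * (K - 1 - k))
    return out

# 3 codebooks, 4 timesteps; token "ck_t" = codebook k at timestep t
cb = [[f"c{k}_{t}" for t in range(4)] for k in range(3)]
grid = delay_interleave(cb)
for row in grid:
    print(row)
```

Reading the grid column by column gives the LM's generation order: at each step the model predicts one token per codebook, with codebook k seeing timestep t only after codebook k-1 has emitted its token for t.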

2 Tooling

3 References

Chen, Ke, Wu, Liu, et al. 2023. “MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies.”
Chen, Nanxin, Zhang, Zen, et al. 2020. “WaveGrad: Estimating Gradients for Waveform Generation.”
Copet, Kreuk, Gat, et al. 2023. “Simple and Controllable Music Generation.”
Goel, Gu, Donahue, et al. 2022. “It’s Raw! Audio Generation with State-Space Models.”
Hernandez-Olivan, Hernandez-Olivan, and Beltran. 2022. “A Survey on Artificial Intelligence for Music Generation: Agents, Domains and Perspectives.”
Kong, Ping, Huang, et al. 2021. “DiffWave: A Versatile Diffusion Model for Audio Synthesis.”
Kreuk, Synnaeve, Polyak, et al. 2022. “AudioGen: Textually Guided Audio Generation.”
Kreuk, Taigman, Polyak, et al. 2022. “Audio Language Modeling Using Perceptually-Guided Discrete Representations.”
Lee, and Han. 2021. “NU-Wave: A Diffusion Probabilistic Model for Neural Audio Upsampling.” In Interspeech 2021.
Levy, Di Giorgi, Weers, et al. 2023. “Controllable Music Production with Diffusion Models and Guidance Gradients.”
Liu, Tian, Yuan, et al. 2023. “AudioLDM 2: Learning Holistic Audio Generation with Self-Supervised Pretraining.”
Pascual, Bhattacharya, Yeh, et al. 2022. “Full-Band General Audio Synthesis with Score-Based Diffusion.”
Schneider, Kamal, Jin, et al. 2023. “Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion.”
von Platen, Patil, Lozhkov, et al. 2022. “Diffusers: State-of-the-Art Diffusion Models.”
Wu, Chen, Zhang, et al. 2023. “Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation.”