Generative music with language+diffusion models

September 16, 2022 – December 6, 2023

Figure 1


A special class of generative AI for music. For other alternatives, see nn music.

Here we consider specifically using diffusion models, much like those used in diffusion image synthesis, but applied to audio.
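The common core of waveform diffusion models such as WaveGrad and DiffWave is the forward (noising) process, which can be sketched in a few lines. Everything below (the linear schedule, the function names, the sine "waveform") is an illustrative assumption for exposition, not any particular paper's implementation:

```python
import numpy as np

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """A simple linear noise schedule beta_1 .. beta_T (assumed, for illustration)."""
    return np.linspace(beta_start, beta_end, T)

def forward_diffuse(x0, t, alphas_cumprod, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I)
    in closed form, without simulating the intermediate steps."""
    abar = alphas_cumprod[t]
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * noise, noise

T = 1000
betas = linear_beta_schedule(T)
alphas_cumprod = np.cumprod(1.0 - betas)  # abar_t = prod_{s<=t} (1 - beta_s)

rng = np.random.default_rng(0)
sr = 16000
ts_axis = np.arange(sr) / sr
x0 = np.sin(2 * np.pi * 440.0 * ts_axis)  # a one-second 440 Hz stand-in "waveform"

# Early in the chain the signal still dominates; at t = T-1 x_t is nearly pure noise.
xt_early, _ = forward_diffuse(x0, 10, alphas_cumprod, rng)
xt_late, _ = forward_diffuse(x0, T - 1, alphas_cumprod, rng)
```

The generative model is then a network trained to predict the added noise from `x_t` and `t`, which at sampling time is run in reverse from pure noise; the papers differ mainly in what the network conditions on (mel spectrograms, text embeddings, latents).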

(N. Chen et al. 2020; Goel et al. 2022; Hernandez-Olivan, Hernandez-Olivan, and Beltran 2022; Kreuk, Taigman, et al. 2022; Kreuk, Synnaeve, et al. 2022; Lee and Han 2021; Pascual et al. 2022; von Platen et al. 2022)

1 Text-to-music

Not really my jam, but very interesting.

  • CLAP seems to be the dominant labeling method.

  • MusicLM

  • AudioGen: Textually Guided Audio Generation

  • AudioLDM: Text-to-Audio Generation with Latent Diffusion Models

    Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can significantly differ from those of other types. To bring us closer to a unified perspective of audio generation, this paper proposes a framework that utilizes the same learning method for speech, music, and sound effect generation. Our framework introduces a general representation of audio, called “language of audio” (LOA). Any audio can be translated into LOA based on AudioMAE, a self-supervised pre-trained representation learning model. In the generation process, we translate any modalities into LOA by using a GPT-2 model, and we perform self-supervised audio generation learning with a latent diffusion model conditioned on LOA. The proposed framework naturally brings advantages such as in-context learning abilities and reusable self-supervised pretrained AudioMAE and latent diffusion models. Experiments on the major benchmarks of text-to-audio, text-to-music, and text-to-speech demonstrate new state-of-the-art or competitive performance to previous approaches.

    • MusicLDM extends this with some interesting music-specific tricks, such as beat-synchronous mixup and tempo-aware controls.
  • MusicGen: Simple and Controllable Music Generation

    We tackle the task of conditional music generation. We introduce MusicGen, a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens. Unlike prior work, MusicGen is comprised of a single-stage transformer LM together with efficient token interleaving patterns, which eliminates the need for cascading several models, e.g., hierarchically or upsampling. Following this approach, we demonstrate how MusicGen can generate high-quality samples, while being conditioned on textual description or melodic features, allowing better controls over the generated output. We conduct extensive empirical evaluation, considering both automatic and human studies, showing the proposed approach is superior to the evaluated baselines on a standard text-to-music benchmark. Through ablation studies, we shed light over the importance of each of the components comprising MusicGen. Music samples can be found on the supplemental materials. Code and models are available on our repo
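The "token interleaving patterns" mentioned in that abstract can be illustrated with a delay pattern: each of the K parallel codebook streams is shifted right by its index, so a single-stage LM only ever conditions on already-generated tokens of the coarser codebooks at the same time step. This is a hypothetical sketch, not MusicGen's actual code; the token values and the `PAD` sentinel are made up:

```python
PAD = -1  # assumed padding sentinel for positions outside a shifted stream

def delay_interleave(codes):
    """codes: list of K lists (one per codebook), each of length T.
    Returns a K x (T + K - 1) grid with codebook k shifted right by k steps."""
    K, T = len(codes), len(codes[0])
    width = T + K - 1
    grid = [[PAD] * width for _ in range(K)]
    for k, stream in enumerate(codes):
        for t, tok in enumerate(stream):
            grid[k][k + t] = tok
    return grid

def delay_deinterleave(grid):
    """Invert the pattern, recovering the original K x T streams."""
    K = len(grid)
    T = len(grid[0]) - (K - 1)
    return [grid[k][k:k + T] for k in range(K)]

codes = [[10, 11, 12, 13],   # codebook 0 (coarsest)
         [20, 21, 22, 23],   # codebook 1
         [30, 31, 32, 33]]   # codebook 2 (finest)
grid = delay_interleave(codes)
```

At each column of `grid` the model predicts one token per codebook in parallel, which is how the single-stage LM avoids the cascaded or hierarchical models the abstract mentions.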

2 Tooling

3 References

Chen, Ke, Wu, Liu, et al. 2023. “MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies.”
Chen, Nanxin, Zhang, Zen, et al. 2020. “WaveGrad: Estimating Gradients for Waveform Generation.”
Copet, Kreuk, Gat, et al. 2023. “Simple and Controllable Music Generation.”
Goel, Gu, Donahue, et al. 2022. “It's Raw! Audio Generation with State-Space Models.”
Hernandez-Olivan, Hernandez-Olivan, and Beltran. 2022. “A Survey on Artificial Intelligence for Music Generation: Agents, Domains and Perspectives.”
Kong, Ping, Huang, et al. 2021. “DiffWave: A Versatile Diffusion Model for Audio Synthesis.”
Kreuk, Synnaeve, Polyak, et al. 2022. “AudioGen: Textually Guided Audio Generation.”
Kreuk, Taigman, Polyak, et al. 2022. “Audio Language Modeling Using Perceptually-Guided Discrete Representations.”
Lee, and Han. 2021. “NU-Wave: A Diffusion Probabilistic Model for Neural Audio Upsampling.” In Interspeech 2021.
Levy, Di Giorgi, Weers, et al. 2023. “Controllable Music Production with Diffusion Models and Guidance Gradients.”
Liu, Tian, Yuan, et al. 2023. “AudioLDM 2: Learning Holistic Audio Generation with Self-Supervised Pretraining.”
Pascual, Bhattacharya, Yeh, et al. 2022. “Full-Band General Audio Synthesis with Score-Based Diffusion.”
Schneider, Kamal, Jin, et al. 2023. “Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion.”
von Platen, Patil, Lozhkov, et al. 2022. “Diffusers: State-of-the-Art Diffusion Models.”
Wu, Chen, Zhang, et al. 2023. “Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation.”