Neural music synthesis

January 15, 2016 — October 14, 2021

I have a lot of feelings and ideas about this, but no time to write them down. For now, here are some links and ideas by other people.

Sander Dieleman on waveform-domain neural synthesis. Matt Vitelli on music generation from MP3s (source). Alex Graves on RNN predictive synthesis. Parag Mital on RNN style transfer.

1 Models

I’m not massively into spectral-domain synthesis because I think the stationarity assumption is a bit of a stretch (heh). Very much into raw audio, me.
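To make the stationarity gripe concrete, here is a tiny numpy sketch (my own toy, not from any of the linked posts) of what spectral-domain methods assume: chop audio into short frames and treat each frame as a fixed mixture of sinusoids. For a pure tone that assumption is exact; for most music it is only approximately true.

```python
import numpy as np

sr = 8000                        # sample rate in Hz (arbitrary for the demo)
t = np.arange(sr) / sr           # one second of time
x = np.sin(2 * np.pi * 440 * t)  # a genuinely stationary 440 Hz tone

# One windowed STFT frame: the quasi-stationarity assumption says this
# 256-sample chunk is well described by a fixed magnitude spectrum.
frame = x[:256] * np.hanning(256)
spectrum = np.abs(np.fft.rfft(frame))

# For a pure tone the energy concentrates around one bin.
peak_bin = int(np.argmax(spectrum))
peak_hz = peak_bin * sr / 256    # bin spacing is sr / frame_length
```

The frame-level spectrum pins the tone down to within one bin (31.25 Hz here); transient-heavy audio smears across bins instead, which is the “stretch” in question.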

1.1 Diffusion

See generative music diffusion.

1.2 Differentiable DSP

This is a really fun idea — do audio processing as normal, but using an NN framework so that the operations are differentiable.

Project site. Github. Twitter intro. Paper. Online supplement. Timbre transfer example. Tutorials.
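A minimal numpy sketch of the core idea, as I understand it (the real DDSP library implements these ops in TensorFlow so gradients flow through them; the function below is my own stand-in, not its API): a harmonic oscillator whose control parameters a neural net would predict, used as just another differentiable layer.

```python
import numpy as np

def harmonic_synth(f0, amps, sr=16000, n_samples=16000):
    """Additive synthesis: sum_k amps[k] * sin(2*pi*(k+1)*f0*t).

    f0: fundamental frequency in Hz; amps: per-harmonic amplitudes, shape [K].
    In a DDSP-style model both would be predicted by a network, and this
    whole function would sit inside the computation graph.
    """
    t = np.arange(n_samples) / sr
    harmonics = np.arange(1, len(amps) + 1)[:, None]  # [K, 1] harmonic numbers
    waves = np.sin(2 * np.pi * f0 * harmonics * t)    # [K, n_samples]
    return (amps[:, None] * waves).sum(axis=0)        # mix down to [n_samples]

# One second of a 220 Hz tone with decaying harmonic amplitudes.
audio = harmonic_synth(220.0, np.array([1.0, 0.5, 0.25]))
```

Because every operation here is smooth, a reconstruction loss on `audio` can be backpropagated into `f0` and `amps`, which is the whole trick.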

1.3 PixelRNN

PixelRNN turns out to be good at music. Dadabots have successfully weaponised SampleRNN, and it’s cute.
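SampleRNN-style models treat audio as a sequence of discrete samples and predict each one from the ones before it. A standard first step (used by WaveNet, and a reasonable sketch of the setup, though not Dadabots’ actual code) is 8-bit mu-law quantisation, which turns next-sample regression into a 256-way classification problem:

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Map float audio in [-1, 1] to integer classes 0..mu."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((y + 1) / 2 * mu).astype(np.int64)

def mu_law_decode(q, mu=255):
    """Approximate inverse of mu_law_encode."""
    y = 2 * q.astype(np.float64) / mu - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

x = np.sin(np.linspace(0, 2 * np.pi, 100))  # toy waveform in [-1, 1]
q = mu_law_encode(x)      # integer targets for an autoregressive net
x_hat = mu_law_decode(q)  # reconstruction after 8-bit quantisation
```

The logarithmic companding spends quantisation levels where audio amplitudes actually live, so 256 classes are enough to sound acceptable.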

1.4 Jukebox

OpenAI Jukebox is the latest hot generative music thing that I should be across.

1.5 GANSynth

1.6 WaveGAN

1.7 Neuralfunk

1.8 MelNet


Existing generative models for audio have predominantly aimed to directly model time-domain waveforms. MelNet instead aims to model the frequency content of an audio signal. MelNet can be used to model audio unconditionally, making it capable of tasks such as music generation. It can also be conditioned on text and speaker, making it applicable to tasks such as text-to-speech and voice conversion.
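The representation MelNet operates on is the mel-scale spectrogram. A hand-rolled sketch of that mapping (mine, not MelNet’s actual preprocessing): a Hz-to-mel warp plus a triangular filterbank that pools linear FFT bins into perceptually spaced bands.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=40, n_fft=512, sr=16000):
    """Return [n_mels, n_fft//2 + 1] triangular filters on the mel scale."""
    # Band edges: evenly spaced in mel, warped back to Hz, snapped to FFT bins.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        for b in range(lo, mid):          # rising slope of the triangle
            fb[i, b] = (b - lo) / max(mid - lo, 1)
        for b in range(mid, hi):          # falling slope
            fb[i, b] = (hi - b) / max(hi - mid, 1)
    return fb

fb = mel_filterbank()  # multiply a magnitude spectrogram by fb.T to get mel bands
```

MelNet then models these 2D mel spectrograms autoregressively, which sidesteps the very long sequence lengths of raw waveforms.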

1.9 State spaces

(Dieleman, Oord, and Simonyan 2018; Goel et al. 2022)
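At heart, the state-space layers in (Goel et al. 2022) are a learned linear recurrence: x[k+1] = A x[k] + B u[k], y[k] = C x[k]. A minimal sketch with random matrices (not learned, and not using S4’s special parameterisation of A):

```python
import numpy as np

rng = np.random.default_rng(0)
n_state, n_steps = 4, 32
A = 0.9 * np.eye(n_state)             # stable toy dynamics (eigenvalues < 1)
B = rng.standard_normal((n_state, 1)) # input projection
C = rng.standard_normal((1, n_state)) # output projection

u = rng.standard_normal(n_steps)      # input sequence (think: audio samples)
x = np.zeros(n_state)                 # hidden state
y = np.empty(n_steps)
for k in range(n_steps):
    y[k] = (C @ x).item()             # read out before updating the state
    x = A @ x + B[:, 0] * u[k]        # linear state update
```

S4’s contribution is a structured parameterisation of A that keeps this recurrence stable over very long sequences and lets it be evaluated cheaply as a long convolution, which is what makes raw-audio-scale modelling feasible.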

2 Praxis

Jlin and Holly Herndon show off some artistic use of messed-up neural nets.

Hung-yi Lee and Yu Tsao, Generative Adversarial Nets for DSP.

3 Incoming

What is Loris?

Soundtracking audio from video.

Andy Sarroff, Musical Audio Synthesis Using Autoencoding Neural Nets. (code)

4 Products

5 References

Blaauw, and Bonada. 2017. “A Neural Parametric Singing Synthesizer.” arXiv:1704.03809 [Cs].
Carr, and Zukowski. 2018. “Generating Albums with SampleRNN to Imitate Metal, Rock, and Punk Bands.” arXiv:1811.06633 [Cs, Eess].
Chen, Zhang, Zen, et al. 2020. “WaveGrad: Estimating Gradients for Waveform Generation.”
Dieleman, Oord, and Simonyan. 2018. “The Challenge of Realistic Music Generation: Modelling Raw Audio at Scale.” In Advances in Neural Information Processing Systems.
Du, Collins, Tenenbaum, et al. 2021. “Learning Signal-Agnostic Manifolds of Neural Fields.” In Advances in Neural Information Processing Systems.
Dupont, Kim, Eslami, et al. 2022. “From Data to Functa: Your Data Point Is a Function and You Can Treat It Like One.” In Proceedings of the 39th International Conference on Machine Learning.
Elbaz, and Zibulevsky. 2017. “Perceptual Audio Loss Function for Deep Learning.” In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR’2017), Suzhou, China.
Engel, Resnick, Roberts, et al. 2017. “Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders.” In PMLR.
Goel, Gu, Donahue, et al. 2022. “It’s Raw! Audio Generation with State-Space Models.”
Grais, Ward, and Plumbley. 2018. “Raw Multi-Channel Audio Source Separation Using Multi-Resolution Convolutional Auto-Encoders.” arXiv:1803.00702 [Cs].
Hernandez-Olivan, Hernandez-Olivan, and Beltran. 2022. “A Survey on Artificial Intelligence for Music Generation: Agents, Domains and Perspectives.”
Kong, Ping, Huang, et al. 2021. “DiffWave: A Versatile Diffusion Model for Audio Synthesis.”
Kreuk, Synnaeve, Polyak, et al. 2022. “AudioGen: Textually Guided Audio Generation.”
Kreuk, Taigman, Polyak, et al. 2022. “Audio Language Modeling Using Perceptually-Guided Discrete Representations.”
Lee, and Han. 2021. “NU-Wave: A Diffusion Probabilistic Model for Neural Audio Upsampling.” In Interspeech 2021.
Liu, Thoshkahna, Milani, et al. 2020. “Voice and Accompaniment Separation in Music Using Self-Attention Convolutional Neural Network.”
Liutkus, Badeau, and Richard. 2011. “Gaussian Processes for Underdetermined Source Separation.” IEEE Transactions on Signal Processing.
Luo, Du, Tarr, et al. 2021. “Learning Neural Acoustic Fields.” In.
Mehri, Kumar, Gulrajani, et al. 2017. “SampleRNN: An Unconditional End-to-End Neural Audio Generation Model.” In Proceedings of International Conference on Learning Representations (ICLR) 2017.
Pascual, Bhattacharya, Yeh, et al. 2022. “Full-Band General Audio Synthesis with Score-Based Diffusion.”
Sarroff, and Casey. 2014. “Musical Audio Synthesis Using Autoencoding Neural Nets.” In.
Schlüter, and Böck. 2014. “Improved Musical Onset Detection with Convolutional Neural Networks.” In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
Sprechmann, Bruna, and LeCun. 2014. “Audio Source Separation with Discriminative Scattering Networks.” arXiv:1412.7022 [Cs].
Stöter, Uhlich, Liutkus, et al. 2019. “Open-Unmix - A Reference Implementation for Music Source Separation.” Journal of Open Source Software.
Tenenbaum, and Freeman. 2000. “Separating Style and Content with Bilinear Models.” Neural Computation.
Tzinis, Wang, and Smaragdis. 2020. “Sudo Rm -Rf: Efficient Networks for Universal Audio Source Separation.” In.
Venkataramani, and Smaragdis. 2017. “End to End Source Separation with Adaptive Front-Ends.” arXiv:1705.02514 [Cs].
Venkataramani, Subakan, and Smaragdis. 2017. “Neural Network Alternatives to Convolutive Audio Models for Source Separation.” arXiv:1709.07908 [Cs, Eess].
Verma, and Smith. 2018. “Neural Style Transfer for Audio Spectograms.” In 31st Conference on Neural Information Processing Systems (NIPS 2017).
von Platen, Patil, Lozhkov, et al. 2022. “Diffusers: State-of-the-Art Diffusion Models.”
Wyse. 2017. “Audio Spectrogram Representations for Processing with Convolutional Neural Networks.” In Proceedings of the First International Conference on Deep Learning and Music, Anchorage, US, May, 2017 (arXiv:1706.08675v1 [Cs.NE]).
Xu, Wang, Jiang, et al. 2022. “Signal Processing for Implicit Neural Representations.” In.