Neural music synthesis

I have a lot of feelings and ideas about this, but no time to write them down. For now, here are some links and ideas by other people.

Sander Dielemann on waveform-domain neural synthesis. Matt Vitelli on music generation from MP3s (source). Alex Graves on RNN predictive synthesis. Parag Mittal on RNN style transfer.


I’m not massively into spectral-domain synthesis because I think the stationarity assumption is a bit of a stretch (heh). Very much into raw audio me.

Differentiable DSP

This is a really fun idea — do audio processing as normal, but using an NN framework so that the operations are differentiable.

Project site. Github. Twitter intro. Paper. Online supplement. Timbre transfer example. Tutorials.


Pixelrnn turns out to be good at music Dadabots have successfully weaponised samplernn and it’s cute.



Existing generative models for audio have predominantly aimed to directly model time-domain waveforms. MelNet instead aims to model the frequency content of an audio signal. MelNet can be used to model audio unconditionally, making it capable of tasks such as music generation. It can also be conditioned on text and speaker, making it applicable to tasks such as text-to-speech and voice conversion.


Jlin and Holly Herndon show off some artistic use of messed-up neural nets.

Hung-yi Lee and Yu Tsao, Generative Adversarial nets for DSP.


Blaauw, Merlijn, and Jordi Bonada. 2017. A Neural Parametric Singing Synthesizer.” arXiv:1704.03809 [Cs], April.
Carr, C. J., and Zack Zukowski. 2018. Generating Albums with SampleRNN to Imitate Metal, Rock, and Punk Bands.” arXiv:1811.06633 [Cs, Eess], November.
Chen, Nanxin, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, and William Chan. 2020. WaveGrad: Estimating Gradients for Waveform Generation.” arXiv.
Dieleman, Sander, Aäron van den Oord, and Karen Simonyan. 2018. The Challenge of Realistic Music Generation: Modelling Raw Audio at Scale.” In Advances In Neural Information Processing Systems, 11.
Elbaz, Dan, and Michael Zibulevsky. 2017. Perceptual Audio Loss Function for Deep Learning.” In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR’2017), Suzhou, China.
Engel, Jesse, Cinjon Resnick, Adam Roberts, Sander Dieleman, Douglas Eck, Karen Simonyan, and Mohammad Norouzi. 2017. Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders.” In PMLR.
Goel, Karan, Albert Gu, Chris Donahue, and Christopher Ré. 2022. It’s Raw! Audio Generation with State-Space Models.” arXiv.
Grais, Emad M., Dominic Ward, and Mark D. Plumbley. 2018. Raw Multi-Channel Audio Source Separation Using Multi-Resolution Convolutional Auto-Encoders.” arXiv:1803.00702 [Cs], March.
Hernandez-Olivan, Carlos, Javier Hernandez-Olivan, and Jose R. Beltran. 2022. A Survey on Artificial Intelligence for Music Generation: Agents, Domains and Perspectives.” arXiv.
Kong, Zhifeng, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. 2021. DiffWave: A Versatile Diffusion Model for Audio Synthesis.” arXiv.
Kreuk, Felix, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre Défossez, Jade Copet, Devi Parikh, Yaniv Taigman, and Yossi Adi. 2022. AudioGen: Textually Guided Audio Generation.” arXiv.
Kreuk, Felix, Yaniv Taigman, Adam Polyak, Jade Copet, Gabriel Synnaeve, Alexandre Défossez, and Yossi Adi. 2022. Audio Language Modeling Using Perceptually-Guided Discrete Representations.” arXiv.
Lee, Junhyeok, and Seungu Han. 2021. NU-Wave: A Diffusion Probabilistic Model for Neural Audio Upsampling.” In Interspeech 2021, 1634–38.
Liu, Yuzhou, Balaji Thoshkahna, Ali Milani, and Trausti Kristjansson. 2020. Voice and Accompaniment Separation in Music Using Self-Attention Convolutional Neural Network,” March.
Liutkus, Antoine, Roland Badeau, and Gäel Richard. 2011. Gaussian Processes for Underdetermined Source Separation.” IEEE Transactions on Signal Processing 59 (7): 3155–67.
Mehri, Soroush, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron Courville, and Yoshua Bengio. 2017. SampleRNN: An Unconditional End-to-End Neural Audio Generation Model.” In Proceedings of International Conference on Learning Representations (ICLR) 2017.
Pascual, Santiago, Gautam Bhattacharya, Chunghsin Yeh, Jordi Pons, and Joan Serrà. 2022. Full-Band General Audio Synthesis with Score-Based Diffusion.” arXiv.
Platen, Patrick von, Suraj Patil, Anton Lozhkov, Pedro Cuenca, Nathan Lambert, Kashif Rasul, Mishig Davaadorj, and Thomas Wolf. 2022. Diffusers: State-of-the-Art Diffusion Models.” GitHub.
Sarroff, Andy M., and Michael Casey. 2014. Musical Audio Synthesis Using Autoencoding Neural Nets.” In. Ann Arbor, MI: Michigan Publishing, University of Michigan Library.
Schlüter, J., and S. Böck. 2014. Improved Musical Onset Detection with Convolutional Neural Networks.” In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6979–83.
Sprechmann, Pablo, Joan Bruna, and Yann LeCun. 2014. Audio Source Separation with Discriminative Scattering Networks.” arXiv:1412.7022 [Cs], December.
Stöter, Fabian-Robert, Stefan Uhlich, Antoine Liutkus, and Yuki Mitsufuji. 2019. Open-Unmix - A Reference Implementation for Music Source Separation.” Journal of Open Source Software 4 (41): 1667.
Tenenbaum, J. B., and W. T. Freeman. 2000. Separating Style and Content with Bilinear Models.” Neural Computation 12 (6): 1247–83.
Tzinis, Efthymios, Zhepei Wang, and Paris Smaragdis. 2020. “Sudo Rm -Rf: Efficient Networks for Universal Audio Source Separation.” In, 6.
Venkataramani, Shrikant, and Paris Smaragdis. 2017. End to End Source Separation with Adaptive Front-Ends.” arXiv:1705.02514 [Cs], May.
Venkataramani, Shrikant, Y. Cem Subakan, and Paris Smaragdis. 2017. Neural Network Alternatives to Convolutive Audio Models for Source Separation.” arXiv:1709.07908 [Cs, Eess], September.
Verma, Prateek, and Julius O. Smith. 2018. Neural Style Transfer for Audio Spectograms.” In 31st Conference on Neural Information Processing Systems (NIPS 2017).
Wyse, L. 2017. Audio Spectrogram Representations for Processing with Convolutional Neural Networks.” In Proceedings of the First International Conference on Deep Learning and Music, Anchorage, US, May, 2017 (arXiv:1706.08675v1 [Cs.NE]).

No comments yet. Why not leave one?

GitHub-flavored Markdown & a sane subset of HTML is supported.