Neural music synthesis



I have a lot of feelings and ideas about this, but no time to write them down. For now, here are some links and ideas by other people.

Sander Dieleman on waveform-domain neural synthesis. Matt Vitelli on music generation from MP3s (source). Alex Graves on RNN predictive synthesis. Parag Mittal on RNN style transfer.

Models

I’m not massively into spectral-domain synthesis because I think the stationarity assumption is a bit of a stretch (heh). Very much into raw audio me.

Differentiable DSP

This is a really fun idea: do audio processing as normal, but using an NN framework so that the operations are differentiable.

Project site. Github. Twitter intro. Paper. Online supplement. Timbre transfer example. Tutorials.
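To make the differentiable-DSP idea concrete, here is a minimal sketch, not taken from the DDSP project itself. It assumes a toy additive synthesiser (a sum of harmonics of a fixed fundamental) and fits the harmonic amplitudes to a target sound by gradient descent on a mean-squared error. The gradient is written out by hand in plain Python to keep the example dependency-free; in an NN framework, autodiff would supply it. All names and constants (`SAMPLE_RATE`, `F0`, `fit`, and so on) are hypothetical and chosen for illustration.

```python
import math

# Illustrative constants (assumptions, not from any library).
SAMPLE_RATE = 8000   # samples per second
F0 = 200.0           # fundamental frequency in Hz
N_SAMPLES = 400      # exactly 10 periods of F0 at this sample rate


def harmonic_basis(n_harmonics):
    """One sine wave per harmonic k = 1..n_harmonics of F0."""
    return [
        [math.sin(2 * math.pi * k * F0 * n / SAMPLE_RATE) for n in range(N_SAMPLES)]
        for k in range(1, n_harmonics + 1)
    ]


def synthesize(amps, basis):
    """Additive synthesis: weighted sum of the harmonic sine waves."""
    return [sum(a * b[n] for a, b in zip(amps, basis)) for n in range(N_SAMPLES)]


def loss_and_grad(amps, basis, target):
    """Mean-squared error and its gradient with respect to the amplitudes.

    The synthesiser is linear in `amps`, so d(loss)/d(amp_k) is
    2 * mean(error * basis_k). A framework's autodiff would compute this for us.
    """
    output = synthesize(amps, basis)
    err = [y - t for y, t in zip(output, target)]
    loss = sum(e * e for e in err) / N_SAMPLES
    grad = [2.0 * sum(e * b for e, b in zip(err, bk)) / N_SAMPLES for bk in basis]
    return loss, grad


def fit(target, n_harmonics=3, lr=0.5, steps=60):
    """Plain gradient descent on the harmonic amplitudes, starting from silence."""
    basis = harmonic_basis(n_harmonics)
    amps = [0.0] * n_harmonics
    for _ in range(steps):
        _, grad = loss_and_grad(amps, basis, target)
        amps = [a - lr * g for a, g in zip(amps, grad)]
    return amps


# Make a target with known amplitudes, then recover them from the audio alone.
true_amps = [1.0, 0.5, 0.25]
target = synthesize(true_amps, harmonic_basis(3))
fitted = fit(target)
print([round(a, 4) for a in fitted])
```

Amplitudes are the easy case, since the loss is quadratic in them. Fitting the oscillator frequency by gradient descent is a much less well-behaved problem, and this sketch does not attempt it.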

PixelRNN

PixelRNN turns out to be good at music. Dadabots have successfully weaponised SampleRNN, and it’s cute.

MelNet

Existing generative models for audio have predominantly aimed to directly model time-domain waveforms. MelNet instead aims to model the frequency content of an audio signal. MelNet can be used to model audio unconditionally, making it capable of tasks such as music generation. It can also be conditioned on text and speaker, making it applicable to tasks such as text-to-speech and voice conversion.

Praxis

Jlin and Holly Herndon show off some artistic use of messed-up neural nets.

Hung-yi Lee and Yu Tsao, Generative Adversarial nets for DSP.

References

Blaauw, Merlijn, and Jordi Bonada. 2017. “A Neural Parametric Singing Synthesizer.” arXiv:1704.03809 [Cs], April.
Carr, C. J., and Zack Zukowski. 2018. “Generating Albums with SampleRNN to Imitate Metal, Rock, and Punk Bands.” arXiv:1811.06633 [Cs, Eess], November.
Chen, Nanxin, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, and William Chan. 2020. “WaveGrad: Estimating Gradients for Waveform Generation.” arXiv.
Dieleman, Sander, Aäron van den Oord, and Karen Simonyan. 2018. “The Challenge of Realistic Music Generation: Modelling Raw Audio at Scale.” In Advances In Neural Information Processing Systems, 11.
Elbaz, Dan, and Michael Zibulevsky. 2017. “Perceptual Audio Loss Function for Deep Learning.” In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR’2017), Suzhou, China.
Engel, Jesse, Cinjon Resnick, Adam Roberts, Sander Dieleman, Douglas Eck, Karen Simonyan, and Mohammad Norouzi. 2017. “Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders.” In PMLR.
Goel, Karan, Albert Gu, Chris Donahue, and Christopher Ré. 2022. “It’s Raw! Audio Generation with State-Space Models.” arXiv.
Grais, Emad M., Dominic Ward, and Mark D. Plumbley. 2018. “Raw Multi-Channel Audio Source Separation Using Multi-Resolution Convolutional Auto-Encoders.” arXiv:1803.00702 [Cs], March.
Hernandez-Olivan, Carlos, Javier Hernandez-Olivan, and Jose R. Beltran. 2022. “A Survey on Artificial Intelligence for Music Generation: Agents, Domains and Perspectives.” arXiv.
Kong, Zhifeng, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. 2021. “DiffWave: A Versatile Diffusion Model for Audio Synthesis.” arXiv.
Kreuk, Felix, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre Défossez, Jade Copet, Devi Parikh, Yaniv Taigman, and Yossi Adi. 2022. “AudioGen: Textually Guided Audio Generation.” arXiv.
Kreuk, Felix, Yaniv Taigman, Adam Polyak, Jade Copet, Gabriel Synnaeve, Alexandre Défossez, and Yossi Adi. 2022. “Audio Language Modeling Using Perceptually-Guided Discrete Representations.” arXiv.
Lee, Junhyeok, and Seungu Han. 2021. “NU-Wave: A Diffusion Probabilistic Model for Neural Audio Upsampling.” In Interspeech 2021, 1634–38.
Liu, Yuzhou, Balaji Thoshkahna, Ali Milani, and Trausti Kristjansson. 2020. “Voice and Accompaniment Separation in Music Using Self-Attention Convolutional Neural Network,” March.
Liutkus, Antoine, Roland Badeau, and Gaël Richard. 2011. “Gaussian Processes for Underdetermined Source Separation.” IEEE Transactions on Signal Processing 59 (7): 3155–67.
Mehri, Soroush, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron Courville, and Yoshua Bengio. 2017. “SampleRNN: An Unconditional End-to-End Neural Audio Generation Model.” In Proceedings of International Conference on Learning Representations (ICLR) 2017.
Pascual, Santiago, Gautam Bhattacharya, Chunghsin Yeh, Jordi Pons, and Joan Serrà. 2022. “Full-Band General Audio Synthesis with Score-Based Diffusion.” arXiv.
Platen, Patrick von, Suraj Patil, Anton Lozhkov, Pedro Cuenca, Nathan Lambert, Kashif Rasul, Mishig Davaadorj, and Thomas Wolf. 2022. “Diffusers: State-of-the-Art Diffusion Models.” GitHub.
Sarroff, Andy M., and Michael Casey. 2014. “Musical Audio Synthesis Using Autoencoding Neural Nets.” In. Ann Arbor, MI: Michigan Publishing, University of Michigan Library.
Schlüter, J., and S. Böck. 2014. “Improved Musical Onset Detection with Convolutional Neural Networks.” In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6979–83.
Sprechmann, Pablo, Joan Bruna, and Yann LeCun. 2014. “Audio Source Separation with Discriminative Scattering Networks.” arXiv:1412.7022 [Cs], December.
Stöter, Fabian-Robert, Stefan Uhlich, Antoine Liutkus, and Yuki Mitsufuji. 2019. “Open-Unmix - A Reference Implementation for Music Source Separation.” Journal of Open Source Software 4 (41): 1667.
Tenenbaum, J. B., and W. T. Freeman. 2000. “Separating Style and Content with Bilinear Models.” Neural Computation 12 (6): 1247–83.
Tzinis, Efthymios, Zhepei Wang, and Paris Smaragdis. 2020. “Sudo Rm -Rf: Efficient Networks for Universal Audio Source Separation.” In, 6.
Venkataramani, Shrikant, and Paris Smaragdis. 2017. “End to End Source Separation with Adaptive Front-Ends.” arXiv:1705.02514 [Cs], May.
Venkataramani, Shrikant, Y. Cem Subakan, and Paris Smaragdis. 2017. “Neural Network Alternatives to Convolutive Audio Models for Source Separation.” arXiv:1709.07908 [Cs, Eess], September.
Verma, Prateek, and Julius O. Smith. 2018. “Neural Style Transfer for Audio Spectograms.” In 31st Conference on Neural Information Processing Systems (NIPS 2017).
Wyse, L. 2017. “Audio Spectrogram Representations for Processing with Convolutional Neural Networks.” In Proceedings of the First International Conference on Deep Learning and Music, Anchorage, US, May, 2017 (arXiv:1706.08675v1 [Cs.NE]).
