Audio source separation

Decomposing audio into discrete sources, especially commercial tracks into stems. This is in large part a problem of data acquisition since artists do not usually release unmixed versions of their tracks.

The taxonomy here comes from Jordi Pons’ tutorial Waveform-based music processing with deep learning.

Neural approaches

In the time domain, Facebook’s Demucs gets startlingly good performance on MusDB. It is startling because, if I have understood correctly, they train only on MusDB, which is a small data set compared to what big players such as Spotify have access to, so they must have good priors.

In the spectral domain, Deezer have released Spleeter (Hennequin et al. 2019):

Spleeter is the Deezer source separation library with pretrained models, written in Python and using TensorFlow. It makes it easy to train a source separation model (assuming you have a dataset of isolated sources), and provides already-trained state-of-the-art models for performing various flavours of separation:

  • Vocals (singing voice) / accompaniment separation (2 stems)
  • Vocals / drums / bass / other separation (4 stems)
  • Vocals / drums / bass / piano / other separation (5 stems)

They are competing with Open-Unmix (Stöter et al. 2019), which also looks neat and has a live demo.

Wave-U-Net architectures seem popular if one wants to DIY.

GANs seem natural here, although most methods are supervised. Better still would be a properly probabilistic method.

Non-negative matrix factorisation approaches

Factorise the spectrogram! Authors such as (T. Virtanen 2007; Bertin, Badeau, and Vincent 2010; Vincent, Bertin, and Badeau 2008; Févotte, Bertin, and Durrieu 2008; Smaragdis 2004) popularised using non-negative matrix factorisations to identify the “activations” of power spectrograms for music analysis. It didn’t take long for this to be used in resynthesis tasks, by e.g. Aarabi and Peeters (2018), Buch, Quinton, and Sturm (2017) (source, site), Driedger and Pratzlich (2015) (site), and Hoffman, Blei, and Cook (2010). Of course, these methods leave you with a phase retrieval problem. They work really well in resynthesis, where you do not care so much about audio bleed.
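To make the factorisation idea concrete, here is a minimal sketch of NMF on a toy magnitude spectrogram, using the textbook multiplicative updates for the Euclidean cost. It is not the algorithm of any of the papers cited above (several of those use other divergences, such as Itakura-Saito, plus extra priors). The function names `nmf` and `soft_masks` and the toy data are my own assumptions for illustration. The masks at the end show one common way to dodge the phase problem: rather than inventing a phase for each component, scale the mixture spectrogram by a ratio mask and reuse the mixture phase.

```python
import numpy as np


def nmf(V, n_components, n_iter=200, eps=1e-10, seed=0):
    """Approximate V (n_freq x n_frames, non-negative) as W @ H.

    Uses multiplicative updates for the squared Euclidean cost.
    W holds spectral templates, H holds their activations over time.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, n_components)) + eps
    H = rng.random((n_components, n_frames)) + eps
    for _ in range(n_iter):
        # Each update keeps the factors non-negative and does not
        # increase the reconstruction error.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def soft_masks(W, H, eps=1e-10):
    """Ratio mask per component, shape (n_freq, n_components, n_frames).

    The masks sum to (almost exactly) one in every time-frequency bin.
    """
    parts = W[:, :, None] * H[None, :, :]
    return parts / (parts.sum(axis=1, keepdims=True) + eps)


# Toy "spectrogram": two made-up spectral templates with random activations.
rng = np.random.default_rng(1)
true_W = np.array([[1.0, 0.0],
                   [0.8, 0.0],
                   [0.0, 1.0],
                   [0.0, 0.6],
                   [0.2, 0.2]])
true_H = rng.random((2, 40))
V = true_W @ true_H

W, H = nmf(V, n_components=2)
masks = soft_masks(W, H)
# Per-component magnitude estimates; together they add back up to V.
estimates = masks * V[:, None, :]
```

On real audio, `V` would be the magnitude (or power) of an STFT, and each masked spectrogram would be combined with the mixture phase before inverting the STFT. That shortcut is where much of the audio bleed comes from.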

UNMIXER (Smith, Kawasaki, and Goto 2019) is a nifty online interface to loop decomposition in this framework.


That first step might be to find some model which can approximately capture the cyclic and disordered components of the signal. Indeed, Metamorph and smstools, based on a “sinusoids+noise” model, do this kind of decomposition, but they mostly use it for resynthesis in limited ways, not for simulating realisations from the inferred model of an underlying stochastic process. There is an implementation in Csound called ATS which looks interesting.
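As a toy illustration of the “sinusoids+noise” idea, the sketch below fits the single strongest stationary sinusoid in a signal by least squares and treats whatever is left over as the disordered part. This is a deliberate simplification of my own, not the algorithm used by Metamorph, smstools or ATS. The function name `split_sinusoid_and_residual` and the test signal are assumptions made for the example.

```python
import numpy as np


def split_sinusoid_and_residual(x, sample_rate):
    """Split x into (sinusoid, residual, freq) for its strongest partial.

    The frequency is taken from the largest FFT peak, then amplitude and
    phase are fitted by least squares on a cosine/sine pair.
    """
    n = len(x)
    spectrum = np.fft.rfft(x * np.hanning(n))
    freqs = np.fft.rfftfreq(n, d=1.0 / sample_rate)
    peak = np.argmax(np.abs(spectrum[1:])) + 1  # skip the DC bin
    freq = freqs[peak]

    t = np.arange(n) / sample_rate
    basis = np.column_stack([np.cos(2 * np.pi * freq * t),
                             np.sin(2 * np.pi * freq * t)])
    coeffs, *_ = np.linalg.lstsq(basis, x, rcond=None)
    sinusoid = basis @ coeffs
    return sinusoid, x - sinusoid, freq


# Test signal: a 440 Hz tone buried in a little white noise.
sample_rate = 8000
n = 4000
t = np.arange(n) / sample_rate
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 440 * t + 0.3) + 0.1 * rng.standard_normal(n)

sinusoid, residual, freq = split_sinusoid_and_residual(x, sample_rate)
```

Here the residual is roughly the injected noise. A model of the underlying stochastic process would then need to describe both the partial parameters and the statistics of that residual, rather than just resynthesising from them.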

Some non-parametric conditional wavelet density sounds more fun to me, maybe as a Markov random field, although what exact generative model I would fit is still opaque to me. The sequence probably possesses structure at multiple scales, and there is evidence that music might have a recursive grammatical structure which would be hard to learn even if we had a perfect decomposition.


Aarabi, Hadrien Foroughmand, and Geoffroy Peeters. 2018. “Music Retiler: Using NMF2D Source Separation for Audio Mosaicing.” In Proceedings of the Audio Mostly 2018 on Sound in Immersion and Emotion, 27:1–7. AM’18. New York, NY, USA: ACM.
Alvarado, Pablo A., Mauricio A. Alvarez, and Dan Stowell. 2019. “Sparse Gaussian Process Audio Source Separation Using Spectrum Priors in the Time-Domain.” In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 995–99.
Alvarado, Pablo A., and Dan Stowell. 2018. “Efficient Learning of Harmonic Priors for Pitch Detection in Polyphonic Music.” arXiv:1705.07104 [Cs, Stat], November.
Bertin, N., R. Badeau, and E. Vincent. 2010. “Enforcing Harmonicity and Smoothness in Bayesian Non-Negative Matrix Factorization Applied to Polyphonic Music Transcription.” IEEE Transactions on Audio, Speech, and Language Processing 18 (3): 538–49.
Blaauw, Merlijn, and Jordi Bonada. 2017. “A Neural Parametric Singing Synthesizer.” arXiv:1704.03809 [Cs], April.
Blumensath, Thomas, and Mike Davies. 2006. “Sparse and Shift-Invariant Representations of Music.” IEEE Transactions on Audio, Speech and Language Processing 14 (1): 50–57.
Buch, Michael, Elio Quinton, and Bob L Sturm. 2017. “NichtnegativeMatrixFaktorisierungnutzendesKlangsynthesenSystem (NiMFKS): Extensions of NMF-Based Concatenative Sound Synthesis.” In Proceedings of the 20th International Conference on Digital Audio Effects, 7. Edinburgh.
Castro, Pablo de, and Tommaso Dorigo. 2019. “INFERNO: Inference-Aware Neural Optimisation.” Computer Physics Communications 244 (November): 170–79.
Cichocki, A., R. Zdunek, and S. Amari. 2006. “New Algorithms for Non-Negative Matrix Factorization in Applications to Blind Source Separation.” In 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, 5:V–.
Driedger, Jonathan, Meinard Müller, and Sebastian Ewert. 2014. “Improving Time-Scale Modification of Music Signals Using Harmonic-Percussive Separation.” IEEE Signal Processing Letters 21 (1): 105–9.
Driedger, Jonathan, and Meinard Müller. 2016. “A Review of Time-Scale Modification of Music Signals.” Applied Sciences 6 (2): 57.
Driedger, Jonathan, Meinard Müller, and Sascha Disch. 2014. “Extending Harmonic-Percussive Separation of Audio Signals.” In ISMIR, 611–16.
Driedger, Jonathan, and Thomas Pratzlich. 2015. “Let It Bee – Towards NMF-Inspired Audio Mosaicing.” In Proceedings of ISMIR, 7. Malaga.
Elowsson, Anders, and Anders Friberg. 2017. “Long-Term Average Spectrum in Popular Music and Its Relation to the Level of the Percussion.” In Audio Engineering Society Convention 142, 13. Audio Engineering Society.
Févotte, Cédric, Nancy Bertin, and Jean-Louis Durrieu. 2008. “Nonnegative Matrix Factorization with the Itakura-Saito Divergence: With Application to Music Analysis.” Neural Computation 21 (3): 793–830.
Févotte, Cédric, and Jérôme Idier. 2011. “Algorithms for Nonnegative Matrix Factorization with the β-Divergence.” Neural Computation 23 (9): 2421–56.
Fitzgerald, Derry. 2010. “Harmonic/Percussive Separation Using Median Filtering.”
FitzGerald, Derry, Antoine Liutkus, Zafar Rafii, Bryan Pardo, and Laurent Daudet. 2013. “Harmonic/Percussive Separation Using Kernel Additive Modelling.” In Irish Signals & Systems Conference 2014 and 2014 China-Ireland International Conference on Information and Communications Technologies (ISSC 2014/CIICT 2014). 25th IET, 35–40. IET.
Grais, Emad M., Dominic Ward, and Mark D. Plumbley. 2018. “Raw Multi-Channel Audio Source Separation Using Multi-Resolution Convolutional Auto-Encoders.” arXiv:1803.00702 [Cs], March.
Gribonval, R. 2003. “Piecewise Linear Source Separation.” In Proc. Soc. Photographic Instrumentation Eng., 5207:297–310. San Diego, CA, USA.
Helén, M., and T. Virtanen. 2005. “Separation of Drums from Polyphonic Music Using Non-Negative Matrix Factorization and Support Vector Machine.” In Signal Processing Conference, 2005 13th European, 1–4.
Hennequin, Romain, Anis Khlif, Felix Voituret, and Manuel Moussallam. 2019. “Spleeter: A Fast and State-of-the Art Music Source Separation Tool with Pre-Trained Models.” In, 2.
Hoffman, Matthew D, David M Blei, and Perry R Cook. 2010. “Bayesian Nonparametric Matrix Factorization for Recorded Music.” In International Conference on Machine Learning, 8.
Hsieh, H., and J. Chien. 2011. “Nonstationary and Temporally Correlated Source Separation Using Gaussian Process.” In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2120–23.
Jayaram, Vivek, and John Thickstun. 2020. “Source Separation with Deep Generative Priors.” arXiv:2002.07942 [Cs, Stat], February.
Klapuri, A., T. Virtanen, and T. Heittola. 2010. “Sound Source Separation in Monaural Music Signals Using Excitation-Filter Model and Em Algorithm.” In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 5510–13.
Lakatos, Stephen. 2000. “A Common Perceptual Space for Harmonic and Percussive Timbres.” Perception & Psychophysics 62 (7): 1426–39.
Laroche, Clément, Hélène Papadopoulos, Matthieu Kowalski, and Gaël Richard. 2017. “Drum Extraction in Single Channel Audio Signals Using Multi-Layer Non Negative Matrix Factor Deconvolution.” In ICASSP. New Orleans, United States.
Leglaive, Simon, Roland Badeau, and Gaël Richard. 2017. “Multichannel Audio Source Separation: Variational Inference of Time-Frequency Sources from Time-Domain Observations.” In 42nd International Conference on Acoustics, Speech and Signal Processing (ICASSP). New Orleans, LA, United States: IEEE.
Levin, David N. 2017. “The Inner Structure of Time-Dependent Signals.” arXiv:1703.08596 [Cs, Math, Stat], March.
Liu, Yuzhou, Balaji Thoshkahna, Ali Milani, and Trausti Kristjansson. 2020. “Voice and Accompaniment Separation in Music Using Self-Attention Convolutional Neural Network,” March.
Liutkus, Antoine, Roland Badeau, and Gaël Richard. 2011. “Gaussian Processes for Underdetermined Source Separation.” IEEE Transactions on Signal Processing 59 (7): 3155–67.
Liutkus, Antoine, Zafar Rafii, Bryan Pardo, Derry Fitzgerald, and Laurent Daudet. 2014. “Kernel Spectrogram Models for Source Separation.” In, 6–10. IEEE.
Ma, Ning, Phil Green, Jon Barker, and André Coy. 2007. “Exploiting Correlogram Structure for Robust Speech Recognition with Multiple Speech Sources.” Speech Communication 49 (12): 874–91.
Miron, Marius, Julio J. Carabias-Orti, Juan J. Bosch, Emilia Gómez, and Jordi Janer. 2016. “Score-Informed Source Separation for Multichannel Orchestral Recordings.” Journal of Electrical and Computer Engineering 2016 (December): e8363507.
Ó Nuanáin, Cárthach, Sergi Jordà Puig, and Perfecto Herrera Boyer. 2016. “An Interactive Software Instrument for Real-Time Rhythmic Concatenative Synthesis.”
Ono, N., K. Miyamoto, J. Le Roux, H. Kameoka, and S. Sagayama. 2008. “Separation of a Monaural Audio Signal into Harmonic/Percussive Components by Complementary Diffusion on Spectrogram.” In Signal Processing Conference, 2008 16th European, 1–4.
Ono, Nobutaka, Kenichi Miyamoto, Hirokazu Kameoka, and Shigeki Sagayama. 2008. “A Real-Time Equalizer of Harmonic and Percussive Components in Music Signals.” In ISMIR, 139–44.
Park, S., and S. Choi. 2008. “Gaussian Processes for Source Separation.” In 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, 1909–12.
Pham, Dinh-Tuan, and Jean-François Cardoso. 2001. “Blind Separation of Instantaneous Mixtures of Nonstationary Sources.” IEEE Transactions on Signal Processing 49 (9): 1837–48.
Prétet, Laure, Romain Hennequin, Jimena Royo-Letelier, and Andrea Vaglio. 2019. “Singing Voice Separation: A Study on Training Data.” In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 506–10.
Qian, Kaizhi, Yang Zhang, Shiyu Chang, David Cox, and Mark Hasegawa-Johnson. 2020. “Unsupervised Speech Decomposition via Triple Information Bottleneck.” arXiv:2004.11284 [Cs, Eess], August.
Routtenberg, Tirza, and Joseph Tabrikian. 2010. “Blind MIMO-AR System Identification and Source Separation with Finite-Alphabet.” IEEE Transactions on Signal Processing 58 (3): 990–1000.
Särelä, Jaakko, and Harri Valpola. 2005. “Denoising Source Separation.” Journal of Machine Learning Research 6 (Mar): 233–72.
Schlüter, J., and S. Böck. 2014. “Improved Musical Onset Detection with Convolutional Neural Networks.” In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6979–83.
Smaragdis, Paris. 2004. “Non-Negative Matrix Factor Deconvolution; Extraction of Multiple Sound Sources from Monophonic Inputs.” In Independent Component Analysis and Blind Signal Separation, edited by Carlos G. Puntonet and Alberto Prieto, 494–99. Lecture Notes in Computer Science. Granada, Spain: Springer Berlin Heidelberg.
Smaragdis, Paris, Cedric Fevotte, Gautham J. Mysore, Nasser Mohammadiha, and Matthew Hoffman. 2014. “Static and Dynamic Source Separation Using Nonnegative Factorizations: A Unified View.” IEEE Signal Processing Magazine 31 (3): 66–75.
Smith, Jordan B L, Yuta Kawasaki, and Masataka Goto. 2019. “Unmixer: An Interface for Extracting and Remixing Loops.” In, 8.
Sprechmann, Pablo, Joan Bruna, and Yann LeCun. 2014. “Audio Source Separation with Discriminative Scattering Networks.” arXiv:1412.7022 [Cs], December.
Stöter, Fabian-Robert, Stefan Uhlich, Antoine Liutkus, and Yuki Mitsufuji. 2019. “Open-Unmix - A Reference Implementation for Music Source Separation.” Journal of Open Source Software 4 (41): 1667.
Tachibana, Hideyuki, Nobutaka Ono, and Shigeki Sagayama. 2014. “Singing Voice Enhancement in Monaural Music Signals Based on Two-Stage Harmonic/Percussive Sound Separation on Multiple Resolution Spectrograms.” Audio, Speech, and Language Processing, IEEE/ACM Transactions on 22 (1): 228–37.
Tachibana, H., H. Kameoka, N. Ono, and S. Sagayama. 2012. “Comparative Evaluations of Various Harmonic/Percussive Sound Separation Algorithms Based on Anisotropic Continuity of Spectrogram.” In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 465–68.
Tenenbaum, J. B., and W. T. Freeman. 2000. “Separating Style and Content with Bilinear Models.” Neural Computation 12 (6): 1247–83.
Turner, Richard E., and Maneesh Sahani. 2014. “Time-Frequency Analysis as Probabilistic Inference.” IEEE Transactions on Signal Processing 62 (23): 6171–83.
Tzinis, Efthymios, Zhepei Wang, and Paris Smaragdis. 2020. “Sudo Rm -Rf: Efficient Networks for Universal Audio Source Separation.” In, 6.
Venkataramani, Shrikant, and Paris Smaragdis. 2017. “End to End Source Separation with Adaptive Front-Ends.” arXiv:1705.02514 [Cs], May.
Venkataramani, Shrikant, Y. Cem Subakan, and Paris Smaragdis. 2017. “Neural Network Alternatives to Convolutive Audio Models for Source Separation.” arXiv:1709.07908 [Cs, Eess], September.
Vincent, E., N. Bertin, and R. Badeau. 2008. “Harmonic and Inharmonic Nonnegative Matrix Factorization for Polyphonic Pitch Transcription.” In 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, 109–12.
Virtanen, T. 2007. “Monaural Sound Source Separation by Nonnegative Matrix Factorization With Temporal Continuity and Sparseness Criteria.” IEEE Transactions on Audio, Speech, and Language Processing 15 (3): 1066–74.
Virtanen, Tuomas. 2006. “Unsupervised Learning Methods for Source Separation in Monaural Music Signals.” In Signal Processing Methods for Music Transcription, 267–96. Springer.
Yoshii, Kazuyoshi. 2013. “Beyond NMF: Time-Domain Audio Source Separation Without Phase Reconstruction,” 6.
———. 2018. “Correlated Tensor Factorization for Audio Source Separation.” In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 731–35.
Yoshii, Kazuyoshi, Katsutoshi Itoyama, and Masataka Goto. 2016. “Student’s T Nonnegative Matrix Factorization and Positive Semidefinite Tensor Factorization for Single-Channel Audio Source Separation.” In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 51–55.