Audio source separation

2019-11-04 — 2019-11-26

Wherein the decomposition of commercial recordings into stems is described, it is noted that scarce isolated-track corpora constrain training and that models are used to yield separate vocals, drums, bass, and accompaniment

Decomposing audio into discrete sources, especially commercial tracks into stems. This is in large part a problem of data acquisition since artists do not usually release unmixed versions of their tracks.

The taxonomy here comes from Jordi Pons’ tutorial Waveform-based music processing with deep learning.

1 Neural approaches

In the time domain, Facebook’s demucs gets startlingly good performance on MusDB. It is startling because, if I have understood correctly, they train only on MusDB, which is a small data set compared to what big players such as Spotify have access to, so they must have done well with good priors.

In the spectral domain, Deezer has released Spleeter (Hennequin et al. 2019).

Spleeter is the Deezer source separation library with pretrained models; it is written in Python and uses TensorFlow. It makes it easy to train source separation models (assuming you have a dataset of isolated sources), and provides already-trained state-of-the-art models for performing various flavours of separation:

Vocals (singing voice) / accompaniment separation (2 stems)

Vocals / drums / bass / other separation (4 stems)

Vocals / drums / bass / piano / other separation (5 stems)

They are competing with Open-Unmix (Stöter et al. 2019), which also looks neat and has a live demo.

Wave-U-Net architectures seem popular if one wants to DIY.

GANs seem natural here, although most methods are supervised. Better still, to my mind, would be a fully probabilistic method.

1.1 Non-negative matrix factorisation approaches

Factorise the spectrogram! Authors such as T. Virtanen (2007), Bertin, Badeau, and Vincent (2010), Vincent, Bertin, and Badeau (2008), Févotte, Bertin, and Durrieu (2008) and Smaragdis (2004) popularised using non-negative matrix factorisations to identify the “activations” of power spectrograms for music analysis. It didn’t take long for this to be used in resynthesis tasks, e.g. by Aarabi and Peeters (2018), Buch, Quinton, and Sturm (2017), Driedger and Pratzlich (2015) and Hoffman, Blei, and Cook (2010). Of course, these methods factorise magnitudes and discard phase, so they leave you with a phase retrieval problem when you want audio back. They work really well in resynthesis, where you do not care so much about audio bleed.
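To make “factorise the spectrogram” concrete, here is a minimal sketch of NMF with the classic multiplicative updates for the Euclidean cost. It is not the method of any particular paper cited above; several of those use other divergences (e.g. Itakura-Saito) and extra priors. The function names, the toy “spectrogram” and the choice of plain Python lists are my own assumptions for illustration. In practice you would use NumPy and a real magnitude spectrogram.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    b_cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def transpose(a):
    """Swap rows and columns of a list-of-rows matrix."""
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Euclidean distance between V and its approximation W H."""
    approx = matmul(w, h)
    return sum((x - y) ** 2
               for v_row, a_row in zip(v, approx)
               for x, y in zip(v_row, a_row)) ** 0.5


def nmf(v, rank, n_iter=200, eps=1e-9, seed=0):
    """Factorise non-negative V (freq x time) as W (freq x rank) times H (rank x time)."""
    rng = random.Random(seed)
    n_freq, n_frames = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n_freq)]
    h = [[rng.random() + eps for _ in range(n_frames)] for _ in range(rank)]
    for _ in range(n_iter):
        # Activations: H <- H * (W^T V) / (W^T W H), elementwise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[x * n / (d + eps) for x, n, d in zip(*rows)]
             for rows in zip(h, num, den)]
        # Spectral templates: W <- W * (V H^T) / (W H H^T), elementwise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[x * n / (d + eps) for x, n, d in zip(*rows)]
             for rows in zip(w, num, den)]
    return w, h


# Toy "spectrogram": two hypothetical spectral templates switched on and off over time.
templates = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.5, 0.5]]
activations = [[1.0, 1.0, 0.0, 0.0, 1.0, 0.0],
               [0.0, 1.0, 1.0, 0.0, 0.0, 1.0]]
v = matmul(templates, activations)

w, h = nmf(v, rank=2)
print("reconstruction error:", frobenius_error(v, w, h))
```

The columns of `w` play the role of per-source spectra and the rows of `h` are their activations over time. Multiplying one column by its row gives a per-source magnitude estimate, which still needs phase from somewhere before it is audio.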

UNMIXER (Smith, Kawasaki, and Goto 2019) is a nifty online interface to loop decomposition in this framework.

1.2 Harmonic-percussive source separation

Harmonic-percussive separation needs explanation. 🚧TODO🚧 clarify (Hideyuki Tachibana, Ono, and Sagayama 2014; Driedger, Muller, and Ewert 2014; FitzGerald et al. 2013; Lakatos 2000; N. Ono et al. 2008; Fitzgerald 2010; H. Tachibana et al. 2012; Driedger, Müller, and Disch 2014; Nobutaka Ono et al. 2008; Schlüter and Böck 2014; Laroche et al. 2017; Elowsson and Friberg 2017; Driedger and Müller 2016)
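As a placeholder for that explanation, here is a rough sketch of the median-filtering idea, in the spirit of Fitzgerald (2010). Sustained tones are horizontal ridges in a spectrogram and drum hits are vertical ones, so a median filter along time keeps the former and one along frequency keeps the latter. The function names, window width, edge handling and toy spectrogram are all my assumptions, not taken from the paper.

```python
from statistics import median


def median_filter(values, width):
    """Median-filter a sequence with a centred window, truncated at the edges."""
    half = width // 2
    return [median(values[max(0, i - half): i + half + 1])
            for i in range(len(values))]


def hpss(spec, width=5, power=2.0, eps=1e-12):
    """Split a magnitude spectrogram (rows = frequency bins, columns = frames)
    into harmonic and percussive parts using soft masks."""
    # Harmonic estimate: smooth each frequency bin along time.
    harm = [median_filter(list(row), width) for row in spec]
    # Percussive estimate: smooth each frame along frequency.
    perc_cols = [median_filter(list(col), width) for col in zip(*spec)]
    perc = [list(row) for row in zip(*perc_cols)]

    harmonic, percussive = [], []
    for s_row, h_row, p_row in zip(spec, harm, perc):
        h_out, p_out = [], []
        for s, h, p in zip(s_row, h_row, p_row):
            total = h ** power + p ** power + eps
            h_out.append(s * h ** power / total)  # soft mask for harmonic part
            p_out.append(s * p ** power / total)  # soft mask for percussive part
        harmonic.append(h_out)
        percussive.append(p_out)
    return harmonic, percussive


# Toy spectrogram: a sustained tone in bin 2 and a click at frame 6.
n_bins, n_frames = 9, 9
spec = [[0.0] * n_frames for _ in range(n_bins)]
for t in range(n_frames):
    spec[2][t] = 1.0
for f in range(n_bins):
    spec[f][6] = 1.0

harmonic, percussive = hpss(spec)
print("tone energy kept in harmonic part:", harmonic[2][0])
print("click energy kept in percussive part:", percussive[0][6])
```

On the toy input the tone ends up almost entirely in `harmonic` and the click in `percussive`. Where they cross, the energy is split evenly between the two.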

2 Noise+sinusoids

A first step might be to find some model which can approximately capture the cyclic and disordered components of the signal. Indeed, Metamorph and smstools, both based on a “sinusoids+noise” model, do this kind of decomposition, but they mostly use it for resynthesis in limited ways, not for simulating realisations from the inferred model of an underlying stochastic process. There is also an implementation in Csound called ATS, which might be interesting.
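For intuition, here is a deliberately crude sketch of a sinusoids+noise split on a single frame: take a DFT, keep the few strongest bins as the “sinusoids”, and call whatever is left “noise”. This is not how Metamorph, smstools or ATS do it; as far as I understand, real systems track peaks across short-time frames and model the residual spectrally. All names and parameters below are hypothetical.

```python
import cmath
import math
import random


def dft(x):
    """Naive O(n^2) discrete Fourier transform; fine for a short frame."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def sinusoids_plus_noise(x, n_partials):
    """Split a real frame into its n_partials strongest sinusoids and a residual."""
    n = len(x)
    spectrum = dft(x)
    # Positive-frequency bins only, skipping DC and (for even n) Nyquist.
    candidates = range(1, (n + 1) // 2)
    peaks = sorted(candidates, key=lambda k: abs(spectrum[k]), reverse=True)[:n_partials]
    sinusoidal = [0.0] * n
    for k in peaks:
        # Bin k and its conjugate mirror together form one real cosine.
        amp = 2 * abs(spectrum[k]) / n
        phase = cmath.phase(spectrum[k])
        for t in range(n):
            sinusoidal[t] += amp * math.cos(2 * math.pi * k * t / n + phase)
    residual = [a - b for a, b in zip(x, sinusoidal)]
    return sinusoidal, residual


def energy(x):
    return sum(v * v for v in x)


# Toy frame: two partials that sit exactly on DFT bins, plus a little Gaussian noise.
rng = random.Random(0)
n = 128
x = [math.cos(2 * math.pi * 5 * t / n)
     + 0.5 * math.cos(2 * math.pi * 12 * t / n + 0.3)
     + rng.gauss(0.0, 0.05)
     for t in range(n)]

sinusoidal, residual = sinusoids_plus_noise(x, n_partials=2)
print("fraction of energy left in residual:", energy(residual) / energy(x))
```

The residual here is only a sample of the noise, not a model of it, which is the limitation complained about above: to simulate new realisations you would need to fit a stochastic model to that residual.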

Some non-parametric conditional wavelet density sounds more fun to me, maybe as a Markov random field, although what exact generative model I would fit is still opaque to me. The sequence probably possesses structure at multiple scales, and there is evidence that music might have a recursive grammatical structure, which would be hard to learn even if we had a perfect decomposition.

3 Incoming

Live examples:

4 References

Aarabi, and Peeters. 2018. “Music Retiler: Using NMF2D Source Separation for Audio Mosaicing.” In Proceedings of the Audio Mostly 2018 on Sound in Immersion and Emotion. AM’18.

Alvarado, Alvarez, and Stowell. 2019. “Sparse Gaussian Process Audio Source Separation Using Spectrum Priors in the Time-Domain.” In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

Alvarado, and Stowell. 2018. “Efficient Learning of Harmonic Priors for Pitch Detection in Polyphonic Music.” arXiv:1705.07104 [Cs, Stat].

Bertin, Badeau, and Vincent. 2010. “Enforcing Harmonicity and Smoothness in Bayesian Non-Negative Matrix Factorization Applied to Polyphonic Music Transcription.” IEEE Transactions on Audio, Speech, and Language Processing.

Blaauw, and Bonada. 2017. “A Neural Parametric Singing Synthesizer.” arXiv:1704.03809 [Cs].

Blumensath, and Davies. 2006. “Sparse and Shift-Invariant Representations of Music.” IEEE Transactions on Audio, Speech and Language Processing.

Buch, Quinton, and Sturm. 2017. “NichtnegativeMatrixFaktorisierungnutzendesKlangsynthesenSystem (NiMFKS): Extensions of NMF-Based Concatenative Sound Synthesis.” In Proceedings of the 20th International Conference on Digital Audio Effects.

Cichocki, Zdunek, and Amari. 2006. “New Algorithms for Non-Negative Matrix Factorization in Applications to Blind Source Separation.” In 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

de Castro, and Dorigo. 2019. “INFERNO: Inference-Aware Neural Optimisation.” Computer Physics Communications.

Driedger, and Müller. 2016. “A Review of Time-Scale Modification of Music Signals.” Applied Sciences.

Driedger, Müller, and Disch. 2014. “Extending Harmonic-Percussive Separation of Audio Signals.” In ISMIR.

Driedger, Muller, and Ewert. 2014. “Improving Time-Scale Modification of Music Signals Using Harmonic-Percussive Separation.” IEEE Signal Processing Letters.

Driedger, and Pratzlich. 2015. “Let It Bee – Towards NMF-Inspired Audio Mosaicing.” In Proceedings of ISMIR.

Elowsson, and Friberg. 2017. “Long-Term Average Spectrum in Popular Music and Its Relation to the Level of the Percussion.” In Audio Engineering Society Convention 142.

Févotte, Bertin, and Durrieu. 2008. “Nonnegative Matrix Factorization with the Itakura-Saito Divergence: With Application to Music Analysis.” Neural Computation.

Févotte, and Idier. 2011. “Algorithms for Nonnegative Matrix Factorization with the β-Divergence.” Neural Computation.

Fitzgerald. 2010. “Harmonic/Percussive Separation Using Median Filtering.”

FitzGerald, Liutkus, Rafii, et al. 2013. “Harmonic/Percussive Separation Using Kernel Additive Modelling.” In Irish Signals & Systems Conference 2014 and 2014 China-Ireland International Conference on Information and Communications Technologies (ISSC 2014/CIICT 2014). 25th IET.

Grais, Ward, and Plumbley. 2018. “Raw Multi-Channel Audio Source Separation Using Multi-Resolution Convolutional Auto-Encoders.” arXiv:1803.00702 [Cs].

Gribonval. 2003. “Piecewise Linear Source Separation.” In Proc. Soc. Photographic Instrumentation Eng.

Helén, and Virtanen. 2005. “Separation of Drums from Polyphonic Music Using Non-Negative Matrix Factorization and Support Vector Machine.” In Signal Processing Conference, 2005 13th European.

Hennequin, Khlif, Voituret, et al. 2019. “Spleeter: A Fast and State-of-the Art Music Source Separation Tool with Pre-Trained Models.” In.

Hoffman, Blei, and Cook. 2010. “Bayesian Nonparametric Matrix Factorization for Recorded Music.” In International Conference on Machine Learning.

Hsieh, and Chien. 2011. “Nonstationary and Temporally Correlated Source Separation Using Gaussian Process.” In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

Jayaram, and Thickstun. 2020. “Source Separation with Deep Generative Priors.” arXiv:2002.07942 [Cs, Stat].

Klapuri, Virtanen, and Heittola. 2010. “Sound Source Separation in Monaural Music Signals Using Excitation-Filter Model and EM Algorithm.” In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

Lakatos. 2000. “A Common Perceptual Space for Harmonic and Percussive Timbres.” Perception & Psychophysics.

Laroche, Papadopoulos, Kowalski, et al. 2017. “Drum Extraction in Single Channel Audio Signals Using Multi-Layer Non Negative Matrix Factor Deconvolution.” In ICASSP.

Leglaive, Badeau, and Richard. 2017. “Multichannel Audio Source Separation: Variational Inference of Time-Frequency Sources from Time-Domain Observations.” In 42nd International Conference on Acoustics, Speech and Signal Processing (ICASSP).

Levin. 2017. “The Inner Structure of Time-Dependent Signals.” arXiv:1703.08596 [Cs, Math, Stat].

Liu, Thoshkahna, Milani, et al. 2020. “Voice and Accompaniment Separation in Music Using Self-Attention Convolutional Neural Network.”

Liutkus, Badeau, and Richard. 2011. “Gaussian Processes for Underdetermined Source Separation.” IEEE Transactions on Signal Processing.

Liutkus, Rafii, Pardo, et al. 2014. “Kernel Spectrogram Models for Source Separation.” In.

Ma, Green, Barker, et al. 2007. “Exploiting Correlogram Structure for Robust Speech Recognition with Multiple Speech Sources.” Speech Communication.

Miron, Carabias-Orti, Bosch, et al. 2016. “Score-Informed Source Separation for Multichannel Orchestral Recordings.” Journal of Electrical and Computer Engineering.

Ó Nuanáin, Jordà Puig, and Herrera Boyer. 2016. “An interactive software instrument for real-time rhythmic concatenative synthesis.”

Ono, Nobutaka, Miyamoto, Kameoka, et al. 2008. “A Real-Time Equalizer of Harmonic and Percussive Components in Music Signals.” In ISMIR.

Ono, N., Miyamoto, Le Roux, et al. 2008. “Separation of a Monaural Audio Signal into Harmonic/Percussive Components by Complementary Diffusion on Spectrogram.” In Signal Processing Conference, 2008 16th European.

Park, and Choi. 2008. “Gaussian Processes for Source Separation.” In 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

Pham, and Cardoso. 2001. “Blind Separation of Instantaneous Mixtures of Nonstationary Sources.” IEEE Transactions on Signal Processing.

Prétet, Hennequin, Royo-Letelier, et al. 2019. “Singing Voice Separation: A Study on Training Data.” In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

Qian, Zhang, Chang, et al. 2020. “Unsupervised Speech Decomposition via Triple Information Bottleneck.” arXiv:2004.11284 [Cs, Eess].

Routtenberg, and Tabrikian. 2010. “Blind MIMO-AR System Identification and Source Separation with Finite-Alphabet.” IEEE Transactions on Signal Processing.

Särelä, and Valpola. 2005. “Denoising Source Separation.” Journal of Machine Learning Research.

Schlüter, and Böck. 2014. “Improved Musical Onset Detection with Convolutional Neural Networks.” In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

Smaragdis. 2004. “Non-Negative Matrix Factor Deconvolution; Extraction of Multiple Sound Sources from Monophonic Inputs.” In Independent Component Analysis and Blind Signal Separation. Lecture Notes in Computer Science.

Smaragdis, Fevotte, Mysore, et al. 2014. “Static and Dynamic Source Separation Using Nonnegative Factorizations: A Unified View.” IEEE Signal Processing Magazine.

Smith, Kawasaki, and Goto. 2019. “Unmixer: An Interface for Extracting and Remixing Loops.” In.

Sprechmann, Bruna, and LeCun. 2014. “Audio Source Separation with Discriminative Scattering Networks.” arXiv:1412.7022 [Cs].

Stöter, Uhlich, Liutkus, et al. 2019. “Open-Unmix - A Reference Implementation for Music Source Separation.” Journal of Open Source Software.

Tachibana, H., Kameoka, Ono, et al. 2012. “Comparative Evaluations of Various Harmonic/Percussive Sound Separation Algorithms Based on Anisotropic Continuity of Spectrogram.” In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

Tachibana, Hideyuki, Ono, and Sagayama. 2014. “Singing Voice Enhancement in Monaural Music Signals Based on Two-Stage Harmonic/Percussive Sound Separation on Multiple Resolution Spectrograms.” Audio, Speech, and Language Processing, IEEE/ACM Transactions on.

Tenenbaum, and Freeman. 2000. “Separating Style and Content with Bilinear Models.” Neural Computation.

Turner, and Sahani. 2014. “Time-Frequency Analysis as Probabilistic Inference.” IEEE Transactions on Signal Processing.

Tzinis, Wang, and Smaragdis. 2020. “Sudo rm -rf: Efficient Networks for Universal Audio Source Separation.” In.

Venkataramani, and Smaragdis. 2017. “End to End Source Separation with Adaptive Front-Ends.” arXiv:1705.02514 [Cs].

Venkataramani, Subakan, and Smaragdis. 2017. “Neural Network Alternatives to Convolutive Audio Models for Source Separation.” arXiv:1709.07908 [Cs, Eess].

Vincent, Bertin, and Badeau. 2008. “Harmonic and Inharmonic Nonnegative Matrix Factorization for Polyphonic Pitch Transcription.” In 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

Virtanen, Tuomas. 2006. “Unsupervised Learning Methods for Source Separation in Monaural Music Signals.” In Signal Processing Methods for Music Transcription.

Virtanen, T. 2007. “Monaural Sound Source Separation by Nonnegative Matrix Factorization With Temporal Continuity and Sparseness Criteria.” IEEE Transactions on Audio, Speech, and Language Processing.

Yoshii. 2013. “Beyond NMF: Time-Domain Audio Source Separation Without Phase Reconstruction.”

———. 2018. “Correlated Tensor Factorization for Audio Source Separation.” In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

Yoshii, Itoyama, and Goto. 2016. “Student’s T Nonnegative Matrix Factorization and Positive Semidefinite Tensor Factorization for Single-Channel Audio Source Separation.” In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).