Analysis/resynthesis of audio

2016-01-15 — 2020-04-08

Wherein Audio Is Analysed by Machine Listening, Sparse Low-Dimensional Features Are Extracted, Stochastic Models Are Fitted and Simulated for Resynthesis, and Concatenative Mosaicing From a Learned Sparse Dictionary Is Employed While Psychoacoustic Cost Functions Are Considered

algebra

generative art

machine learning

machine listening

making things

music

networks

probability

signal processing

statistics

Generative stochastic models for audio. Analyze audio using machine listening methods to decompose it into features, maybe over a sparse basis (as in learning gamelan), and possibly of low dimension thanks to some sparsification or source separation. The features may carry some stochastic dependence, e.g. a random field or a regression model of some kind. Then simulate features from that stochastic model and resynthesize audio from them. Depending on your cost function, how good your model fit was, and how you smoothed your data, this might produce something acoustically indistinguishable from the source, or amount to concatenative synthesis from a sparse basis dictionary, or yield a parametric software synthesizer.
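To make the analyze, fit, simulate, resynthesize loop concrete, here is a minimal sketch under loudly stated assumptions. It is not any particular published method: the decomposition is plain NMF of a magnitude spectrogram, the stochastic model is an independent AR(1) process on each log-activation, and the inversion simply attaches random phase. All function names (`stft`, `istft`, `nmf`, `simulate_ar1`) and the toy two-partial signal are made up for illustration.

```python
import numpy as np

def stft(x, n_fft=256, hop=128):
    """Hann-windowed short-time Fourier transform, shape (freq, frames)."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1).T

def istft(S, n_fft=256, hop=128):
    """Windowed overlap-add inverse of `stft`, normalised by the summed squared window."""
    win = np.hanning(n_fft)
    frames = np.fft.irfft(S.T, n=n_fft, axis=1) * win
    out = np.zeros(hop * (len(frames) - 1) + n_fft)
    norm = np.zeros_like(out)
    for k, frame in enumerate(frames):
        out[k * hop:k * hop + n_fft] += frame
        norm[k * hop:k * hop + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)

def nmf(V, rank, n_iter=200, seed=0):
    """Factor V ~ W @ H using multiplicative updates for the Euclidean cost."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + 1e-3
    H = rng.random((rank, V.shape[1])) + 1e-3
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

def simulate_ar1(H, n_frames, rng):
    """Fit an independent AR(1) to each row of log(H), then simulate a new path."""
    Z = np.log(H + 1e-6)
    mu = Z.mean(axis=1, keepdims=True)
    Zc = Z - mu
    # Least-squares AR(1) coefficient per row, clipped to stay stationary.
    phi = (Zc[:, 1:] * Zc[:, :-1]).sum(axis=1) / ((Zc[:, :-1] ** 2).sum(axis=1) + 1e-12)
    phi = np.clip(phi, -0.99, 0.99)
    sigma = (Zc[:, 1:] - phi[:, None] * Zc[:, :-1]).std(axis=1)
    sim = np.zeros((H.shape[0], n_frames))
    sim[:, 0] = Zc[:, 0]
    for t in range(1, n_frames):
        sim[:, t] = phi * sim[:, t - 1] + sigma * rng.standard_normal(H.shape[0])
    return np.exp(sim + mu)

rng = np.random.default_rng(1)
sr = 8000
t = np.arange(2 * sr) / sr
# Toy "recording": two partials with slowly wandering amplitudes, plus a little noise.
source = (
    (1 + np.sin(2 * np.pi * 0.7 * t)) * np.sin(2 * np.pi * 440 * t)
    + (1 + np.cos(2 * np.pi * 1.1 * t)) * np.sin(2 * np.pi * 660 * t)
    + 0.01 * rng.standard_normal(len(t))
)

V = np.abs(stft(source))                   # analysis: magnitude spectrogram
W, H = nmf(V, rank=4)                      # features: activations H over basis W
H_new = simulate_ar1(H, V.shape[1], rng)   # simulate features from the fitted model
V_new = W @ H_new                          # simulated magnitude spectrogram
phase = np.exp(2j * np.pi * rng.random(V_new.shape))  # crude assumption: random phase
resynth = istft(V_new * phase)             # resynthesis back to a waveform
```

The random-phase step is the weakest link here; an iterative phase-recovery scheme or a vocoder would normally replace it. Likewise the Euclidean NMF cost is only the simplest choice, and which cost you pick is exactly where the perceptual questions come in.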

There is a lot of funny business with machine learning for polyphonic audio. For a start, a naive linear-algebra-style decomposition doesn’t perform well, because human acoustic perception is messy. For example, all white noise sounds the same to us, but a deterministic model needs a large basis to approximate any particular noise signal closely in the \(L_2\) norm. Our phase sensitivity is frequency-dependent. Adjacent frequencies mask each other. There are surely many other such effects that I don’t know about. One could use cost functions based on psychoacoustic cochlear models, but those are tricky to synthesize from (although it is possible, if perhaps unsatisfying, with a neural network). There are also classic alternative psychoacoustic decompositions such as mel-frequency cepstral coefficients (MFCCs), but these are even harder to invert.
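The white-noise point is easy to check numerically. The sketch below draws two independent white-noise bursts, which would sound alike, and compares them two ways: sample by sample in \(L_2\), and through a coarse band-averaged power spectrum. The 16-band averaging is only a crude stand-in for a perceptual representation, not a real psychoacoustic model, and the helper names are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1 << 14  # samples per noise burst

# Two independent realizations of unit-variance white Gaussian noise.
x = rng.standard_normal(n)
y = rng.standard_normal(n)

def rel_l2(a, b):
    """L2 distance between a and b, relative to the norm of a."""
    return np.linalg.norm(a - b) / np.linalg.norm(a)

def band_power(sig, n_bands=16):
    """Average power in n_bands equal-width frequency bands."""
    power = np.abs(np.fft.rfft(sig)) ** 2 / len(sig)
    power = power[1:]  # drop the DC bin so the rest splits evenly into bands
    return power.reshape(n_bands, -1).mean(axis=1)

waveform_error = rel_l2(x, y)                          # large: the waveforms are unrelated
spectrum_error = rel_l2(band_power(x), band_power(y))  # small: the coarse spectra agree
print(f"waveform: {waveform_error:.2f}, band power: {spectrum_error:.2f}")
```

In the waveform domain the relative error sits near \(\sqrt{2}\), which is what you expect for two uncorrelated signals of equal power, while the band powers differ by only a few percent. A cost function that operates on the raw waveform therefore punishes a difference we cannot hear.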

1 Mosaicing synthesis

A.k.a. concatenative synthesis. I’m publishing in this area.

Mosaic Style Transfer using Sparse Autocorrelograms.

More soon.
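In the meantime, a generic sketch of the basic idea, which is not the method of the paper linked above: treat the spectra of grains from a source corpus as a fixed nonnegative dictionary, and explain each target frame as a sparse nonnegative mixture of those grains. Everything here is a toy assumption, including the Gaussian-bump "spectra", the multiplicative update rule, and the keep-the-largest-two sparsification; `bump` and `mosaic_activations` are hypothetical names.

```python
import numpy as np

def bump(center, n_bins=64, width=2.0):
    """A toy magnitude spectrum: one Gaussian peak at `center`."""
    bins = np.arange(n_bins)
    return np.exp(-0.5 * ((bins - center) / width) ** 2)

def mosaic_activations(target, dictionary, n_iter=100, keep=2):
    """Nonnegative activations A such that target ~ dictionary @ A.

    The dictionary stays fixed, which is what distinguishes this from plain
    NMF; only A is updated. After fitting, all but the `keep` largest
    activations in each frame are zeroed as a crude sparsity constraint.
    """
    rng = np.random.default_rng(0)
    A = rng.random((dictionary.shape[1], target.shape[1])) + 1e-3
    for _ in range(n_iter):
        A *= (dictionary.T @ target) / (dictionary.T @ dictionary @ A + 1e-9)
    # Per-frame threshold: the keep-th largest activation in each column.
    threshold = np.sort(A, axis=0)[-keep]
    return np.where(A >= threshold, A, 0.0)

# Source corpus: 8 grains with spectral peaks at bins 4, 12, ..., 60.
centers = np.arange(4, 64, 8)
dictionary = np.stack([bump(c) for c in centers], axis=1)      # shape (64, 8)

# Target: 4 frames whose peaks sit one bin away from grains i and i + 4,
# so no frame is exactly representable by the corpus.
target = np.stack(
    [bump(centers[i] + 1) + 0.5 * bump(centers[i + 4] + 1) for i in range(4)],
    axis=1,
)                                                              # shape (64, 4)

A = mosaic_activations(target, dictionary)
mosaic = dictionary @ A          # target re-expressed using corpus grains only
rel_error = np.linalg.norm(target - mosaic) / np.linalg.norm(target)
```

To get sound out, one would overlap-add the time-domain source grains weighted by the columns of `A`, rather than inverting `mosaic` itself. That way the output keeps the timbre and phase of the corpus while following the target's spectral trajectory.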

2 Neural approaches

See NNs for music.

3 Incoming

What is Loris?
