Analysis/resynthesis of audio

Generative stochastic models for audio. Analyse audio using machine listening methods to decompose it into features, perhaps over a sparse basis (as in learning gamelan), and perhaps of low dimension thanks to sparsification or source separation. Optionally model the stochastic dependence between those features, e.g. with a random field or a regression model of some kind. Then simulate features from that stochastic model and resynthesize audio from them. Depending on what your cost function was, how good your model fit was, and how you smoothed your data, the result might be acoustically indistinguishable from the source, or amount to concatenative synthesis from a sparse basis dictionary, or be a parametric synthesizer software package.
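As a rough illustration of the analyse, model, simulate loop described above, here is a minimal sketch in Python with NumPy. Everything in it is an assumption made for illustration, not a method taken from any of the work linked on this page: non-negative matrix factorisation of a magnitude spectrogram stands in for the sparse decomposition, i.i.d. resampling of activation frames stands in for the stochastic model, and random phase stands in for proper phase reconstruction. The function names are hypothetical.

```python
import numpy as np

def stft_mag(x, n_fft=256, hop=128):
    """Magnitude spectrogram (frequency x time) from Hann-windowed frames."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T

def nmf(V, rank, n_iter=200, seed=0, eps=1e-9):
    """Factorise V ~= W @ H using multiplicative updates for the Euclidean cost."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def simulate_activations(H, n_frames, seed=1):
    """Toy stochastic model (an assumption): resample observed activation frames i.i.d."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, H.shape[1], size=n_frames)
    return H[:, idx]

def resynthesize(mag, n_fft=256, hop=128, seed=2):
    """Overlap-add inverse STFT of a magnitude spectrogram with random phase."""
    rng = np.random.default_rng(seed)
    phase = rng.uniform(0.0, 2.0 * np.pi, size=mag.shape)
    spec = mag * np.exp(1j * phase)
    frames = np.fft.irfft(spec.T, n=n_fft, axis=1) * np.hanning(n_fft)
    out = np.zeros(hop * (mag.shape[1] - 1) + n_fft)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + n_fft] += frame
    return out

# Synthetic "source": two tones with different slow amplitude envelopes.
sr = 8000
t = np.arange(sr) / sr
env1 = 0.5 * (1.0 + np.sin(2 * np.pi * 2 * t))
env2 = 0.5 * (1.0 + np.cos(2 * np.pi * 3 * t))
x = env1 * np.sin(2 * np.pi * 440 * t) + env2 * np.sin(2 * np.pi * 1200 * t)

V = stft_mag(x)                        # analyse
W, H = nmf(V, rank=2)                  # decompose over a small non-negative basis
H_new = simulate_activations(H, 100)   # simulate features from the toy model
y = resynthesize(W @ H_new)            # resynthesize audio
```

Swapping in a better stochastic model (say, a Markov chain over activation frames) only changes `simulate_activations`. Likewise, the random phase is the crudest possible choice and will sound smeared on anything transient.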

There is a lot of funny business with machine learning for polyphonic audio. For a start, a naive linear-algebra-style decomposition doesn’t perform well, because human acoustic perception is messy. For example, all white noise sounds the same to us, but a deterministic model needs a large basis to approximate any particular realization closely in the \(L_2\) norm. Our phase sensitivity is frequency dependent. Adjacent frequencies mask each other. There are many other things I don’t know about. One could use cost functions based on psychoacoustic cochlear models, but those are tricky to synthesize from (although it is possible, if perhaps unsatisfying, with a neural network). There are also classic alternative psychoacoustic representations such as the Mel-frequency cepstrum (MFCCs), but these are even harder to invert.
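The white noise point can be checked numerically. The sketch below (NumPy only; the variable and function names are illustrative assumptions) draws two independent white noise signals, which would sound alike, and compares them sample by sample in the \(L_2\) norm and then by their frame-averaged power spectra.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2 ** 16
a = rng.standard_normal(n)  # one white noise realization
b = rng.standard_normal(n)  # an independent one that would sound the same

# Sample-wise relative L2 distance. For independent unit-variance noise
# this ratio concentrates near sqrt(2): the waveforms are "far apart".
waveform_rel_err = np.linalg.norm(a - b) / np.linalg.norm(a)

def averaged_power_spectrum(x, n_fft=256):
    """Mean power spectrum over non-overlapping Hann-windowed frames."""
    usable = len(x) // n_fft * n_fft
    frames = x[:usable].reshape(-1, n_fft) * np.hanning(n_fft)
    return np.mean(np.abs(np.fft.rfft(frames, axis=1)) ** 2, axis=0)

pa = averaged_power_spectrum(a)
pb = averaged_power_spectrum(b)

# Relative distance between the smoothed spectra, which is much smaller.
spectrum_rel_err = np.linalg.norm(pa - pb) / np.linalg.norm(pa)

print(f"waveform relative L2 error: {waveform_rel_err:.3f}")
print(f"averaged spectrum relative error: {spectrum_rel_err:.3f}")
```

A cost function on the averaged spectrum treats the two signals as nearly identical, while a sample-wise \(L_2\) cost treats them as about as different as two signals of equal energy can be. The averaged spectrum is only a crude stand-in for a perceptual model here.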

Mosaicing synthesis

A.k.a. concatenative synthesis. I’m publishing in this area.
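For orientation, a minimal sketch of the basic idea, assuming the simplest possible design: cut a corpus into grains, and for each frame of a target signal, paste in the corpus grain whose normalised magnitude spectrum is most similar. Nearest-neighbour matching by cosine similarity is one simple choice among many, not the method of any particular paper, and the function names are hypothetical.

```python
import numpy as np

def frames(x, size=512, hop=256):
    """Slice a signal into overlapping grains, one per row."""
    n = 1 + (len(x) - size) // hop
    return np.stack([x[i * hop:i * hop + size] for i in range(n)])

def unit_spectra(grains):
    """Hann-windowed magnitude spectra, scaled to unit Euclidean norm per grain."""
    mag = np.abs(np.fft.rfft(grains * np.hanning(grains.shape[1]), axis=1))
    return mag / (np.linalg.norm(mag, axis=1, keepdims=True) + 1e-12)

def mosaic(target, corpus, size=512, hop=256):
    """Rebuild `target` out of grains taken from `corpus`."""
    tgt = frames(target, size, hop)
    src = frames(corpus, size, hop)
    similarity = unit_spectra(tgt) @ unit_spectra(src).T  # cosine similarity
    best = similarity.argmax(axis=1)                      # best grain per target frame
    window = np.hanning(size)
    out = np.zeros(hop * (len(tgt) - 1) + size)
    for i, j in enumerate(best):
        out[i * hop:i * hop + size] += src[j] * window    # overlap-add chosen grain
    return out, best

# Toy data: the corpus goes low tone then high tone; the target is the reverse.
sr = 8000
t = np.arange(4000) / sr
low = np.sin(2 * np.pi * 300 * t)
high = np.sin(2 * np.pi * 900 * t)
corpus = np.concatenate([low, high])
target = np.concatenate([high, low])

out, best = mosaic(target, corpus)
```

In this toy case the early target frames should pick grains from the second half of the corpus and the late ones from the first half. Real systems add richer features, continuity costs between successive grains, and smarter search.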

More soon.

Neural approaches

Sander Dieleman on waveform-domain neural synthesis.

The most exciting approach here for me is Differentiable DSP. Project site. GitHub. Twitter intro. Paper. Online supplement. Timbre transfer example. Tutorials.

Also around the place are other stunts, e.g.


Existing generative models for audio have predominantly aimed to directly model time-domain waveforms. MelNet instead aims to model the frequency content of an audio signal. MelNet can be used to model audio unconditionally, making it capable of tasks such as music generation. It can also be conditioned on text and speaker, making it applicable to tasks such as text-to-speech and voice conversion.

Matt Vitelli on music generation from MP3s (source).

Soundtracking audio from video.

Alex Graves on RNN predictive synthesis.

Parag Mital on RNN style transfer.

Andy Sarroff, Musical Audio Synthesis Using Autoencoding Neural Nets. (code)

PixelRNN turns out to be good at music. Dadabots have successfully weaponised SampleRNN, and it’s cute.

Jlin and Holly Herndon show off some artistic use of messed-up neural nets.

Hung-yi Lee and Yu Tsao, Generative Adversarial nets for DSP.


What is Loris?

