Machine listening

Statistical models for audio

Machine listening! My preferred term for a diverse field that is also called Music Information Retrieval, speech processing and probably other names.

I’m not going to talk about speech recognition here; that boat is full.

Machine listening: machine learning, from audio. Everything from that Shazam app doohickey, to teaching computers to recognise speech, to doing artsy things with sound. I’m mostly concerned with the third one. Statistics, features, descriptors, metrics, kernels and affinities, and the spaces and topologies they induce, for musical audio, e.g. your MP3 pop song library. This has considerable overlap with musical metrics, but there I start from scores and transcriptions.
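To make “descriptors and the metrics they induce” concrete, here is a toy sketch in plain numpy (none of the libraries below): the spectral centroid as a one-number “brightness” descriptor, and the absolute difference between two centroids as a very blunt distance between sounds.

```python
import numpy as np

def spectral_centroid(y, sr, n_fft=2048):
    """Magnitude-weighted mean frequency of the first n_fft samples:
    a crude 'brightness' descriptor."""
    mag = np.abs(np.fft.rfft(y, n=n_fft))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    return float(np.sum(freqs * mag) / np.sum(mag))

sr = 22050
t = np.arange(sr) / sr
low = np.sin(2 * np.pi * 220 * t)    # A3: dark
high = np.sin(2 * np.pi * 1760 * t)  # A6: bright

# The descriptor induces a (very blunt) metric on sounds:
d = abs(spectral_centroid(high, sr) - spectral_centroid(low, sr))
```

Real feature pipelines stack many such descriptors per frame and use less naive distances, but this is the shape of the thing.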

Polyphony and its problems.

Approximate logarithmic perception and its problems.

Should I create a separate psychoacoustics notebook? Yes.

Should I create a separate features notebook? Yes.

See also musical corpora, musical metrics, synchronisation, sparse basis dictionaries, speech recognition, learning gamelan, analysis/resynthesis, etc.


See auditory features.


Here are some options for doing machine listening.


musicbricks is an umbrella project to unify (post hoc) many of the efforts mentioned individually below, plus a few other new ones.

  • Fraunhofer ML software (C++) is part of this project, including such things as

    • Real-Time Pitch Detection
    • MusicBricksTranscriber
    • Goatify Pdf
    • Time Stretch Pitch Shift Library


LibROSA I have been using a lot recently, and I highly recommend it, especially if your pipeline already includes Python. Sleek minimal design, with a curated set of algorithms (compare and contrast with the chaos of the vamp plugins ecosystem). Python-based, but fast enough because it uses the numpy numerical libraries well. The API design meshes well with scikit-learn, the de facto Python machine-learning standard, and it’s flexible and hackable.

  • see also talkbox for a nice-looking but abandoned (?) alternative, which is nonetheless worth it for Alexander Schindler’s lovely MIR lecture based around it.

  • amen is a remix program built on librosa


SonicAnnotator seems to be about cobbling together vamp plugins for batch analysis. That is more steps than I want in an already clunky workflow for my current projects. It’s also more about RDF ontologies, where I want matrices of floats.


For C++ and Python there is Essentia, as seen in Freesound, which is a strong recommendation IMO. (Watch out: the source download is enormous, just shy of half a gigabyte.) It features Python and vamp integration, and a great many algorithms. I haven’t given it a fair chance because LibROSA has been such a joy to use. However, the intriguing Dunya project is built on it.


echonest is (was?) a proprietary system that was used to generate the Million Song Dataset. It seems to be gradually decaying, and was bought up by Spotify. It has great demos, such as autocanonisation.



aubio

aubio is a tool designed for the extraction of annotations from audio signals. Its features include segmenting a sound file before each of its attacks, performing pitch detection, tapping the beat and producing midi streams from live audio…

aubio currently provides the following features:

  • digital filters
  • phase vocoder
  • onset detection (several methods)
  • pitch tracking (several methods)
  • beat and tempo tracking
  • mel frequency cepstrum coefficients (MFCC)
  • transient / steady-state separation

…aubio is written in C and is known to run on most modern architectures and platforms.

The Python interface has been written in C so that aubio arrays can be viewed directly in Python as NumPy arrays. This makes the aubio module quite efficient, not to say fast.



RP extract


phonological corpus tools

Speech-focussed, phonological corpus tools is another research library for largish corpus analysis, similarity classification, etc. As presaged, I’m not really here for the speech stuff.

Metamorph, smstools

John Glover, SoundCloud staffer, has several analysis libraries, culminating in Metamorph:

a new open source library for performing high-level sound transformations based on a sinusoids plus noise plus transients model. It is written in C++, can be built as both a Python extension module and a Csound opcode, and currently runs on Mac OS X and Linux.

It is designed to work primarily on monophonic, quasi-harmonic sound sources and can be used in a non-real-time context to process pre-recorded sound files or can operate in a real-time (streaming) mode.

See also the related spectral modeling and synthesis package, smstools.

Sinusoidal modelling with simplsound

Simplsound (Glover, Lazzarini, and Timoney 2009) is a Python implementation of sinusoidal modelling.
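The core move in sinusoidal modelling is picking prominent spectral peaks per frame and tracking them over time. Here is a numpy sketch of the peak-picking step only — this illustrates the technique, not simpl’s actual API, and the `spectral_peaks` helper and its threshold are my own invention.

```python
import numpy as np

def spectral_peaks(frame, sr, threshold=0.1):
    """One analysis step of a sinusoidal model: local maxima of the
    windowed magnitude spectrum above a relative threshold."""
    win = np.hanning(len(frame))
    mag = np.abs(np.fft.rfft(frame * win))
    mag /= mag.max()
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    idx = [i for i in range(1, len(mag) - 1)
           if mag[i] > threshold and mag[i] >= mag[i - 1] and mag[i] > mag[i + 1]]
    return [(freqs[i], mag[i]) for i in idx]

# Two partials: 440 Hz and a quieter 880 Hz octave.
sr = 44100
t = np.arange(2048) / sr
frame = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)
peaks = spectral_peaks(frame, sr)
```

A full model would then link peaks across frames into partial tracks and resynthesize sinusoids plus residual noise, which is what simpl and smstools actually do.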


If you use a lot of SuperCollider, you might like SCMIR, a native SuperCollider thingy. It has the virtues that

  • it can run in realtime, which is lovely.

It has the vices that

  • It runs in SuperCollider, which is a backwater language unserviced by modern development infrastructure or decent machine-learning libraries, and

  • its development process is fraught; I can’t even link directly to it because the author doesn’t give it its own anchor tag, let alone a whole web page or source-code repository. The release schedule is opaque and sporadic. Consequently, it is effectively one guy’s pet project rather than an active community endeavour.

    That is to say, this is the Etsy sweater of MIR. If on balance this sounds like a snug deal to you, you can download SCMIR from somewhere or other on Nick Collins’ homepage.