Audio/music corpora

Smells like Team Audioset

Datasets of sound tend to be called audio corpora for reasons of tradition. I’ve listed some audio corpora that are useful to me, which means

  • I need access to raw data (MIDI or audio), not someone else’s features
  • I’d like some annotations that I can try to predict, such as genre (whatever that is?), or something less suggestive such as tempo.
  • I don’t care about fancier types of metadata for now (social tagging or whatever); that’s a different level of abstraction to my current projects.

General issues

In these datasets, labels are noisy, as exemplified by Audioset. Weakly supervised techniques are worth considering.

mirdata (Bittner et al. 2019) is a Swiss-army-knife universal API to Music Information Retrieval data (i.e. songs with some metadata).

This library provides tools for working with common MIR datasets, including tools for:

  • downloading datasets to a common location and format
  • validating that the files for a dataset are all present
  • loading annotation files to a common format, consistent with the format required by mir_eval
  • parsing track level metadata for detailed evaluations
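The "validating that the files are all present" step can be illustrated with a small standard-library sketch. This is not mirdata's own code or API; the function names `md5_of` and `validate`, and the shape of the index (relative path mapped to MD5 checksum), are assumptions made for illustration only.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of(path: Path) -> str:
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate(root: Path, index: dict[str, str]) -> tuple[list[str], list[str]]:
    """Compare files under `root` with an index of relative path -> checksum.

    Returns (missing, corrupted) lists of relative paths.
    """
    missing, corrupted = [], []
    for rel_path, expected in index.items():
        target = root / rel_path
        if not target.is_file():
            missing.append(rel_path)
        elif md5_of(target) != expected:
            corrupted.append(rel_path)
    return missing, corrupted


# Toy demonstration: a temporary directory stands in for a downloaded dataset.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.wav").write_bytes(b"fake audio")
    index = {
        "a.wav": hashlib.md5(b"fake audio").hexdigest(),
        "b.wav": hashlib.md5(b"never downloaded").hexdigest(),
    }
    missing, corrupted = validate(root, index)
    print(missing, corrupted)
```

Keeping the index separate from the audio means the check can be rerun cheaply after a partial or interrupted download.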

percussive annotator is a simple webapp UI for annotating audio. Made for Ramires et al. (2020b).

Other indices

General audio


Audioset (Gemmeke et al. 2017):

Audio event recognition, the human-like ability to identify and relate sounds from audio, is a nascent problem in machine perception. Comparable problems such as object detection in images have reaped enormous benefits from comprehensive datasets – principally ImageNet. This paper describes the creation of Audio Set, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research. Using a carefully structured hierarchical ontology of 632 audio classes guided by the literature and manual curation, we collect data from human labelers to probe the presence of specific audio classes in 10 second segments of YouTube videos. Segments are proposed for labeling using searches based on metadata, context (e.g., links), and content analysis. The result is a dataset of unprecedented breadth and size that will, we hope, substantially stimulate the development of high-performance audio event recognizers.
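Since the dataset is distributed as labelled 10 second segments of YouTube videos rather than as audio, the first practical step is usually parsing a segment list. The sketch below assumes a CSV layout with `#`-prefixed header lines and four columns (video ID, start, end, quoted comma-separated label IDs). The sample row is made up, with a placeholder video ID, and `read_segments` is a hypothetical helper name.

```python
import csv
import io

# Made-up sample in the assumed layout; the video ID is a placeholder.
SAMPLE = '''# Segments csv (illustrative sample, not real data)
# YTID, start_seconds, end_seconds, positive_labels
XXXXXXXXXXX, 30.000, 40.000, "/m/09x0r,/t/dd00088"
'''


def read_segments(handle):
    """Yield (video_id, start, end, labels) from a segments CSV file object."""
    rows = csv.reader(
        (line for line in handle if not line.startswith("#")),
        skipinitialspace=True,  # lets the quoted label field follow ", "
    )
    for row in rows:
        if not row:  # skip blank lines
            continue
        video_id, start, end, labels = row
        yield video_id, float(start), float(end), labels.split(",")


segments = list(read_segments(io.StringIO(SAMPLE)))
for video_id, start, end, labels in segments:
    print(video_id, end - start, labels)
```

Each segment can carry several label IDs, which is one reason to treat the task as multi-label rather than single-class.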



YouTube-8M is a large-scale labeled video dataset that consists of millions of YouTube video IDs and associated labels from a diverse vocabulary of 4700+ visual entities. It comes with precomputed state-of-the-art audio-visual features from billions of frames and audio segments, designed to fit on a single hard disk. This makes it possible to get started on this dataset by training a baseline video model in less than a day on a single machine! At the same time, the dataset’s scale and diversity can enable deep exploration of complex audio-visual models that can take weeks to train even in a distributed fashion.

Our goal is to accelerate research on large-scale video understanding, representation learning, noisy data modeling, transfer learning, and domain adaptation approaches for video. More details about the dataset and initial experiments can be found in our technical report. Some statistics from the latest version of the dataset are included below.


FSDnoisy18k (Fonseca et al. 2019):

FSDnoisy18k is an audio dataset collected with the aim of fostering the investigation of label noise in sound event classification. It contains 42.5 hours of audio across 20 sound classes, including a small amount of manually-labeled data and a larger quantity of real-world noisy data.

The source of audio content is Freesound, a sound sharing site created and maintained by the Music Technology Group hosting over 400,000 clips uploaded by its community of users, who additionally provide some basic metadata (e.g., tags, and title). More information about the technical aspects of Freesound can be found in [14, 15].

The 20 classes of FSDnoisy18k are drawn from the AudioSet Ontology and are selected based on data availability as well as on their suitability to allow the study of label noise. They are listed in the next Table, along with the number of audio clips per class (split in different data subsets that are defined in section FSDnoisy18k basic characteristics). Every numeric entry in the next table reads: number of clips / duration in minutes (rounded). For instance, the Acoustic guitar class has 102 audio clips in the clean subset of the train set, and the total duration of these clips is roughly 11 minutes.

We defined a clean portion of the dataset consisting of correct and complete labels. The remaining portion is referred to as the noisy portion. Each clip in the dataset has a single ground truth label (singly-labeled data).

The clean portion of the data consists of audio clips whose labels are rated as present in the clip and predominant (almost all with full inter-annotator agreement), meaning that the label is correct and, in most cases, there is no additional acoustic material other than the labeled class. A few clips may contain some additional sound events, but they occur in the background and do not belong to any of the 20 target classes. This is more common for some classes that rarely occur alone, e.g., “Fire”, “Glass”, “Wind” or “Walk, footsteps”.

The noisy portion of the data consists of audio clips that received no human validation. In this case, they are categorized on the basis of the user-provided tags in Freesound. Hence, the noisy portion features a certain amount of label noise.


We are happy to announce the release of FUSS: the Free Universal Sound Separation dataset.

Audio recordings often contain a mixture of different sound sources; Universal sound separation is the ability to separate such a mixture into its component sounds, regardless of the types of sound present. Previously, sound separation work has focused on separating mixtures of a small number of sound types, such as “speech” versus “nonspeech”, or different instances of the same type of sound, such as speaker #1 versus speaker #2. Often in such work, the number of sounds in a mixture is also assumed to be known a priori. The FUSS dataset shifts focus to the more general problem of separating a variable number of arbitrary sounds from one another.

One major hurdle to training models in this domain is that even if you have high-quality recordings of sound mixtures, you can’t easily annotate these recordings with ground truth. High-quality simulation is one approach to overcome this limitation. To achieve good results, you need a diverse set of sounds, a realistic room simulator, and code to mix these elements together for realistic, multi-source, multi-class audio with ground truth. With FUSS, we are releasing all three of these.

Musical – complete songs and metadata

tl;dr Use Free Music Archive, which is faster, simpler and higher quality. Youtube Music Videos and Magnatagatune have more noise and less convenience. Which is not to say that FMA is fast, simple and of high quality. It’s still chaos because the creators are all busy doing their own dissertations I guess, but it is better than the alternatives.

Beatstars and other beat markets

Beatstars is a commercial peer-to-peer beatmakers’ market which has lots of tracks without vocals. I’m not sure what analysis can be done upon them without paying the per-track licensing fee, but possibly quite a lot.

Free music archive

FMA (Defferrard et al. 2017) is an annotated, ML-ready dataset constructed from the Free Music Archive:

We introduce the Free Music Archive (FMA), an open and easily accessible dataset suitable for evaluating several tasks in MIR, a field concerned with browsing, searching, and organizing large music collections. The community’s growing interest in feature and end-to-end learning is however restrained by the limited availability of large audio datasets. The FMA aims to overcome this hurdle by providing 917 GiB and 343 days of Creative Commons-licensed audio from 106,574 tracks from 16,341 artists and 14,854 albums, arranged in a hierarchical taxonomy of 161 genres. It provides full-length and high-quality audio, pre-computed features, together with track- and user- level metadata, tags, and free-form text such as biographies. We here describe the dataset and how it was created, propose a train/validation/test split and three subsets, discuss some suitable MIR tasks, and evaluate some baselines for genre recognition.

Oh wait, now the Free Music Archive has closed down. Please see the backup of the audio. The metadata in the dataset might be the best option.

Note that at time of writing, the default code version did not use the same format as the default data version. I needed to check out a particular, elderly version of the code, rc1. I suppose you could rebuild the dataset index from scratch yourself, but this would need much CPU time.

Youtube music video 5m

The musical counterpart to the non-musical YouTube-8M above, created by Keunwoo Choi. YouTube Music Video 5M:

[…] there are loads of music video online […] unlike in the case you’re looking for a dataset that provide music signal, you can just download music contents. For free. No API blocking. No copyright law bans you to do so (redistribution is restricted though). Not just 30s preview but the damn whole song. Just because it’s not mp3 but mp4.

As a MIR researcher I found it annoying but blessing. {music} is banned but {music video} is not! […]

I beta-released YouTube Music Video 5M dataset. The readme has pretty much everything and will be up-to-date. In short, It’s 5M+ youtube music video URLs that are categorised by Spotify artist IDs which are sorted by some sort of artist popularity.

Unfortunately I can’t redistribute anything further that is crawled from either YouTube (e.g., music video titles) or Spotify (e.g., artist genre labels) but you can get them by yourself.

The spotify metadata is pretty good, and the youtube audio quality is OK, so this is a handy dataset. But much assembly required, and much bandwidth.

Source code.

Dunya project

Universitat Pompeu Fabra is trying to collect large and minutely analyzed sets of data from several distinct not-necessarily-central-european traditions, and has comprehensive software tools too.

See the Dunya project site.

The ballroom set

The ballroom music data set (Gouyon et al. 2006):

In this document we report on the tempo induction contest held as part of the ISMIR 2004 Audio Description Contests, organized at the University Pompeu Fabra in Barcelona in September 2004 and won by Anssi Klapuri from Tampere University. […] gives many informations [sic] on ballroom dancing (online lessons, etc.). Some characteristic excerpts of many dance styles are provided in real audio format. Their tempi are also available.

  • Total number of instances: 698

  • Duration: ~30 s

  • Total duration: ~20940 s

  • Genres

    • Cha Cha, 111
    • Jive, 60
    • Quickstep, 82
    • Rumba, 98
    • Samba, 86
    • Tango, 86
    • Viennese Waltz, 65
    • Slow Waltz, 110
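A quick arithmetic check of the figures above, plus the accuracy one would get by always guessing the largest genre, should anyone use the genre labels as a classification target. The counts are copied from the list; the 30 s clip length is the approximate figure quoted.

```python
# Clip counts per genre, copied from the list above.
counts = {
    "Cha Cha": 111,
    "Jive": 60,
    "Quickstep": 82,
    "Rumba": 98,
    "Samba": 86,
    "Tango": 86,
    "Viennese Waltz": 65,
    "Slow Waltz": 110,
}

total_clips = sum(counts.values())
total_seconds = total_clips * 30  # each excerpt is roughly 30 s

# Accuracy of always guessing the largest class.
largest_genre = max(counts, key=counts.get)
majority_baseline = counts[largest_genre] / total_clips

print(total_clips, total_seconds)               # 698 20940
print(largest_genre, round(majority_baseline, 3))
```

The classes are fairly balanced, so the majority-class baseline is low and any genre classifier should be compared against it rather than against chance alone.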



MusicNet is a collection of 330 freely-licensed classical music recordings, together with over 1 million annotated labels indicating the precise time of each note in every recording, the instrument that plays each note, and the note’s position in the metrical structure of the composition. The labels are acquired from musical scores aligned to recordings by dynamic time warping. The labels are verified by trained musicians; we estimate a labeling error rate of 4%. We offer the MusicNet labels to the machine learning and music communities as a resource for training models and a common benchmark for comparing results.
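For readers unfamiliar with dynamic time warping, here is a toy sketch of the idea: find the cheapest monotonic alignment between a short "score" sequence and a longer "performance" sequence. It is not MusicNet's actual pipeline, which aligns audio features rather than bare pitch numbers; `dtw_path` and both input sequences are hypothetical.

```python
def dtw_path(xs, ys):
    """Return (cost, path) aligning two 1-D sequences with absolute-difference cost."""
    n, m = len(xs), len(ys)
    inf = float("inf")
    # acc[i][j] = minimal cost of aligning xs[:i] with ys[:j]
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(xs[i - 1] - ys[j - 1])
            acc[i][j] = step + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    # Backtrack from the end to recover which score event matches which frame.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        )
    return acc[n][m], path[::-1]


score_pitches = [60, 62, 64]          # hypothetical score: C, D, E
audio_pitches = [60, 60, 62, 64, 64]  # hypothetical per-frame pitch track
cost, path = dtw_path(score_pitches, audio_pitches)
print(cost, path)
```

Each pair in `path` says which score note a given frame belongs to, which is how score annotations get transferred onto recording time. Alignment mistakes of this kind are one plausible source of the quoted labelling error rate.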

Data with purely automatically generated labels (they inferred the annotation from the raw samples using DSP, giving larger data sets and more errors than hand-labelled stuff.)


Magnatagatune is a data set of pop songs with substantial audio and metadata, good for classification tasks. Announcement. (Data)

This dataset consists of ~25000 29s long music clips, each of them annotated with a combination of 188 tags. The annotations have been collected through Edith’s “TagATune” game. The clips are excerpts of songs published by

There is a list of articles using this data set.

Music – multitrack/stems

🏗 see if spotify made their acapella/whole song database available.

Music stems sets of suspicious provenance, via Reddit, were on r/SongStems, but it turned out, unsurprisingly, that they were banned for copyright violations.

They have been replaced by a torrent?

Sisec signal separation

A.k.a. MusDB a.k.a. Mus18.

Sisec sigsep did a multitrack musical separation contest.

Professionally-produced music recordings

New dataset and Python tools to handle it:

  • 100 training songs, 50 test songs, all stereo@44.1 kHz produced recordings
  • All songs include drums, bass, other, vocals stems
  • Songs are encoded in the Native Instruments stems format, with a tool to convert them back and forth to wav and to load them directly in Python
  • Automatic download of data
  • Python code to analyze results and produce plots for your own paper

The actual dataset is AFAICT called musdb.


MedleyDB (Bittner et al. 2014):

MedleyDB, a dataset of annotated, royalty-free multitrack recordings. MedleyDB was curated primarily to support research on melody extraction, addressing important shortcomings of existing collections. For each song we provide melody f0 annotations as well as instrument activations for evaluating automatic instrument recognition. The dataset is also useful for research on tasks that require access to the individual tracks of a song such as source separation and automatic mixing…

Dataset Snapshot

  • Size: 122 Multitracks (mix + processed stems + raw audio + metadata)

  • Annotations

    • Melody f0 (108 tracks)
    • Instrument Activations (122 tracks)
    • Genre (122 tracks)
  • Audio Format: WAV (44.1 kHz,16 bit)

  • Genres:

    • Singer/Songwriter
    • Classical
    • Rock
    • World/Folk
    • Fusion
    • Jazz
    • Pop
    • Musical Theatre
    • Rap
  • Track Length

    • 105 full length tracks (~3 to 5 minutes long)
    • 17 excerpts (7:17 hours total)
  • Instrumentation

    • 52 instrumental tracks
    • 70 tracks containing vocals

Bonus: Mdbdrums

Mdbdrums (Southall et al. 2017)

This repository contains the MDB Drums dataset which consists of drum annotations and audio files for 23 tracks from the MedleyDB dataset.

Two annotation files are provided for each track. The first annotation file, termed class, groups the 7994 onsets into 6 classes based on drum instrument. The second annotation file, termed subclass, groups the onsets into 21 classes based on playing technique and instrument.

Musical – just metadata

tl;dr If I am using these datasets, I am using someone else’s outdated metadata. It will probably be better if I just benchmark against published research using these databases than reinventing my own wheel. Or maybe I would use these if I wanted to augment a dataset I already have? I’d want to be sure I could match these datasets together with high certainty in that case. Rumour holds that stitching some of these datasets together is challenging.

Million songs

You have to mention this one because everyone does, but it’s useless for me.

The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks.

Its purposes are:

  • To encourage research on algorithms that scale to commercial sizes

  • To provide a reference dataset for evaluating research

  • As a shortcut alternative to creating a large dataset with APIs (e.g. The Echo Nest’s)

  • To help new researchers get started in the MIR field

The core of the dataset is the feature analysis and metadata for one million songs, provided by The Echo Nest. The dataset does not include any audio, only the derived features. […]

The Million Song Dataset is also a cluster of complementary datasets contributed by the community:

  • SecondHandSongs dataset -> cover songs
  • musiXmatch dataset -> lyrics
  • dataset -> song-level tags and similarity
  • Taste Profile subset -> user data
  • thisismyjam-to-MSD mapping -> more user data
  • tagtraum genre annotations -> genre labels
  • Top MAGD dataset -> more genre labels

The problem here is that this is a time-wasting circuitous process to get access to the raw data, and someone else’s suboptimal features are useless to me.


MuMu (Oramas et al. 2017):

MuMu is a Multimodal Music dataset with multi-label genre annotations that combines information from the Amazon Reviews dataset and the Million Song Dataset (MSD). The former contains millions of album customer reviews and album metadata gathered from The latter is a collection of metadata and precomputed audio features for a million songs.

To map the information from both datasets we use MusicBrainz. This process yields the final set of 147,295 songs, which belong to 31,471 albums. For the mapped set of albums, there are 447,583 customer reviews from the Amazon Dataset. The dataset have been used for multi-label music genre classification experiments in the related publication. In addition to genre annotations, this dataset provides further information about each album, such as genre annotations, average rating, selling rank, similar products, and cover image url. For every text review it also provides helpfulness score of the reviews, average rating, and summary of the review.

The mapping between the three datasets (Amazon, MusicBrainz and MSD), genre annotations, metadata, data splits, text reviews and links to images are available here. Images and audio files can not be released due to copyright issues.

  • MuMu dataset (mapping, metadata, annotations and text reviews)

  • Data splits and multimodal embeddings for ISMIR multi-label classification experiments

AcousticBrainz Genre

The AcousticBrainz Genre Dataset (Bogdanov et al. 2019):

The AcousticBrainz Genre Dataset is a large-scale collection of hierarchical multi-label genre annotations from different metadata sources. It allows researchers to explore how the same music pieces are annotated differently by different communities following their own genre taxonomies, and how this could be addressed by genre recognition systems. With this dataset, we hope to contribute to developments in content-based music genre recognition as well as cross-disciplinary studies on genre metadata analysis.

Genre labels for the dataset are sourced from both expert annotations and crowds, permitting comparisons between strict hierarchies and folksonomies. Music features are available via the AcousticBrainz database.

HarmonixSet (Nieto et al. 2019)

Beats, downbeats, and functional structural annotations for 912 Pop tracks.

Includes YouTube URLs for downloads.

Music – individual notes and voices

Nsynth and fs4s are both good. Get them for yourself.


Kyle McDonald, Freesound 4 seconds:

A mirror of all 126,900 sounds on Freesound less than 4 seconds long, as of April 4, 2017. Metadata for all sounds is stored in the files, and the high quality mp3s are stored in the files.



NSynth is an audio dataset containing 306,043 musical notes, each with a unique pitch, timbre, and envelope. For 1,006 instruments from commercial sample libraries, we generated four second, monophonic 16kHz audio snippets, referred to as notes, by ranging over every pitch of a standard MIDI piano (21-108) as well as five different velocities (25, 50, 75, 100, 127). The note was held for the first three seconds and allowed to decay for the final second.
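Some back-of-envelope arithmetic from the figures quoted above. The reported note count is smaller than the full instrument × pitch × velocity grid, which suggests (my inference, not a claim from the quote) that not every instrument covers every pitch.

```python
instruments = 1006
pitches = range(21, 109)          # MIDI notes 21..108 inclusive
velocities = (25, 50, 75, 100, 127)
sample_rate = 16_000              # Hz
note_seconds = 4
reported_notes = 306_043          # figure quoted above

samples_per_note = sample_rate * note_seconds
full_grid = instruments * len(pitches) * len(velocities)
coverage = reported_notes / full_grid
total_hours = reported_notes * note_seconds / 3600

print(samples_per_note)           # 64000
print(full_grid)                  # 442640
print(round(coverage, 2), round(total_hours))
```

Every note has the same fixed length of 64,000 samples, which makes batching for neural network training straightforward.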


Synthesize your own! (Campos et al. 2018; Vogl, Widmer, and Knees 2018)


Jakob Abeßer’s SMT-drums

The IDMT-SMT-Drums database is a medium-sized database for automatic drum transcription and source separation.

The dataset consists of 608 WAV files (44.1 kHz, Mono, 16bit). The approximate duration is 2:10 hours.

There are 104 polyphonic drum set recordings (drum loops) containing only the drum instruments kick drum, snare drum and hi-hat. For each drum loop, there are 3 training files for the involved instruments, yielding 312 training files for drum transcription purposes. The recordings are from three different sources:

  • Real-world, acoustic drum sets (RealDrum)
  • Drum sample libraries (WaveDrum)
  • Drum synthesizers (TechnoDrum)

For each drum loop, the onsets of kick drum, snare drum and hi-hat have been manually annotated. They are provided as XML and SVL files that can be assigned to the corresponding audio recording by their filename. Appropriate annotation file parsers are provided as MATLAB functions together with an example script showing how to import the complete dataset.

Freesound One-Shot Percussive Sounds

Freesound One-Shot Percussive Sounds (Ramires et al. 2020a), developed for Ramires et al. (2020b), is a dataset of labelled sounds from Freesound.

Other well-known science-y music datasets

Freesound doesn’t quite fit in with the rest, but it’s worth knowing anyway. Incredible database of raw samples for analysis, annotated with various Essentia descriptors, (i.e. hand-crafted features) plus user tags, descriptions and general good times, and deserves a whole entry of its own, if perhaps under “acoustic” rather than “musical” corpora.

Music – MIDI

MIDI! a symbolic music representation! Easy! not that flexible! but well crowd-sourced.

Colin Raffel’s Lakh MIDI dataset:

The Lakh MIDI dataset is a collection of 176,581 unique MIDI files, 45,129 of which have been matched and aligned to entries in the Million Song Dataset. Its goal is to facilitate large-scale music information retrieval, both symbolic (using the MIDI files alone) and audio content-based (using information extracted from the MIDI files as annotations for the matched audio files).

Slightly mysterious, the Musical AI MIDI DATAset

If you have MIDI, but you would prefer to have audio, perhaps you could render it to audio using MrsWatson or some other audio software libraries.
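As a crude stand-in for a proper renderer, here is a standard-library sketch that turns a list of MIDI note numbers into a mono WAV of sine tones. It does not parse MIDI files (a MIDI library would be needed for that) and has nothing to do with MrsWatson; `midi_to_hz` and `render_notes` are hypothetical helper names, and the note length and sample rate are arbitrary choices.

```python
import io
import math
import struct
import wave


def midi_to_hz(note: int) -> float:
    """Equal-tempered frequency of a MIDI note number (A4 = note 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((note - 69) / 12)


def render_notes(notes, seconds=0.25, sample_rate=16_000) -> bytes:
    """Render MIDI note numbers one after another as 16-bit mono sine tones."""
    frames = bytearray()
    samples_per_note = int(seconds * sample_rate)
    for note in notes:
        freq = midi_to_hz(note)
        for n in range(samples_per_note):
            value = 0.5 * math.sin(2 * math.pi * freq * n / sample_rate)
            frames += struct.pack("<h", int(value * 32767))
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
    return buffer.getvalue()


wav_bytes = render_notes([60, 64, 67])  # C major arpeggio
print(midi_to_hz(69), len(wav_bytes))
```

Writing `wav_bytes` to a file gives something playable. A sample-based or soundfont-based synthesiser would give far more realistic timbres than bare sines, which matters if the audio is meant as training data.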

From Christian Walder at Data61: SymbolicMusicMidiDataV1.0

Music data sets of suspicious provenance, via Reddit:


Mozilla’s open-source crowd-sourced CommonVoice dataset:

Most of the data used by large companies isn’t available to the majority of people. We think that stifles innovation. So we’ve launched Common Voice, a project to help make voice recognition open and accessible to everyone.

Now you can donate your voice to help us build an open-source voice database that anyone can use to make innovative apps for devices and the web. Read a sentence to help machines learn how real people speak. Check the work of other contributors to improve the quality. It’s that simple!

A handy data set of speech on YouTube. It’s not clear where to download it from. The dataset it is based on, AVA, doesn’t have the speech part.


This is something I will never use, and it is only marginally audio, but the AIST Dance DB is so charming that I must list it here.

Bertin-Mahieux, Thierry, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. 2011. “The Million Song Dataset.” In 12th International Society for Music Information Retrieval Conference (ISMIR 2011).

Bittner, Rachel M, Magdalena Fuentes, David Rubinstein, Andreas Jansson, Keunwoo Choi, and Thor Kell. 2019. “Mirdata: Software for Reproducible Usage of Datasets.” In International Society for Music Information Retrieval (ISMIR) Conference.

Bittner, Rachel M., Justin Salamon, Mike Tierney, Matthias Mauch, Chris Cannam, and Juan Pablo Bello. 2014. “MedleyDB: A Multitrack Dataset for Annotation-Intensive MIR Research.” In ISMIR, 14:155–60.

Bogdanov, Dmitry, Alastair Porter, Hendrik Schreiber, Julián Urbano, and Sergio Oramas. 2019. “The Acousticbrainz Genre Dataset: Multi-Source, Multi-Level, Multi-Label, and Large-Scale.” In, 8.

Campos, Guilherme, Nuno Fonseca, Anibal Ferreira, and Matthew Davies. 2018. “Increasing Drum Transcription Vocabulary Using Data Synthesis.” In International Conference on Digital Audio Effects (DAFx-18), Aveiro, Portugal, September 4–8, 2018, 8.

Cannam, Chris. 2006. SonicAnnotator.

Defferrard, Michaël, Kirell Benzi, Pierre Vandergheynst, and Xavier Bresson. 2017. “FMA: A Dataset for Music Analysis.” In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR’2017), Suzhou, China.

Fonseca, Eduardo, Manoj Plakal, Daniel P. W. Ellis, Frederic Font, Xavier Favory, and Xavier Serra. 2019. “Learning Sound Event Classifiers from Web Audio with Noisy Labels,” January.

Fonseca, Eduardo, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra. 2017. “Freesound Datasets: A Platform for the Creation of Open Audio Datasets.” In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR’2017), Suzhou, China.

Gemmeke, Jort F., Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. 2017. “Audio Set: An Ontology and Human-Labeled Dataset for Audio Events.” In Proceedings of ICASSP 2017. New Orleans, LA.

Gillet, Olivier, and Gaël Richard. 2006. “ENST-Drums: An Extensive Audio-Visual Database for Drum Signals Processing.” In ISMIR.

Goto, Masataka. 2004. “Development of the RWC Music Database.” In Proceedings of the 18th International Congress on Acoustics (ICA 2004), 553–56.

Gouyon, F., A. Klapuri, S. Dixon, M. Alonso, G. Tzanetakis, C. Uhle, and P. Cano. 2006. “An Experimental Comparison of Audio Tempo Induction Algorithms.” IEEE Transactions on Audio, Speech, and Language Processing 14 (5): 1832–44.

Jehan, Tristan, and David DesRoches. 2011. “The Echonest Analyzer Documentation.” The Echonest.

Law, Edith, Kris West, and Michael I. Mandel. 2009. “Evaluation of Algorithms Using Games: The Case of Music Tagging.” In.

Nieto, Oriol, Matthew McCallum, Matthew E P Davies, Andrew Robertson, Adam Stark, and Eran Egozy. 2019. “The Harmonix Set: Beats, Downbeats, and Functional Segment Annotations of Western Popular Music.” In, 8.

Oramas, Sergio, Oriol Nieto, Francesco Barbieri, and Xavier Serra. 2017. “Multi-Label Music Genre Classification from Audio, Text, and Images Using Deep Features.” In ISMIR.

Rafii, Zafar, Antoine Liutkus, Fabian-Robert Stöter, Stylianos Ioannis Mimilakis, and Rachel Bittner. 2017. “The MUSDB18 Corpus for Music Separation,” December.

Ramires, António, Pritish Chandna, Xavier Favory, Emilia Gómez, and Xavier Serra. 2020a. “Freesound One-Shot Percussive Sounds.” Zenodo.

———. 2020b. “Neural Percussive Synthesis Parameterised by High-Level Timbral Features.” In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 786–90.

Southall, Carl, Chih-Wei Wu, Alexander Lerch, and Jason A. Hockman. 2017. “MDB Drums — an Annotated Subset of MedleyDB for Automatic Drum Transcription.” In Late Breaking Demo (Extended Abstract), Proceedings of the International Society for Music Information Retrieval Conference (ISMIR). Suzhou: International Society for Music Information Retrieval (ISMIR).

Thickstun, John, Zaid Harchaoui, and Sham Kakade. 2017. “Learning Features of Music from Scratch.” In Proceedings of International Conference on Learning Representations (ICLR) 2017.

Tsuchida, Shuhei, Satoru Fukayama, Masahiro Hamasaki, and Masataka Goto. 2019. “AIST Dance Video Database: Multi-Genre, Multi-Dancer, and Multi-Camera Database for Dance Information Processing.” In, 10.

Vogl, Richard, Gerhard Widmer, and Peter Knees. 2018. “Towards Multi-Instrument Drum Transcription.” In.