Audio/music corpora
Smells like Team Audioset
August 8, 2014 — December 6, 2023
Datasets of sound tend to be called audio corpora for reasons of tradition. I’ve listed some audio corpora that are useful to me, which means
- I need access to raw data (MIDI or audio), not someone else’s features
- I’d like some annotations that I can try to predict, such as genre (whatever that is?), or something less suggestive such as tempo.
- I don’t care about fancier types of metadata for now (social tagging or whatever); that’s a different level of abstraction to my current projects.
1 General issues
In these datasets, labels are noisy, as exemplified by Audioset. Weakly supervised techniques are worth considering.
2 Preprocessing
First, be aware that many of the machine listening tools are also useful as preprocessors.
mirdata (Rachel M. Bittner et al. 2019) is a Swiss-army-knife API to Music Information Retrieval datasets (i.e. songs with some metadata).
This library provides tools for working with common MIR datasets, including tools for:
- downloading datasets to a common location and format
- validating that the files for a dataset are all present
- loading annotation files to a common format, consistent with the format required by mir_eval
- parsing track level metadata for detailed evaluations
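In practice that looks something like the following sketch ("orchset" is used purely as an example loader name; substitute anything that `mirdata.list_datasets()` reports):

```python
import mirdata

print(mirdata.list_datasets())     # names of all supported loaders

# "orchset" is just an example identifier; any name from list_datasets() works
orchset = mirdata.initialize("orchset")
orchset.download()                 # fetch audio and annotations where licences permit
orchset.validate()                 # check that every expected file is present and uncorrupted

tracks = orchset.load_tracks()     # dict of track_id -> Track
track = orchset.choice_track()     # a random track, handy for poking around
print(track.track_id)              # annotation attributes vary per dataset
```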
percussive annotator is a simple web-app UI for annotating audio, made for Ramires et al. (2020).
More recently, Hugging Face has moved in here and probably obsoleted much prior work: A Complete Guide to Audio Datasets
🤗 Datasets is an open-source library for downloading and preparing datasets from all domains. Its minimalistic API allows users to download and prepare datasets in just one line of Python code, with a suite of functions that enable efficient pre-processing. The number of datasets available is unparalleled, with all the most popular machine learning datasets available to download.
Not only this, but 🤗 Datasets comes prepared with multiple audio-specific features that make working with audio datasets easy for researchers and practitioners alike
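A minimal sketch of what that looks like, using the small MINDS-14 speech dataset from the Hugging Face audio tutorials as the example id (substitute any audio dataset on the Hub):

```python
from datasets import Audio, load_dataset

# MINDS-14 is the small speech dataset used in the Hugging Face audio tutorials;
# substitute any audio dataset id from the Hub
minds = load_dataset("PolyAI/minds14", name="en-US", split="train")

# decode lazily and resample on the fly to 16 kHz, the rate most speech models expect
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))

sample = minds[0]
print(sample["audio"]["array"].shape, sample["audio"]["sampling_rate"])
```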
3 Other indices
- Alexander Lerch’s dataset page.
- ISMIR’s list
- Colin Raffel’s list is nifty because it includes a table breaking down datasets by task
4 General audio
4.1 LAION-Audio-630K
This repository is created for the Audio Dataset Project, an audio dataset collection initiative announced by LAION. These datasets, each containing enormous amounts of audio-text pairs, will eventually be processed and used for training the CLAP (Contrastive Language-Audio Pretraining) model and other models.
Here is an explicative video introducing you to the project.[…]
Since Audio Dataset is an open-source project belonging to LAION, we have a team of open-source contributors. They are, along with LAION members, a three-person research group including Yusong Wu, Ke Chen and Tianyu Zhang from Mila and UCSD, intern Marianna Nezhurina, previous intern Yuchen Hui, as well as many enthusiastic contributors from all over the world[…]
- We keep collecting audio datasets and here is the LIST of all that we found.
- We define the standard and method to store and process all audio datasets, which is essential for unifying the final format of the datasets to simplify model training. The final dataset format we use for now is webdataset. The concrete data processing pipeline is specified here
- You may also find the processing code for each processed audio dataset.
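The webdataset format mentioned above packs each example as an audio file plus a JSON metadata file inside tar shards. A minimal reading sketch, assuming flac/json key pairs and a hypothetical shard URL (the real LAION shard names and keys vary per sub-dataset):

```python
import io
import json

import soundfile as sf
import webdataset as wds

# hypothetical shard URL pattern; real LAION-Audio-630K shard names differ per sub-dataset
url = "https://example.org/audio-dataset/shard-{000000..000009}.tar"

# without .decode(), webdataset yields raw bytes keyed by file extension
dataset = wds.WebDataset(url).to_tuple("flac", "json")

for audio_bytes, meta_bytes in dataset:
    audio, sr = sf.read(io.BytesIO(audio_bytes))   # decode the flac bytes ourselves
    meta = json.loads(meta_bytes)
    print(audio.shape, sr, meta.get("text"))
    break
```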
4.2 Audioset
Audioset (Gemmeke et al. 2017):
Audio event recognition, the human-like ability to identify and relate sounds from audio, is a nascent problem in machine perception. Comparable problems such as object detection in images have reaped enormous benefits from comprehensive datasets — principally ImageNet. This paper describes the creation of Audio Set, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research. Using a carefully structured hierarchical ontology of 635 audio classes guided by the literature and manual curation, we collect data from human labelers to probe the presence of specific audio classes in 10 second segments of YouTube videos. Segments are proposed for labeling using searches based on metadata, context (e.g., links), and content analysis. The result is a dataset of unprecedented breadth and size that will, we hope, substantially stimulate the development of high-performance audio event recognizers.
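Note that Audioset ships as CSV files of YouTube segment IDs with start/end times and label codes (plus, separately, precomputed embeddings); fetching the actual audio from YouTube is your problem. A minimal sketch of parsing the segments CSV with pandas, assuming the published layout (a few `#` comment lines, then YTID, start_seconds, end_seconds, positive_labels):

```python
import pandas as pd

# assumes the published balanced_train_segments.csv layout:
# three "#" comment lines, then YTID, start_seconds, end_seconds, positive_labels
segments = pd.read_csv(
    "balanced_train_segments.csv",
    skiprows=3,
    header=None,
    names=["ytid", "start_seconds", "end_seconds", "positive_labels"],
    quotechar='"',
    skipinitialspace=True,
)
print(len(segments))
print(segments.head())   # positive_labels holds comma-separated ontology ids like /m/09x0r
```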
5 FluCoMa
Fluid Corpus Manipulation / flucoma/flucoma-core:
The Fluid Corpus Manipulation project (FluCoMa) instigates new musical ways of exploiting ever-growing banks of sound and gestures within the digital composition process, by bringing breakthroughs of signal decomposition DSP and machine learning to the toolset of techno-fluent computer composers, creative coders and digital artists.
These potent algorithms are currently partially available in closed bespoke software, or in laboratories, but not at a suitable level of modularity within the main coding environments used by the creative researchers, namely Max, Pd and SuperCollider, to allow groundbreaking sonic research into a rich unexploited area: the manipulation of large sound corpora. Indeed, with access to, genesis of, and storage of large sound banks now commonplace, novel ways of abstracting and manipulating them are needed to mine their inherent potential.
Sounds a little quirky. If one were to use this, it would probably be via the CLI.
5.1 YouTube-8M
YouTube-8M is a large-scale labeled video dataset that consists of millions of YouTube video IDs and associated labels from a diverse vocabulary of 4700+ visual entities. It comes with precomputed state-of-the-art audio-visual features from billions of frames and audio segments, designed to fit on a single hard disk. This makes it possible to get started on this dataset by training a baseline video model in less than a day on a single machine! At the same time, the dataset’s scale and diversity can enable deep exploration of complex audio-visual models that can take weeks to train even in a distributed fashion.
Our goal is to accelerate research on large-scale video understanding, representation learning, noisy data modeling, transfer learning, and domain adaptation approaches for video. More details about the dataset and initial experiments can be found in our technical report. Some statistics from the latest version of the dataset are included below.
5.2 FSDnoisy18k
FSDnoisy18k (Fonseca et al. 2019):
FSDnoisy18k is an audio dataset collected with the aim of fostering the investigation of label noise in sound event classification. It contains 42.5 hours of audio across 20 sound classes, including a small amount of manually-labeled data and a larger quantity of real-world noisy data.
The source of audio content is Freesound—a sound sharing site created and maintained by the Music Technology Group, hosting over 400,000 clips uploaded by its community of users, who additionally provide some basic metadata (e.g., tags and title). More information about the technical aspects of Freesound can be found in [14, 15].
The 20 classes of FSDnoisy18k are drawn from the AudioSet Ontology and are selected based on data availability as well as on their suitability to allow the study of label noise. They are listed, along with the number of audio clips and minutes of audio per class and per data subset, in a table on the dataset page; for instance, the Acoustic guitar class has 102 audio clips in the clean subset of the train set, with a total duration of roughly 11 minutes.
We defined a clean portion of the dataset consisting of correct and complete labels. The remaining portion is referred to as the noisy portion. Each clip in the dataset has a single ground truth label (singly-labeled data).
The clean portion of the data consists of audio clips whose labels are rated as present in the clip and predominant (almost all with full inter-annotator agreement), meaning that the label is correct and, in most cases, there is no additional acoustic material other than the labeled class. A few clips may contain some additional sound events, but they occur in the background and do not belong to any of the 20 target classes. This is more common for some classes that rarely occur alone, e.g., “Fire”, “Glass”, “Wind” or “Walk, footsteps”.
The noisy portion of the data consists of audio clips that received no human validation. In this case, they are categorized on the basis of the user-provided tags in Freesound. Hence, the noisy portion features a certain amount of label noise.
5.3 FUSS
We are happy to announce the release of FUSS: the Free Universal Sound Separation dataset.
Audio recordings often contain a mixture of different sound sources. Universal sound separation is the ability to separate such a mixture into its component sounds, regardless of the types of sound present. Previously, sound separation work has focused on separating mixtures of a small number of sound types, such as “speech” versus “nonspeech”, or different instances of the same type of sound, such as speaker #1 versus speaker #2. Often in such work, the number of sounds in a mixture is also assumed to be known a priori. The FUSS dataset shifts focus to the more general problem of separating a variable number of arbitrary sounds from one another.
One major hurdle to training models in this domain is that even if you have high-quality recordings of sound mixtures, you can’t easily annotate these recordings with ground truth. High-quality simulation is one approach to overcome this limitation. To achieve good results, you need a diverse set of sounds, a realistic room simulator, and code to mix these elements together for realistic, multi-source, multi-class audio with ground truth. With FUSS, we are releasing all three of these.
6 Whole songs
tl;dr Use the Free Music Archive, which is faster, simpler and higher quality. YouTube Music Videos and Magnatagatune have more noise and less convenience. Which is not to say that FMA is fast, simple and of high quality. It’s still chaos because the creators are all busy doing their own dissertations, I guess, but it is better than the alternatives.
See also commercial song libraries, which sometimes have enough metadata to do even supervised learning.
6.1 Free music archive
FMA (Defferrard et al. 2017) is an annotated, ML-ready dataset constructed from the Free Music Archive¹:
We introduce the Free Music Archive (FMA), an open and easily accessible dataset suitable for evaluating several tasks in MIR, a field concerned with browsing, searching, and organizing large music collections. The community’s growing interest in feature and end-to-end learning is however restrained by the limited availability of large audio datasets. The FMA aims to overcome this hurdle by providing 917 GiB and 343 days of Creative Commons-licensed audio from 106,574 tracks from 16,341 artists and 14,854 albums, arranged in a hierarchical taxonomy of 161 genres. It provides full-length and high-quality audio, pre-computed features, together with track- and user-level metadata, tags, and free-form text such as biographies. We here describe the dataset and how it was created, propose a train/validation/test split and three subsets, discuss some suitable MIR tasks, and evaluate some baselines for genre recognition.
Note that at the time of writing, the default version of the code did not use the same format as the default version of the data; I needed to check out a particular, elderly tag of the code, rc1. I suppose you could rebuild the dataset index from scratch yourself, but that would need a lot of CPU time.
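For orientation, here is roughly how the metadata loads once fma_metadata.zip is unpacked; a minimal sketch, assuming the current layout in which tracks.csv carries a two-row column header:

```python
import pandas as pd

# assumes the fma_metadata.zip layout: tracks.csv has a two-level column header
tracks = pd.read_csv("fma_metadata/tracks.csv", index_col=0, header=[0, 1])
genres = pd.read_csv("fma_metadata/genres.csv", index_col=0)

# e.g. restrict to the "small" subset and look at its top-level genre labels
small = tracks[tracks[("set", "subset")] == "small"]
print(small[("track", "genre_top")].value_counts().head())
```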
6.2 YouTube Music Video 8M
The musical counterpart to the above non-musical one, created by Keunwoo Choi. YouTube Music Video 8M:
[…] there are loads of music videos online […] unlike the case where you’re looking for a dataset that provides music signal, you can just download music content. For free. No API blocking. No copyright law bans you from doing so (redistribution is restricted though). Not just a 30s preview but the damn whole song. Just because it’s not mp3 but mp4.
As a MIR researcher I found it annoying but a blessing. {music} is banned but {music video} is not! […]
I beta-released the YouTube Music Video 5M dataset. The readme has pretty much everything and will be kept up-to-date. In short, it’s 5M+ YouTube music video URLs that are categorised by Spotify artist IDs, which are sorted by some sort of artist popularity.
Unfortunately I can’t redistribute anything further that is crawled from either YouTube (e.g., music video titles) or Spotify (e.g., artist genre labels) but you can get them by yourself.
The Spotify metadata is pretty good, and the YouTube audio quality is OK, so this is a handy dataset. But much assembly is required, and much bandwidth.
6.3 Dunya project
Universitat Pompeu Fabra is trying to collect large and minutely analyzed sets of data from several distinct, not-necessarily-Central-European musical traditions, and has comprehensive software tools too:
See the Dunya project site.
6.4 The ballroom set
The ballroom music data set (Gouyon et al. 2006):
In this document we report on the tempo induction contest held as part of the ISMIR 2004 Audio Description Contests, organized at the University Pompeu Fabra in Barcelona in September 2004 and won by Anssi Klapuri from Tampere University. […]
BallroomDancers.com gives many informations [sic] on ballroom dancing (online lessons, etc). Some characteristic excerpts of many dance styles are provided in real audio format. Their tempi are also available.
Total number of instances: 698
Duration: ~30 s
Total duration: ~20940 s
Genres
- Cha Cha, 111
- Jive, 60
- Quickstep, 82
- Rumba, 98
- Samba, 86
- Tango, 86
- Viennese Waltz, 65
- Slow Waltz, 110
6.5 MusicNet
MusicNet is a collection of 330 freely-licensed classical music recordings, together with over 1 million annotated labels indicating the precise time of each note in every recording, the instrument that plays each note, and the note’s position in the metrical structure of the composition. The labels are acquired from musical scores aligned to recordings by dynamic time warping. The labels are verified by trained musicians; we estimate a labeling error rate of 4%. We offer the MusicNet labels to the machine learning and music communities as a resource for training models and a common benchmark for comparing results.
The labels are automatically generated (inferred from the raw samples by DSP-based score alignment), which gives a larger dataset but more errors than hand-labelled data.
6.6 Magnatagatune
Magnatagatune is a data set of pop songs with substantial audio and metadata, good for classification tasks. Announcement. (Data)
This dataset consists of ~25000 29s long music clips, each of them annotated with a combination of 188 tags. The annotations have been collected through Edith’s “TagATune” game. The clips are excerpts of songs published by Magnatune.com
There is a list of articles using this data set.
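For orientation, the tags arrive as one big tab-separated file; a minimal reading sketch, assuming the annotations_final.csv distributed on the MagnaTagATune page (one row per clip, binary tag columns, plus clip_id and mp3_path):

```python
import pandas as pd

# assumes the annotations_final.csv from the MagnaTagATune download page:
# tab-separated, one row per clip, binary tag columns plus clip_id and mp3_path
anno = pd.read_csv("annotations_final.csv", sep="\t")

tag_cols = [c for c in anno.columns if c not in ("clip_id", "mp3_path")]
print(anno[tag_cols].sum().sort_values(ascending=False).head(10))  # most common tags
```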
7 Music — multitrack/stems
🏗 See if Spotify made their a cappella/whole-song database available.
Music stem sets of suspicious provenance were available via Reddit on r/SongStems, but it turned out, unsurprisingly, that they were banned for copyright violations.
They have been replaced by a torrent?
7.1 SiSEC signal separation
A.k.a. MusDB, a.k.a. MUSDB18.
SiSEC SigSep ran a multitrack musical source separation contest:
- Professionally-produced music recordings
- New dataset and Python tools to handle it
- 100 training songs, 50 test songs, all stereo @ 44.1 kHz produced recordings
- All songs include drums, bass, other and vocals stems
- Songs are encoded in the Native Instruments stems format, with a tool to convert them back and forth to wav and to load them directly in Python
- Automatic download of data
- Python code to analyze results and produce plots for your own paper
The actual dataset is AFAICT called musdb.
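The musdb Python package wraps the download and stem decoding; a minimal sketch (download=True fetches the free 7-second preview clips, so point it at a full MUSDB18 copy for real work):

```python
import musdb

# download=True fetches the free 7-second preview clips; pass root="..." to use
# a locally downloaded full MUSDB18 instead
mus = musdb.DB(download=True, subsets="train")

track = mus[0]
mixture = track.audio                     # stereo mixture, shape (n_samples, 2)
vocals = track.targets["vocals"].audio    # an individual stem
print(track.name, track.rate, mixture.shape, vocals.shape)
```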
7.2 MedleyDB
MedleyDB (Rachel M. Bittner et al. 2014):
MedleyDB, a dataset of annotated, royalty-free multitrack recordings. MedleyDB was curated primarily to support research on melody extraction, addressing important shortcomings of existing collections. For each song we provide melody f0 annotations as well as instrument activations for evaluating automatic instrument recognition. The dataset is also useful for research on tasks that require access to the individual tracks of a song such as source separation and automatic mixing…
Dataset Snapshot
Size: 122 Multitracks (mix + processed stems + raw audio + metadata)
Annotations
- Melody f0 (108 tracks)
- Instrument Activations (122 tracks)
- Genre (122 tracks)
Audio Format: WAV (44.1 kHz, 16 bit)
Genres:
- Singer/Songwriter
- Classical
- Rock
- World/Folk
- Fusion
- Jazz
- Pop
- Musical Theatre
- Rap
Track Length
- 105 full length tracks (~3 to 5 minutes long)
- 17 excerpts (7:17 hours total)
Instrumentation
- 52 instrumental tracks
- 70 tracks containing vocals
7.3 Bonus: MDB Drums
MDB Drums (Southall et al. 2017):
This repository contains the MDB Drums dataset which consists of drum annotations and audio files for 23 tracks from the MedleyDB dataset.
Two annotation files are provided for each track. The first annotation file, termed class, groups the 7994 onsets into 6 classes based on drum instrument. The second annotation file, termed subclass, groups the onsets into 21 classes based on playing technique and instrument.
8 Musical — just metadata
tl;dr If I am using these datasets, I am using someone else’s outdated metadata. It will probably be better to benchmark against published research using these databases than to reinvent my own wheel. Or maybe I would use these to augment a dataset I already have? In that case I’d want to be sure I could match the datasets together with high certainty, and rumour holds that stitching some of these datasets together is challenging.
8.1 Million songs
You have to mention this one because everyone does, but it’s useless for me.
The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks.
Its purposes are:
- To encourage research on algorithms that scale to commercial sizes
- To provide a reference dataset for evaluating research
- As a shortcut alternative to creating a large dataset with APIs (e.g. The Echo Nest’s)
- To help new researchers get started in the MIR field
The core of the dataset is the feature analysis and metadata for one million songs, provided by The Echo Nest. The dataset does not include any audio, only the derived features. […]
The Million Song Dataset is also a cluster of complementary datasets contributed by the community:
- SecondHandSongs dataset -> cover songs
- musiXmatch dataset -> lyrics
- Last.fm dataset -> song-level tags and similarity
- Taste Profile subset -> user data
- thisismyjam-to-MSD mapping -> more user data
- tagtraum genre annotations -> genre labels
- Top MAGD dataset -> more genre labels
The problem is that this is a time-wasting circuitous process to get access to the raw data, and someone else’s suboptimal features are useless to me.
8.1.1 Mumu
MuMu is a Multimodal Music dataset with multi-label genre annotations that combines information from the Amazon Reviews dataset and the Million Song Dataset (MSD). The former contains millions of album customer reviews and album metadata gathered from Amazon.com. The latter is a collection of metadata and precomputed audio features for a million songs.
To map the information from both datasets we use MusicBrainz. This process yields a final set of 147,295 songs, which belong to 31,471 albums. For the mapped set of albums, there are 447,583 customer reviews from the Amazon dataset. The dataset has been used for multi-label music genre classification experiments in the related publication. In addition to genre annotations, it provides further information about each album, such as average rating, selling rank, similar products, and cover image URL. For every text review it also provides the helpfulness score, average rating, and a summary of the review.
The mapping between the three datasets (Amazon, MusicBrainz and MSD), genre annotations, metadata, data splits, text reviews and links to images are available here. Images and audio files cannot be released due to copyright issues.
- MuMu dataset (mapping, metadata, annotations and text reviews)
- Data splits and multimodal embeddings for ISMIR multi-label classification experiments
9 AcousticBrainz Genre
The AcousticBrainz Genre Dataset (Bogdanov et al. 2019):
The AcousticBrainz Genre Dataset is a large-scale collection of hierarchical multi-label genre annotations from different metadata sources. It allows researchers to explore how the same music pieces are annotated differently by different communities following their own genre taxonomies, and how this could be addressed by genre recognition systems. With this dataset, we hope to contribute to developments in content-based music genre recognition as well as cross-disciplinary studies on genre metadata analysis.
Genre labels for the dataset are sourced from both expert annotations and crowds, permitting comparisons between strict hierarchies and folksonomies. Music features are available via the AcousticBrainz database.
HarmonixSet (Nieto et al. 2019):
Beats, downbeats, and functional structural annotations for 912 Pop tracks.
It includes YouTube URLs from which the audio can be downloaded.
10 Music — individual notes and voices
NSynth and Freesound 4 seconds (fs4s) are both good. Get them for yourself.
10.1 freesound4seconds
Kyle McDonald, Freesound 4 seconds:
A mirror of all 126,900 sounds on Freesound less than 4 seconds long, as of April 4, 2017. Metadata for all sounds is stored in the json.zip files, and the high quality mp3s are stored in the mp3.zip files.
10.2 nsynth
NSynth is an audio dataset containing 306,043 musical notes, each with a unique pitch, timbre, and envelope. For 1,006 instruments from commercial sample libraries, we generated four second, monophonic 16kHz audio snippets, referred to as notes, by ranging over every pitch of a standard MIDI piano (21-108) as well as five different velocities (25, 50, 75, 100, 127). The note was held for the first three seconds and allowed to decay for the final second.
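A minimal reading sketch, assuming the json/wav distribution (the nsynth-*.jsonwav archives), which unpacks to an examples.json metadata file plus an audio/ folder of four-second wavs named by note id:

```python
import json

import soundfile as sf

# assumes the json/wav distribution: examples.json plus audio/<note_id>.wav files
with open("nsynth-test/examples.json") as f:
    examples = json.load(f)

note_id, meta = next(iter(examples.items()))
audio, sr = sf.read(f"nsynth-test/audio/{note_id}.wav")
print(note_id, meta["pitch"], meta["velocity"], audio.shape, sr)
```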
11 Percussion
Synthesize your own! (Campos et al. 2018; Vogl, Widmer, and Knees 2018)
11.1 SMT-drums
Jakob Abeßer’s SMT-drums
The IDMT-SMT-Drums database is a medium-sized database for automatic drum transcription and source separation.
The dataset consists of 608 WAV files (44.1 kHz, Mono, 16bit). The approximate duration is 2:10 hours.
There are 104 polyphonic drum set recordings (drum loops) containing only the drum instruments kick drum, snare drum and hi-hat. For each drum loop, there are 3 training files for the involved instruments, yielding 312 training files for drum transcription purposes. The recordings are from three different sources:
- Real-world, acoustic drum sets (RealDrum)
- Drum sample libraries (WaveDrum)
- Drum synthesizers (TechnoDrum)
For each drum loop, the onsets of kick drum, snare drum and hi-hat have been manually annotated. They are provided as XML and SVL files that can be assigned to the corresponding audio recording by their filename. Appropriate annotation file parsers are provided as MATLAB functions together with an example script showing how to import the complete dataset.
11.2 Freesound One-Shot Percussive Sounds
Freesound One-Shot Percussive Sounds (António Ramires et al. 2020), developed for Ramires et al. (2020), is a dataset of labelled one-shot percussive sounds from Freesound.
12 Other well-known science-y music datasets
- The classics: USPOP, CAL500, CAL10K etc.
- RWC (cross-checks MIDI against audio)
- The 78 project is the Internet Archive’s collection of 78 rpm shellac records
- Piano-midi.de: classical piano pieces
- Nottingham: over 1000 folk tunes
- MuseData: an electronic library of classical music scores
- JSB Chorales: a set of four-part harmonized chorales
Freesound doesn’t quite fit in with the rest, but it’s worth knowing anyway: an incredible database of raw samples for analysis, annotated with various Essentia descriptors (i.e. hand-crafted features) plus user tags, descriptions and general good times. It deserves a whole entry of its own, if perhaps under “acoustic” rather than “musical” corpora.
13 Music — MIDI
MIDI! A symbolic music representation! Easy! Not that flexible! But well crowd-sourced.
Colin Raffel’s Lakh MIDI dataset:
The Lakh MIDI dataset is a collection of 176,581 unique MIDI files, 45,129 of which have been matched and aligned to entries in the Million Song Dataset. Its goal is to facilitate large-scale music information retrieval, both symbolic (using the MIDI files alone) and audio content-based (using information extracted from the MIDI files as annotations for the matched audio files).
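Colin Raffel’s companion library pretty_midi is the natural way to poke at these files; a minimal sketch with a placeholder path:

```python
import pretty_midi

# placeholder path: point it at any .mid file from the Lakh archive
pm = pretty_midi.PrettyMIDI("path/to/some_lakh_file.mid")

print(pm.estimate_tempo(), pm.get_end_time())
for inst in pm.instruments:
    print(inst.program, inst.is_drum, len(inst.notes))

piano_roll = pm.get_piano_roll(fs=100)   # (128, n_frames) matrix, handy as a model input
print(piano_roll.shape)
```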
Slightly mysterious: the Musical AI MIDI DATAset.
If you have MIDI but would prefer audio, perhaps you could render it using MrsWatson or some other audio software library.
From Christian Walder at Data61: SymbolicMusicMidiDataV1.0
Music datasets of suspicious provenance also circulate via Reddit.
14 Voice
Mozilla’s open-source crowd-sourced CommonVoice dataset:
Most of the data used by large companies isn’t available to the majority of people. We think that stifles innovation. So we’ve launched Common Voice, a project to help make voice recognition open and accessible to everyone.
Now you can donate your voice to help us build an open-source voice database that anyone can use to make innovative apps for devices and the web. Read a sentence to help machines learn how real people speak. Check the work of other contributors to improve the quality. It’s that simple!
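Common Voice is also mirrored on the Hugging Face Hub (gated: accept the terms on the dataset page and log in first); a streaming sketch, using an example version id:

```python
from datasets import Audio, load_dataset

# gated dataset: accept the terms on the Hub page and `huggingface-cli login` first;
# the version id here is just an example
cv = load_dataset(
    "mozilla-foundation/common_voice_11_0", "en",
    split="validation",
    streaming=True,      # stream instead of downloading hundreds of gigabytes
)
cv = cv.cast_column("audio", Audio(sampling_rate=16_000))

sample = next(iter(cv))
print(sample["sentence"], sample["audio"]["array"].shape)
```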
A handy dataset of speech on YouTube. It’s not clear where to download it from; the dataset it is based on, AVA, doesn’t have the speech part.
15 Dance
This is something I will never use, and it is only marginally audio, but the AIST Dance DB is so charming that I must list it.
See sample management.
16 References
Footnotes
¹ FMA has closed down; please see the archive.org backup of the audio.