See also musical corpora for specialised music data sets.
Generic tools for construction thereof
The Engauge Digitizer tool accepts image files (like PNG, JPEG and TIFF) containing graphs, and recovers the data points from those graphs. The resulting data points are usually used as input to other software applications. Conceptually, Engauge Digitizer is the opposite of a graphing tool that converts data points to graphs. [..] an image file is imported, digitized within Engauge, and exported as a table of numeric data to a text file.
(They mean graph in the sense of plot, not in the sense of network.)
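The core step in digitizing a plot is axis calibration: mapping pixel positions back to data values. Here is a minimal sketch of that idea for linear axes. It is not Engauge's actual code, and the function name `make_axis_map`, the calibration pixels and the clicked points are all hypothetical values chosen for illustration.

```python
def make_axis_map(pixel_a, value_a, pixel_b, value_b):
    """Return a function mapping a pixel coordinate to a data value.

    Assumes a linear axis, calibrated from two reference positions
    whose data values are known (e.g. two labelled tick marks).
    """
    if pixel_a == pixel_b:
        raise ValueError("calibration pixels must differ")
    scale = (value_b - value_a) / (pixel_b - pixel_a)

    def to_value(pixel):
        return value_a + (pixel - pixel_a) * scale

    return to_value


# Hypothetical calibration: x tick "0" at column 100, x tick "10" at column 500.
x_map = make_axis_map(100, 0.0, 500, 10.0)
# Image rows grow downward, so y tick "0" sits at row 400 and "70" at row 50.
y_map = make_axis_map(400, 0.0, 50, 70.0)

# Hypothetical pixel positions of three points picked off the curve.
clicked = [(100, 400), (300, 225), (500, 50)]
points = [(x_map(px), y_map(py)) for px, py in clicked]

# Export as a table of numeric data, one "x,y" row per point.
rows = [f"{x:.3f},{y:.3f}" for x, y in points]
```

Logarithmic axes would need the same interpolation applied to the logarithm of the values; that case is left out here.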
Datasets about Australia
See Australia in data.
Miscellaneous data sets
Rdatasets collates the most popular R datasets.
Zenodo is similar. Backed by CERN and running on their infrastructure, it hosts many published scientific data sets.
Machine learning cult phenomenon Kaggle now does collaborative data set cleaning and publishing: kaggle data sets, such as NOAA weather. Every time you say, about this data set, “this really puts the ‘cloud’ in ‘cloud computing’” a meteorologist comes over to your house and slaps you.
IEEE Dataport is free for IEEE members and happily hosts 2TB datasets. It gives you a DOI and integrates with many IEEE publications, plus it allows convenient access from the Amazon cloud via AWS, which might be where your data is anyway. However, they charge USD 2000 for an open access version, and otherwise only other IEEE Dataport users can get at your data. I know this is not an unusual way for access to journal articles to work, but for data sets it feels like a ham-fisted way of enforcing scarcity. Not to undercut my own professional society here, but if you can do without a DOI, I will happily upload your data to AWS for you for, say, USD 1500, which will pay for 2 very lucrative hours of my time.
Nuit Blanche’s listing of data sets is handy if you want some good inverse-problem signal processing challenges.
The Social Media Research Toolkit is a list of 50+ social media research tools curated by researchers at the Social Media Lab at Ted Rogers School of Management, Ryerson University.
So not necessarily data, but the software to get it.
The Seshat Global History Databank brings together the most current and comprehensive body of knowledge about human history in one place. Our unique Databank systematically collects what is currently known about the social and political organization of human societies and how civilizations have evolved over time.
Quandl has some databases, mostly financial and economic.
Torrent technology allows a group of editors to “seed” their own peer-reviewed published articles with just a torrent client. Each editor can have part or all of the papers stored on their desktops and have a torrent tracker to coordinate the delivery of papers without a dedicated server.
One aim of this site is to create the infrastructure to allow open access journals to operate at low cost. By facilitating file transfers, the journal can focus on its core mission of providing world class research. After peer review the paper can be indexed on this site and disseminated throughout our system.
Large dataset delivery can be supported by researchers in the field that have the dataset on their machine. A popular large dataset doesn’t need to be housed centrally. Researchers can have part of the dataset they are working on and they can help host it together.
Libraries can host this data to host papers from their own campus without becoming the only source of the data. So even if a library’s system is broken other universities can participate in getting that data into the hands of researchers.
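What makes it safe to fetch parts of a dataset from whichever peer holds them is that the torrent metadata carries a hash for every fixed-size piece. Here is a minimal sketch of that mechanism, assuming the original BitTorrent scheme of SHA-1 piece hashes. The function names, the toy data and the tiny piece length are hypothetical, and this is not the code of any particular tracker or client.

```python
import hashlib


def piece_hashes(data: bytes, piece_length: int) -> list[bytes]:
    """Split data into fixed-size pieces and return the SHA-1 digest of each."""
    if piece_length <= 0:
        raise ValueError("piece_length must be positive")
    return [
        hashlib.sha1(data[start:start + piece_length]).digest()
        for start in range(0, len(data), piece_length)
    ]


def verify_piece(piece: bytes, index: int, hashes: list[bytes]) -> bool:
    """Check a piece received from some peer against the published hash list."""
    return 0 <= index < len(hashes) and hashlib.sha1(piece).digest() == hashes[index]


# Toy "dataset" of 1000 bytes; real torrents use pieces of 256 KiB or more.
dataset = bytes(i % 251 for i in range(1000))
hashes = piece_hashes(dataset, piece_length=256)

# A peer holding only the second piece can still serve it verifiably.
second_piece = dataset[256:512]
ok = verify_piece(second_piece, 1, hashes)
# The same bytes offered as a different piece fail the check.
mismatch = verify_piece(second_piece, 0, hashes)
```

Because each piece is checked independently, a downloader can assemble the dataset from many partial hosts without trusting any one of them.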
Prodigy is an interactive dataset annotator for training classifiers.
Collected open data sets at cloud providers
Various providers host data sets conveniently close to their cloud platforms.
3d sensor data
Stashed at 3D data.