Data sets

Questions for answers looking for questions

2015-06-26 — 2024-11-18

Wherein a wide variety of public data sets are surveyed, and curated benchmark suites such as the Penn Machine Learning Benchmarks, SDMX‑accessible official statistics, and large social media archives are noted.

1 Metadatasets

Penn Machine Learning Benchmarks (Olson et al. 2017)

Background: The selection, development, or comparison of machine learning methods in data mining can be a difficult task based on the target problem and goals of a particular study. Numerous publicly available real-world and simulated benchmark datasets have emerged from different sources, but their organization and adoption as standards have been inconsistent. As such, selecting and curating specific benchmarks remains an unnecessary burden on machine learning practitioners and data scientists.

Results: The present study introduces an accessible, curated, and developing public benchmark resource to facilitate identification of the strengths and weaknesses of different machine learning methodologies. We compare meta-features among the current set of benchmark datasets in this resource to characterise the diversity of available data. Finally, we apply a number of established machine learning methods to the entire benchmark suite and analyse how datasets and algorithms cluster in terms of performance. From this study, we find that existing benchmarks lack the diversity to properly benchmark machine learning algorithms, and there are several gaps in benchmarking problems that still need to be considered.

I have used this one; it is very convenient for generic benchmarking.
- Source at EpistasisLab/pmlb.
Google’s dataset search
Rdatasets collates all the most popular R datasets
Official statistics! Usually a dog’s breakfast in format, but interesting in content. SNStatComp/awesome-official-statistics-software: An awesome list of statistical software for creating and accessing official statistics
Reproduce someone else’s results! Figshare hosts the supporting data for many amazing papers. E.g. here’s 1.4 GB of synapses firing.
Zenodo is similar. Backed by CERN, on their infrastructure. Hosts many published scientific data sets. Discovery is not great through Zenodo itself, but if you read an interesting paper…
Machine learning cult phenomenon Kaggle now does data set cleaning and publishing: kaggle data sets, such as NOAA weather. ¹
IEEE Dataport is free for IEEE members and happily hosts 2TB datasets. It gives you a DOI and integrates with many IEEE publications, plus allows convenient access from the Amazon cloud via AWS, which might be where your data is anyway. However, they charge USD2000 for an open access version, and otherwise only other IEEE Dataport users can get at your data. I know this is not an unusual way for access to journal articles to work, but for data sets it feels like a ham-fisted way of enforcing scarcity. Not to undercut my own professional society here, but if you can do without a DOI, I will happily upload your data to AWS for you for, say, USD1500, which will pay for 2 lucrative hours of my time.
Nuit Blanche’s listing of data sets is handy if you want some good inverse-problem signal processing challenges.
beautifulpublicdata.com
datamarket
Datasets on reddit
- On and about Reddit: All the Reddit comments now in BigQuery
Quandl provides some databases.
CRSP does too? Perhaps accessible to me via Wharton?
academic torrents:
Torrent technology allows a group of editors to “seed” their own peer-reviewed published articles with just a torrent client. Each editor can have part or all of the papers stored on their desktops and have a torrent tracker to coordinate the delivery of papers without a dedicated server.
- One aim of this site is to create the infrastructure to allow open access journals to operate at low cost. By facilitating file transfers, the journal can focus on its core mission of providing world class research. After peer review the paper can be indexed on this site and disseminated throughout our system.
- Large dataset delivery can be supported by researchers in the field that have the dataset on their machine. A popular large dataset doesn’t need to be housed centrally. Researchers can have part of the dataset they are working on and they can help host it together.
- Libraries can host this data to host papers from their own campus without becoming the only source of the data. So even if a library’s system is broken other universities can participate in getting that data into the hands of researchers.
prodigy is an interactive dataset annotator for training classifiers
Various cloud compute providers host data sets conveniently close to their cloud platforms, which minimises data movement costs:
- Microsoft research open data
- Registry of open data on AWS
- Google? IBM?
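
For the PMLB suite at the top of this list, here is a minimal sketch of the "generic benchmarking" workflow. It assumes the `pmlb` Python package is installed and that `fetch_data` accepts `return_X_y` and `local_cache_dir`, as its README describes. The helper names `majority_baseline` and `pmlb_baselines` are made up for illustration. The sketch computes the accuracy of always guessing the most common class, which is the floor any classifier benchmarked on that dataset should beat.

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most common label."""
    labels = list(labels)
    if not labels:
        raise ValueError("need at least one label")
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


def pmlb_baselines(dataset_names, cache_dir=None):
    """Map each PMLB classification dataset name to its majority-class accuracy."""
    # Imported lazily so majority_baseline works without pmlb installed.
    # Assumption: pmlb exposes fetch_data with these keyword arguments.
    from pmlb import fetch_data

    results = {}
    for name in dataset_names:
        # return_X_y=True yields (features, target) arrays rather than one table.
        _, y = fetch_data(name, return_X_y=True, local_cache_dir=cache_dir)
        results[name] = majority_baseline(y.tolist())
    return results
```

Calling `pmlb_baselines(["mushroom"])` downloads the data on first use, so it needs network access; passing a `cache_dir` avoids repeated downloads.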

2 Real estate

Data Skeptic’s Open House Project
Inside AirBnB

3 Social science

SESHAT:

The Seshat Global History Databank brings together the most current and comprehensive body of knowledge about human history in one place. Our unique Databank systematically collects what is currently known about the social and political organization of human societies and how civilizations have evolved over time.

3.1 Social networks

I’m no longer in this area, so I won’t say much on this.

Social Science One is the scheme you join to have your experiments run on anonymised Facebook data on your behalf. In practice it has not been working.
UCI datasets are diverse.
Leskovec lab
- J. Yang, J. Leskovec. Temporal Variation in Online Media. ACM International Conference on Web Search and Data Mining (WSDM ’11), 2011.
- Twitter7:
  
  467 million Twitter posts from 20 million users covering a 7 month period from June 1 2009 to December 31 2009. We estimate this is about 20-30% of all public tweets published on Twitter during the particular time frame.
  
  As per request from Twitter the data is no longer available.
- higgs-twitter
  
  The Higgs dataset has been built after monitoring the spreading processes on Twitter before, during and after the announcement of the discovery of a new particle with the features of the elusive Higgs boson on 4th July 2012. The messages posted in Twitter about this discovery between 1st and 7th July 2012 are considered.
Patent citation networks (these are available and reasonably well annotated)
Wikipedia articles and their references (readily available)
- also includes easily-parseable mathematical data and theorems
- …and edit trails
- …and category annotations
- and semantic metadata
- probably more data than you can use
On that theme, Wikidata which attempts to construct a semantic graph of entity relations between things mentioned in, basically, Wikipedia.
Source code of large collaborative projects (Linux or BSD kernel, OpenOffice, Python, Perl, GCC etc.)
- can I parse such projects to see how interfaces form?
- Are there odd stylised facts about contribution to these that I might be able to explain?
- Or call-graphs?
Social Media Research Toolkit:

The Social Media Research Toolkit is a list of 50+ social media research tools curated by researchers at the Social Media Lab at Ted Rogers School of Management, Ryerson University.

So not necessarily data, but the software to get it.
Microsoft Academic Knowledge Graph

We present the Microsoft Academic Knowledge Graph (MAKG), a large RDF data set with over eight billion triples with information about scientific publications and related entities, such as authors, institutions, journals, and fields of study. The data set is based on the Microsoft Academic Graph and licensed under the Open Data Attributions license. Furthermore, we provide entity embeddings for all 210M represented scientific papers.
Free-text stuff: Some blog data set?
sods/ods: Open Data Science includes a huge number of small and useful datasets wrapped in a Python interface. The documentation is not clear or obvious, and the release schedule is abysmal. Nonetheless, pip install git+https://github.com/sods/ods.git gets you there.

3.2 Official

See SNStatComp/awesome-official-statistics-software

Awesome official statistics software

An item on this list is awesome because it is:

free, open source, available for download and

used in the production of, or provides access to, official statistics.

We prefer software that is easy to install and use and actively maintained.

The big discovery for me on that list was the SDMX standard, which promises to make official stats less quirky. There are some packages which ease use of this standard:

R package rsdmx. Access to data or metadata from statistical organisations that support SDMX webservices. The package contains a list of SDMX access points of various national and international statistical institutes.

R package readsdmx. Read SDMX into dataframes from local SDMX-ML file or web-service. Parts in C++. By OECD.

Python sdmx. Python library that implements SDMX 2.1 to explore data from SDMX data providers, parse data and metadata and convert it into Pandas objects.
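
Part of what makes SDMX less quirky is that its 2.1 REST interface uses one URL shape for every provider: a data flow identifier, then a dot-separated series key, then optional period filters. The following is a rough sketch of that URL shape using only the standard library. The function name `sdmx_data_url` is hypothetical, and the ECB entry point and series key are assumptions for illustration; check the provider's own documentation before relying on them. The packages above wrap this and also parse the response.

```python
from urllib.parse import quote, urlencode


def sdmx_data_url(base_url, flow, key="all", **params):
    """Build an SDMX 2.1 REST data query: {base}/data/{flow}/{key}?filters."""
    # Dots separate the dimensions of a series key and "+" joins alternatives
    # within one dimension, so both must survive URL quoting.
    path = "/".join(quote(part, safe=".+,") for part in (flow, key))
    url = f"{base_url.rstrip('/')}/data/{path}"
    if params:
        # Sorted so the same query always produces the same URL.
        url += "?" + urlencode(sorted(params.items()))
    return url


# Assumed entry point and key: monthly USD/EUR exchange rates from the ECB.
ECB_BASE = "https://data-api.ecb.europa.eu/service"
url = sdmx_data_url(
    ECB_BASE, "EXR", "M.USD.EUR.SP00.A",
    startPeriod="2020-01", endPeriod="2020-12",
)
print(url)
```

Leaving a dimension empty in the key (for example `M..EUR.SP00.A`) is the SDMX way of asking for every value of that dimension.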

4 Datasets about Australia

See Australia in data.

5 Time series/forecasting

Datamarket’s time series data library by Rob Hyndman. From Rob, see also a history of time series forecasting competitions.

recommended time series sources

The Mcompetitions (Makridakis, Spiliotis, and Assimakopoulos 2020) datasets are classic.

Here are some background links for perspective on these

6 3d sensor data

Stashed at 3D data.

7 Geodata

See spatial data sets

8 Causal

microsoft/csuite: CSuite: A Suite of Benchmark Datasets for Causality

9 Generic tools for construction thereof

Engauge

The Engauge Digitizer tool accepts image files (like PNG, JPEG and TIFF) containing graphs, and recovers the data points from those graphs. The resulting data points are usually used as input to other software applications. Conceptually, Engauge Digitizer is the opposite of a graphing tool that converts data points to graphs. […] an image file is imported, digitized within Engauge, and exported as a table of numeric data to a text file.

(They mean graph in the sense of plot, not in the sense of network.)
Dataset Search
fastdownload: the magic behind one of the famous 4 lines of code
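
To make the Engauge idea concrete: recovering data from a plot image comes down to calibrating each axis from two ticks whose values you know, then mapping every clicked pixel through that calibration. The sketch below is not Engauge's actual algorithm. It assumes an unrotated image with axis-aligned, linear or log10 axes, and `make_axis_map` is a made-up helper name.

```python
import math


def make_axis_map(pixel_a, value_a, pixel_b, value_b, log=False):
    """Return a function mapping a pixel coordinate to a data value on one axis.

    The axis is calibrated from two reference ticks: pixel_a shows value_a and
    pixel_b shows value_b. With log=True the axis is treated as log10-scaled.
    """
    if pixel_a == pixel_b:
        raise ValueError("reference ticks must sit at different pixels")
    if log:
        value_a, value_b = math.log10(value_a), math.log10(value_b)
    slope = (value_b - value_a) / (pixel_b - pixel_a)

    def to_value(pixel):
        v = value_a + slope * (pixel - pixel_a)
        return 10 ** v if log else v

    return to_value


# Hypothetical calibration: x runs 0..10 across pixel columns 100..500;
# y runs 0..7 from pixel row 400 up to row 50 (image rows grow downward).
x_of = make_axis_map(100, 0.0, 500, 10.0)
y_of = make_axis_map(400, 0.0, 50, 7.0)

clicked = [(100, 400), (300, 225), (500, 50)]
points = [(x_of(px), y_of(py)) for px, py in clicked]
for x, y in points:
    print(f"{x:.2f}\t{y:.2f}")
```

Swapping the order of the two reference ticks gives the same mapping, which is why the downward-growing pixel rows need no special handling.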

10 References

Brockman, Cheung, Pettersson, et al. 2016. “OpenAI Gym.” arXiv:1606.01540 [Cs].

Makridakis, Spiliotis, and Assimakopoulos. 2020. “The M4 Competition: 100,000 Time Series and 61 Forecasting Methods.” International Journal of Forecasting, M4 Competition.

Olson, La Cava, Orzechowski, et al. 2017. “PMLB: A Large Benchmark Suite for Machine Learning Evaluation and Comparison.” BioData Mining.

Footnotes

¹ Every time you say, about this data set, “this really puts the ‘cloud’ in ‘cloud computing’” a meteorologist comes over to your desk and slaps you.