Data cleaning
90% of statistics
January 22, 2020 — January 11, 2024
Related: outlier detection. Useful: text data wrangling
Great Expectations verifies that your data conforms to the distributions you specify. (Keyword: pipeline testing.)
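To make the pipeline-testing idea concrete, here is a minimal sketch of expectation-style checks in plain Python. It is not the Great Expectations API; the function names (`expect_values_between`, `expect_mean_between`), the result dictionaries and the toy rows are all hypothetical, chosen only to illustrate the pattern of declaring what the data should look like and collecting failures.

```python
from statistics import fmean


def expect_values_between(rows, column, low, high):
    """Every value in `column` should lie in [low, high]; None counts as a failure."""
    unexpected = [
        r[column] for r in rows
        if r[column] is None or not (low <= r[column] <= high)
    ]
    return {
        "check": f"{column} in [{low}, {high}]",
        "success": not unexpected,
        "unexpected": unexpected,
    }


def expect_mean_between(rows, column, low, high):
    """The mean of the non-missing values in `column` should lie in [low, high]."""
    values = [r[column] for r in rows if r[column] is not None]
    mean = fmean(values)
    return {
        "check": f"mean({column}) in [{low}, {high}]",
        "success": low <= mean <= high,
        "observed": mean,
    }


# Hypothetical toy data: the -1 age is a sentinel value that slipped through.
rows = [
    {"age": 34, "height_cm": 171.0},
    {"age": 29, "height_cm": 180.5},
    {"age": -1, "height_cm": 165.0},
]

report = [
    expect_values_between(rows, "age", 0, 120),
    expect_mean_between(rows, "height_cm", 150, 190),
]
failed = [r["check"] for r in report if not r["success"]]
```

Running checks like these at each stage of a pipeline, and failing loudly when `failed` is non-empty, is the general shape of the idea; a real tool adds storage, reporting and many more expectation types.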
skrub: Prepping tables for machine learning. skrub is a sklearn-compatible system.
Edwin de Jonge and Mark van der Loo, Data cleaning with R.
Kieran Healy’s walk-through, Unhappy in its Own Way, is full of useful tips.
Snorkel is a hybrid method that, as far as I can tell, iteratively refines weak labels:
Today’s state-of-the-art machine learning models require massive labeled training sets — which usually do not exist for real-world applications. Instead, Snorkel is based around the new data programming paradigm, in which the developer focuses on writing a set of labeling functions, which are just scripts that programmatically label data. The resulting labels are noisy, but Snorkel automatically models this process—learning, essentially, which labeling functions are more accurate than others—and then uses this to train an end model (for example, a deep neural network in TensorFlow).
Surprisingly, by modeling a noisy training set creation process in this way, we can take potentially low-quality labeling functions from the user, and use these to train high-quality end models. We see Snorkel as providing a general framework for many weak supervision techniques, and as defining a new programming model for weakly-supervised machine learning systems.
See this recent Snorkel review.
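A rough sketch of what "labeling functions" look like, in plain Python rather than the Snorkel API. Each function votes for a class or abstains, and the votes are combined. Note the swap: Snorkel, per the description above, learns which labeling functions are more accurate; this sketch uses a simple majority vote as a stand-in. The heuristics, label names and messages are invented for illustration.

```python
from collections import Counter

ABSTAIN, HAM, SPAM = -1, 0, 1


def lf_contains_link(text):
    """Vote SPAM when the message contains a URL, otherwise abstain."""
    return SPAM if "http" in text.lower() else ABSTAIN


def lf_mentions_prize(text):
    """Vote SPAM when the message mentions a prize, otherwise abstain."""
    return SPAM if "prize" in text.lower() else ABSTAIN


def lf_short_message(text):
    """Vote HAM for very short messages (fewer than 5 words), otherwise abstain."""
    return HAM if len(text.split()) < 5 else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_message]


def majority_vote(text, lfs=LABELING_FUNCTIONS):
    """Combine noisy votes; abstain when nobody votes or the top two classes tie."""
    votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN
    return counts[0][0]


messages = [
    "Claim your prize at http://example.com now",
    "see you soon",
    "The quarterly report is attached for your review today",
]
weak_labels = [majority_vote(m) for m in messages]
```

The resulting `weak_labels` would then serve as noisy training targets for an end model, with abstentions dropped or handled separately.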
Industrial perspective: Inside Palantir, Silicon Valley’s Most Secretive Unicorn:
The need for customization points to a deeper problem for Palantir. The customization is what clients like, but it’s also what could prevent the company from scaling. All those software engineers sleeping under their desks may have been great in 2005, when the company was flush with venture capital, but employing an army of humans to endlessly tweak the software doesn’t exactly presage huge profits. “I used to have a metric when I was in the government,” said the former senior intelligence official who visited Palantir’s engineers back in their sleeping-bag days. “People would come in and say, ‘We’ve got this fantastic automated translation system,’ or automated anything. I would say, ‘Does this use RFOP?’ And they would say, ‘I don’t know what that is.’ ”
The acronym stood for Rooms Full of People, meaning the army of analysts required to clean up the data and crunch the numbers. How good any given data-mining system is depends in large part on what’s lurking behind the curtain. Is it artificial intelligence parsing large data sets of complex financial transactions to find the next terrorist? Or is it a room full of eager software engineers sleeping on the floor? Palantir portrays its software as like its namesake — a crystal ball you gaze into for answers. The company emphasizes that it has reduced the time needed to get its software up and running, and former officials told me Palantir has made big improvements to its back end over the years. But the truth is that it still appears to take a lot of manual labour to make it work, and there’s nothing magical about that.
Left field: do not clean the data, but learn from dirty data:
- dirty_cat: machine learning on dirty categories
- Machine-learning on dirty data in Python: a tutorial
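One way to learn from dirty categories without cleaning them first is to encode each messy string by its similarity to a handful of reference categories, so that typos and variants land near their intended category instead of becoming brand-new one-hot columns. The following is a minimal sketch of that idea using character 3-gram Jaccard similarity. It is not dirty_cat's implementation or API; the function names and example strings are hypothetical.

```python
def ngrams(s, n=3):
    """Set of character n-grams of a lower-cased, space-padded string."""
    s = f" {s.strip().lower()} "
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings, in [0, 1]."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 1.0  # two empty strings are treated as identical
    return len(ga & gb) / len(ga | gb)


def similarity_encode(values, prototypes, n=3):
    """Encode each value as a vector of similarities to the prototype categories."""
    return [[ngram_similarity(v, p, n) for p in prototypes] for v in values]


# Hypothetical dirty column and reference categories.
prototypes = ["police officer", "teacher"]
dirty_values = ["Police Officer", "polce officer", "teachr"]
encoded = similarity_encode(dirty_values, prototypes)
```

Each row of `encoded` is a numeric feature vector that can be fed to a downstream model; the misspelled "polce officer" stays much closer to "police officer" than to "teacher".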
VoG: Variance Of Gradients (Agarwal, D’souza, and Hooker 2021).
In this work, we propose Variance of Gradients (VOG) as a valuable and efficient proxy metric for detecting outliers in the data distribution. We provide quantitative and qualitative support that VOG is a meaningful way to rank data by difficulty and to surface a tractable subset of the most challenging examples for human-in-the-loop auditing. Data points with high VOG scores are far more difficult for the model to learn and over-index on corrupted or memorized examples
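As I understand the idea, the score for one example is computed from the gradient of the model's output with respect to that example's input, saved at several training checkpoints: take the variance across checkpoints for each input dimension, then average over dimensions. A toy sketch under those assumptions follows. The gradients are made-up numbers standing in for saved checkpoints, the function name `vog_score` is hypothetical, and the paper's further details (such as any normalization) are omitted.

```python
from statistics import fmean, pvariance


def vog_score(gradients_per_checkpoint):
    """Average, over input dimensions, of the gradient variance across checkpoints.

    `gradients_per_checkpoint` is a list with one gradient vector (list of floats)
    per checkpoint, all for the same example.
    """
    per_dimension = zip(*gradients_per_checkpoint)  # regroup values by input dimension
    return fmean(pvariance(values) for values in per_dimension)


# Made-up gradients for two examples at three checkpoints (assumption, not real data).
saved_gradients = {
    "easy_example": [[0.5, -0.2], [0.5, -0.2], [0.5, -0.2]],
    "hard_example": [[0.9, -0.8], [-0.4, 0.6], [0.1, -0.1]],
}

scores = {name: vog_score(grads) for name, grads in saved_gradients.items()}

# Highest-scoring examples first: candidates for human-in-the-loop auditing.
ranking = sorted(scores, key=scores.get, reverse=True)
```

Examples whose gradients keep changing across checkpoints float to the top of `ranking`, which is the subset one would hand to a human for inspection.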
1 Incoming
- WebPlotDigitizer.
- Cleanlab: “We publish research, develop open source tools, and design interfaces to help you improve the quality of your datasets and diagnose various issues in them.” For example, ActiveLab: Active Learning with Data Re-Labeling.