Data cleaning

90% of statistics

January 22, 2020 — January 11, 2024

information provenance
statistics

Related: outlier detection. Useful: text data wrangling.

Great Expectations verifies that your data conforms to the expectations you declare, e.g. value ranges, null rates, and distributional checks. (Keyword: pipeline testing.)
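For flavour, a minimal sketch of how such declared expectations look, assuming the classic pandas-backed Great Expectations interface (`ge.from_pandas`); newer releases restructure this around data contexts and validators, so the exact calls are version-dependent and the toy dataframe is invented for illustration:

```python
# A minimal sketch of declarative data validation with Great Expectations,
# assuming the classic pandas-backed API; exact calls vary across versions.
import pandas as pd
import great_expectations as ge

raw = pd.DataFrame(
    {"age": [34, 29, 151, None], "country": ["AU", "NZ", "AU", "DE"]}
)
df = ge.from_pandas(raw)

# Each expectation is checked immediately and recorded in the suite.
df.expect_column_values_to_not_be_null("age")
df.expect_column_values_to_be_between("age", min_value=0, max_value=120)
df.expect_column_values_to_be_in_set("country", ["AU", "NZ", "DE", "US"])

# Re-run the whole suite, e.g. as a pipeline test on fresh data.
result = df.validate()
print(result.success)  # False here: one null and one out-of-range age
```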

See this recent Snorkel review.
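As a reminder of what Snorkel-style weak supervision looks like, here is a toy sketch assuming the `snorkel >= 0.9` labeling API; the labelling functions and the spam/ham task are invented for illustration, not taken from the review:

```python
# A toy sketch of programmatic training-data creation with Snorkel,
# assuming the snorkel.labeling API (>= 0.9).
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, HAM, SPAM = -1, 0, 1

@labeling_function()
def lf_contains_link(x):
    # Messages with URLs are often spam.
    return SPAM if "http" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_very_short(x):
    # Very short messages tend to be benign.
    return HAM if len(x.text.split()) < 4 else ABSTAIN

df_train = pd.DataFrame(
    {"text": ["win cash now http://spam.example", "ok thanks", "see you at 5"]}
)
applier = PandasLFApplier(lfs=[lf_contains_link, lf_very_short])
L_train = applier.apply(df=df_train)  # (n_examples, n_lfs) label matrix

# The label model denoises and combines the noisy votes into
# probabilistic training labels.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L_train, n_epochs=100, seed=42)
probs = label_model.predict_proba(L=L_train)
```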

Industrial perspective, from Inside Palantir, Silicon Valley’s Most Secretive Unicorn:

The need for customization points to a deeper problem for Palantir. The customization is what clients like, but it’s also what could prevent the company from scaling. All those software engineers sleeping under their desks may have been great in 2005, when the company was flush with venture capital, but employing an army of humans to endlessly tweak the software doesn’t exactly presage huge profits. “I used to have a metric when I was in the government,” said the former senior intelligence official who visited Palantir’s engineers back in their sleeping-bag days. “People would come in and say, ‘We’ve got this fantastic automated translation system,’ or automated anything. I would say, ‘Does this use RFOP?’ And they would say, ‘I don’t know what that is.’ ”

The acronym stood for Rooms Full of People, meaning the army of analysts required to clean up the data and crunch the numbers. How good any given data-mining system is depends in large part on what’s lurking behind the curtain. Is it artificial intelligence parsing large data sets of complex financial transactions to find the next terrorist? Or is it a room full of eager software engineers sleeping on the floor? Palantir portrays its software as like its namesake — a crystal ball you gaze into for answers. The company emphasizes that it has reduced the time needed to get its software up and running, and former officials told me Palantir has made big improvements to its back end over the years. But the truth is that it still appears to take a lot of manual labor to make it work, and there’s nothing magical about that.

Left field: do not clean the data, but learn from the dirty data:

VoG: Variance of Gradients (Agarwal, D’souza, and Hooker 2021).

In this work, we propose Variance of Gradients (VOG) as a valuable and efficient proxy metric for detecting outliers in the data distribution. We provide quantitative and qualitative support that VOG is a meaningful way to rank data by difficulty and to surface a tractable subset of the most challenging examples for human-in-the-loop auditing. Data points with high VOG scores are far more difficult for the model to learn and over-index on corrupted or memorized examples.
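A rough sketch of the idea, assuming a PyTorch classifier and a list of checkpoint state dicts saved during training (the `model_factory` indirection is my own framing). The paper works with gradients of the pre-softmax class score w.r.t. input pixels and adds a per-class normalisation, which is omitted here:

```python
# A VoG-style difficulty score: per-example variance, across training
# checkpoints, of the input gradient of the true-class pre-softmax score.
import torch

def vog_scores(checkpoints, model_factory, x, y):
    grads = []
    for state_dict in checkpoints:
        model = model_factory()
        model.load_state_dict(state_dict)
        model.eval()
        inputs = x.clone().requires_grad_(True)
        logits = model(inputs)                          # (batch, n_classes), pre-softmax
        score = logits.gather(1, y.unsqueeze(1)).sum()  # true-class scores
        (g,) = torch.autograd.grad(score, inputs)
        grads.append(g)
    g = torch.stack(grads)                    # (n_checkpoints, batch, ...)
    var = g.var(dim=0, unbiased=True)         # per-pixel variance over training
    return var.flatten(start_dim=1).mean(dim=1)  # average over pixels -> per example

# High-scoring examples are candidates for human-in-the-loop auditing, e.g.
# scores = vog_scores(ckpts, lambda: MyNet(), x_batch, y_batch)   # MyNet is hypothetical
# worst = scores.topk(50).indices
```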

1 Incoming

2 References

Agarwal, D’souza, and Hooker. 2021. “Estimating Example Difficulty Using Variance of Gradients.” arXiv:2008.11600 [cs].
Ratner, Bach, Ehrenberg, et al. 2017. “Snorkel: Rapid Training Data Creation with Weak Supervision.” Proceedings of the VLDB Endowment.