Data summarization
a.k.a. Data distillation
January 14, 2019 — November 26, 2024
Summary statistics allow us to discard the data and still do inference nearly as well; e.g. sufficient statistics in exponential families allow us to do certain kinds of inference perfectly from summaries alone. Most statistical problems are harder than that, though. Methods such as variational Bayes summarise the posterior by maintaining an approximating density that encodes the information in the data likelihood, at some cost in accuracy. Sometimes the best summary of the posterior is not a density at all, but something like a smaller version of the dataset itself.
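As a minimal illustration of the sufficient-statistics case (a sketch in plain numpy; the model and prior parameters are arbitrary assumptions): for a Gaussian likelihood with known variance and a conjugate Gaussian prior on the mean, the posterior depends on the data only through the count and the sample sum, so we can discard the raw data and lose nothing.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=1_000)   # raw data; noise sd assumed known

# Sufficient statistics for the Gaussian-mean model: (n, sum of x).
n, sum_x = len(x), x.sum()
del x  # the raw data are no longer needed

# Conjugate prior over the mean: N(mu0, tau0^2); likelihood sd sigma is known.
mu0, tau0, sigma = 0.0, 10.0, 1.0
posterior_precision = 1.0 / tau0**2 + n / sigma**2
posterior_mean = (mu0 / tau0**2 + sum_x / sigma**2) / posterior_precision
posterior_sd = posterior_precision ** -0.5
print(f"posterior over the mean: N({posterior_mean:.3f}, {posterior_sd:.3f}^2)")
```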
Outside of the Bayesian setting, we might care not about a posterior density but about some measure of predictive performance. I understand that this works as well, but I know less about the details.
There are many variants of this idea:
- inducing sets, as seen in sparse Gaussian process approximations (sketched just below)
- coresets
- Bounded Memory Learning considers the idea of using only some of the data from a computational-complexity standpoint. Is that related?
- Some dimension reductions are data summarisation but in a different sense than in this notebook; the summaries no longer look like the data.
TBC.
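To make the inducing-set idea concrete, here is a rough numpy sketch of a subset-of-regressors / Nyström-style sparse GP predictor, in which a handful of inducing inputs (here just a random subset of the training inputs) stand in for the full dataset. The kernel, its lengthscale, the noise level and the number of inducing points are all placeholder assumptions.

```python
import numpy as np

def rbf(A, B, ell=0.5):
    """Squared-exponential kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)
sigma2 = 0.1**2                                  # assumed known noise variance

# Inducing inputs: here simply a random subset of the training inputs.
Z = X[rng.choice(len(X), size=20, replace=False)]

Kuu = rbf(Z, Z) + 1e-6 * np.eye(len(Z))          # jitter for numerical stability
Kuf = rbf(Z, X)

# Subset-of-regressors posterior weights on the inducing set.
A = Kuu + Kuf @ Kuf.T / sigma2
b = Kuf @ y / sigma2
alpha = np.linalg.solve(A, b)

# Predictions now touch only the 20 inducing inputs, not the 500 data points.
Xtest = np.linspace(-3, 3, 9)[:, None]
f_mean = rbf(Xtest, Z) @ alpha
print(np.c_[Xtest, f_mean])
```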
1 Bayes duality
Not sure what this is, but it sounds relevant.
The new learning paradigm will be based on a new principle of machine learning, which we call the Bayes-Duality principle and will develop during this project. Conceptually, the new principle hinges on the fundamental idea that an AI should be capable of efficiently preserving and acquiring the relevant past knowledge, for a quick adaptation in the future. We will apply the principle to representation of the past knowledge, faithful transfer to new situations, and collection of new knowledge whenever necessary. Current Deep-learning methods lack these mechanisms and instead focus on brute-force data collection and training. Bayes-Duality aims to fix these deficiencies.
Connection to Bayes by Backprop, continual learning, and neural memory?
2 Coresets
Bayesian. Solve an optimisation problem to minimise the distance between the posterior computed from all the data and the posterior computed from a weighted subset.
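A rough sketch of the flavour (not any particular package's API): following the Hilbert-coreset idea, represent each datum by its log-likelihood evaluated at a bank of parameter samples, then greedily pick points and re-fit non-negative weights so that the weighted subset's log-likelihood vector approximates the full-data one. The Gaussian model, the sample bank and the greedy rule below are all placeholders.

```python
import numpy as np
from scipy.optimize import nnls
from scipy.stats import norm

rng = np.random.default_rng(2)
x = rng.normal(1.0, 1.0, size=2000)               # data for a Gaussian-mean model

# Represent each datum by its log-likelihood at a bank of parameter samples
# (drawn here from a crude posterior approximation; an assumption).
thetas = rng.normal(x.mean(), 3 * x.std() / np.sqrt(len(x)), size=50)
L = norm.logpdf(x[:, None], loc=thetas[None, :], scale=1.0)   # shape (N, S)
target = L.sum(axis=0)                            # full-data log-likelihood vector

# Greedily grow the coreset; at each step re-fit non-negative weights so the
# weighted coreset log-likelihood matches the full-data one as well as possible.
selected, coreset_size = [], 30
for _ in range(coreset_size):
    if selected:
        w, _ = nnls(L[selected].T, target)
        residual = target - w @ L[selected]
    else:
        residual = target.copy()
    scores = L @ residual                         # datum most aligned with the residual
    scores[selected] = -np.inf
    selected.append(int(scores.argmax()))

weights, _ = nnls(L[selected].T, target)
approx = weights @ L[selected]
print("relative error:", np.linalg.norm(target - approx) / np.linalg.norm(target))
```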
3 Representative subsets
I think this is intended to be generic, i.e. not necessarily Bayesian. See apricot:
apricot implements submodular optimisation for the purpose of summarising massive data sets into minimally redundant subsets that are still representative of the original data. These subsets are useful for both visualising the modalities in the data (such as in the two data sets below) and for training accurate machine learning models with just a fraction of the examples and compute.
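For flavour, here is a pure-numpy sketch of the kind of greedy submodular selection apricot performs, maximising a facility-location objective over a similarity matrix; the RBF similarity and the toy data are arbitrary stand-ins, not apricot's actual API.

```python
import numpy as np

def greedy_facility_location(X, k):
    """Greedy maximisation of the facility-location objective
    sum_i max_{j in S} sim(x_i, x_j), a standard submodular summary criterion."""
    # Similarity: an RBF kernel on Euclidean distance (an arbitrary choice).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    sim = np.exp(-d2 / d2.mean())
    best = np.zeros(len(X))            # current best similarity to the chosen subset
    chosen = []
    for _ in range(k):
        # Exact marginal gain in the objective from adding each candidate j.
        gains = np.maximum(sim, best[:, None]).sum(axis=0) - best.sum()
        gains[chosen] = -np.inf
        j = int(gains.argmax())
        chosen.append(j)
        best = np.maximum(best, sim[:, j])
    return chosen

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(m, 0.3, size=(200, 2)) for m in (-2, 0, 2)])
subset = greedy_facility_location(X, k=10)   # indices of a representative subset
print(subset)
```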
4 By influence functions
Include data based on how much it changes the predictions. This is the classic approach (Cook 1977; Thomas and Cook 1990) for linear models. Apparently it can also be applied to DL models. A Bayes-by-backprop approach is Nickl et al. (2023), and a heroic second-order method is Grosse et al. (2023).
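For the classic linear-model case, the influence of each observation can be read off the hat matrix. A small numpy sketch of Cook's distance on a simulated regression (the design and noise level are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])   # design with intercept
beta = np.array([1.0, 2.0, -1.0])
y = X @ beta + rng.normal(scale=0.5, size=n)

# Hat matrix, leverages, and OLS residuals.
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)
e = y - H @ y
s2 = e @ e / (n - p)                   # residual variance estimate

# Cook's distance: how much the fitted values move if observation i is deleted.
cooks_d = (e**2 / (p * s2)) * (h / (1 - h) ** 2)
most_influential = np.argsort(cooks_d)[::-1][:5]
print(most_influential, cooks_d[most_influential])
```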
5 Data attribution
Generalised influence functions.