Data summarization

On maps drawn at smaller than 1:1 scale

January 14, 2019 — February 8, 2024

approximation
estimator distribution
functional analysis
information
linear algebra
model selection
optimization
probabilistic algorithms
probability
signal processing
sparser than thou
statistics

Summary statistics which don’t require us to keep all the data but which nonetheless allow us to do inference nearly as well. E.g. sufficient statistics in exponential families allow you to do certain kinds of inference perfectly without anything except the summaries. Methods such as variational Bayes summarize data by maintaining a posterior density as a summary of the likelihood of all the data, at some cost in accuracy. I think of these as nearly-sufficient statistics, but we could think of them as data summarization, which I note here for later reference.
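For a concrete instance of the exponential-family case: a Gaussian likelihood is summarized exactly by three numbers, however large the data set. A minimal sketch (hypothetical helper names), streaming batches through the summary and recovering the same MLE we would get from the raw data:

```python
import numpy as np

def summarize(x, stats=(0, 0.0, 0.0)):
    """Fold a batch into the sufficient statistics (n, sum, sum of squares)."""
    n, s, ss = stats
    return n + len(x), s + x.sum(), ss + (x ** 2).sum()

def gaussian_mle(stats):
    """Recover the Gaussian MLE (mean, variance) from the summaries alone."""
    n, s, ss = stats
    mean = s / n
    var = ss / n - mean ** 2
    return mean, var

rng = np.random.default_rng(0)
data = rng.normal(3.0, 2.0, size=10_000)

# Stream the data in batches; only three numbers are ever retained.
stats = (0, 0.0, 0.0)
for batch in np.array_split(data, 10):
    stats = summarize(batch, stats)

mean, var = gaussian_mle(stats)
```

Here the summaries are *exactly* sufficient; the interest below is in summaries that are only approximately so.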

TBC.

2 Coresets

Bayesian. Solve an optimisation problem to minimise the distance between the posterior conditioned on all the data and the posterior conditioned on a weighted subset.
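Bayesian coreset constructions optimise the subset and weights jointly; a much cruder sketch below just subsamples uniformly and reweights by n/m, which already illustrates the mechanics of a weighted subset standing in for the full data set (here for a k-means-type cost rather than a posterior, and with hypothetical helper names):

```python
import numpy as np

def uniform_coreset(X, m, rng):
    """Sample m points uniformly; weight each by n/m so the weighted
    cost is an unbiased estimate of the full-data cost."""
    n = len(X)
    idx = rng.choice(n, size=m, replace=False)
    weights = np.full(m, n / m)
    return X[idx], weights

def kmeans_cost(X, centers, weights=None):
    """(Weighted) sum of squared distances to the nearest center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(1)
    return d2 @ (np.ones(len(X)) if weights is None else weights)

rng = np.random.default_rng(1)
X = rng.normal(size=(20_000, 2))
centers = rng.normal(size=(5, 2))

C, w = uniform_coreset(X, 2_000, rng)
full = kmeans_cost(X, centers)
approx = kmeans_cost(C, w.shape and centers, w)
```

Real coreset methods replace the uniform sampling with sensitivity- or optimisation-based weights, giving guarantees far beyond this Monte Carlo estimate.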

3 Representative subsets

I think this is intended to be generic? See apricot:

apricot implements submodular optimization for the purpose of summarizing massive data sets into minimally redundant subsets that are still representative of the original data. These subsets are useful for both visualizing the modalities in the data (such as in the two data sets below) and for training accurate machine learning models with just a fraction of the examples and compute.
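The workhorse objective in apricot is facility location, which greedy maximization handles with a (1 − 1/e) guarantee by submodularity. A hand-rolled sketch of that greedy selection (not apricot’s API; function names are mine):

```python
import numpy as np

def facility_location_greedy(X, k):
    """Greedily pick k exemplars maximizing the facility-location
    objective: sum_i max_{j in S} sim(i, j)."""
    # Similarity = negative squared Euclidean distance, shifted to be >= 0.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    sim = d2.max() - d2
    n = len(X)
    selected = []
    covered = np.zeros(n)  # current best similarity to the selected set
    for _ in range(k):
        # Marginal gain of adding each candidate column j.
        gains = np.maximum(sim, covered[:, None]).sum(0) - covered.sum()
        j = int(np.argmax(gains))
        selected.append(j)
        covered = np.maximum(covered, sim[:, j])
    return selected

rng = np.random.default_rng(2)
# Two well-separated clusters; early picks should cover both modalities.
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
picks = facility_location_greedy(X, 4)
```

apricot itself adds lazy evaluation and other accelerations so this scales past toy sizes.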

4 By influence functions

Select data according to how much each point changes the predictions. This is a classic approach for linear models (Cook 1977), but it can also be applied to DL models. A Bayes-by-backprop approach is Nickl et al. (2023), and a heroic second-order method is Grosse et al. (2023).
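In the linear case the influence of each observation comes almost for free from the hat matrix, with no refitting. A sketch of Cook’s distance for OLS (helper name is mine), planting one gross outlier to see it dominate the ranking:

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for each observation of an OLS fit,
    computed from leverages and residuals without refitting."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix
    h = np.diag(H)                          # leverages
    resid = y - H @ y
    s2 = resid @ resid / (n - p)            # residual variance estimate
    return resid ** 2 / (p * s2) * h / (1 - h) ** 2

rng = np.random.default_rng(3)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, -1.0])
y = X @ beta + rng.normal(scale=0.5, size=n)
y[0] += 10.0                                # plant one gross outlier

D = cooks_distance(X, y)
```

The DL-scale methods cited above are, roughly, ways of approximating this leave-one-out sensitivity when the hat matrix is unavailable.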

5 References

Agrawal, Uhler, and Broderick. 2018. In International Conference on Machine Learning.
Bachem, Lucic, and Krause. 2015. In International Conference on Machine Learning.
———. 2017. arXiv Preprint arXiv:1703.06476.
Broderick, Boyd, Wibisono, et al. 2013. In Advances in Neural Information Processing Systems 26.
Campbell, and Broderick. 2017. arXiv:1710.05053 [Cs, Stat].
———. 2018. In International Conference on Machine Learning.
Cook. 1977. Technometrics.
Cortes, Kuznetsov, Mohri, et al. 2016. In Advances in Neural Information Processing Systems 29.
Grosse, Bae, Anil, et al. 2023.
Hensman, Fusi, and Lawrence. 2013. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence. UAI’13.
Huggins, Jonathan, Adams, and Broderick. 2017. In Advances in Neural Information Processing Systems 30.
Huggins, Jonathan H., Campbell, and Broderick. 2016. arXiv:1605.06423 [Cs, Stat].
Huggins, Jonathan H., Campbell, Kasprzak, et al. 2018a. arXiv:1806.10234 [Cs, Stat].
———, et al. 2018b. arXiv:1809.09505 [Cs, Math, Stat].
Nickl, Xu, Tailor, et al. 2023.
Thomas, and Cook. 1990. Technometrics.
Titsias. 2009. In International Conference on Artificial Intelligence and Statistics.