Data summarization

On maps drawn at smaller than 1:1 scale

Summary statistics that do not require us to keep all the data but that nonetheless allow us to do inference nearly as well. For example, sufficient statistics in exponential families allow us to do certain kinds of inference perfectly using nothing but the summaries. Methods such as variational Bayes summarize data by maintaining a posterior density as a summary of the whole data likelihood, at some cost in accuracy. I think of these as nearly sufficient statistics, but we could also think of them as data summarization, which I note here for later reference.
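As a concrete instance of the exponential-family case, here is a minimal sketch for an i.i.d. Gaussian sample, where the triple (count, sum, sum of squares) is sufficient. The class name `GaussianSummary` and its methods are made up for illustration; the point is only that the maximum-likelihood estimates can be recovered from the summary, and that summaries of separate chunks merge by addition.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class GaussianSummary:
    """Sufficient statistics (n, sum x, sum x^2) for an i.i.d. Gaussian sample.

    Hypothetical helper for illustration only.
    """
    n: int = 0
    sum_x: float = 0.0
    sum_x2: float = 0.0

    def update(self, x: float) -> None:
        # Absorb one observation; the observation itself can then be discarded.
        self.n += 1
        self.sum_x += x
        self.sum_x2 += x * x

    def merge(self, other: "GaussianSummary") -> "GaussianSummary":
        # Summaries of disjoint chunks combine by simple addition.
        return GaussianSummary(
            self.n + other.n,
            self.sum_x + other.sum_x,
            self.sum_x2 + other.sum_x2,
        )

    def mle(self) -> Tuple[float, float]:
        # Maximum-likelihood mean and variance from the summary alone.
        # (The sum-of-squares formula can lose precision when the variance
        # is tiny relative to the mean; fine for a sketch.)
        mean = self.sum_x / self.n
        var = self.sum_x2 / self.n - mean * mean
        return mean, var


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
first, second = GaussianSummary(), GaussianSummary()
for x in data[:4]:
    first.update(x)
for x in data[4:]:
    second.update(x)

mean, var = first.merge(second).mle()
print(mean, var)
```

The two halves of the data are summarized independently and only the summaries are combined, which is the property that makes this kind of statistic convenient for streaming or distributed settings.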



Bayesian. Solve an optimisation problem to minimise the distance between the posterior given all the data and the posterior given a weighted subset.
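To make the optimisation concrete, here is a toy sketch, not the algorithm of any particular paper. It assumes a unit-variance Gaussian location model, picks the subset uniformly at random, and measures "distance" as the squared difference between the full and weighted-subset log-likelihoods at a handful of parameter draws. All of these choices, and every variable name, are assumptions made for illustration. The non-negative weights are fitted by projected gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (assumption): x_n ~ Normal(theta, 1).
N, M, S = 2000, 20, 50  # data size, subset size, number of parameter draws
x = rng.normal(loc=1.5, scale=1.0, size=N)

# Parameter draws from a crude guess at the posterior; they only define
# where the two log-likelihoods are asked to agree.
thetas = rng.normal(loc=x[:100].mean(), scale=0.2, size=S)


def loglik(points, thetas):
    # Shape (S, len(points)): log-likelihood of each point at each draw,
    # up to an additive constant.
    return -0.5 * (points[None, :] - thetas[:, None]) ** 2


def centre(L):
    # Terms constant in theta do not affect the posterior, so remove them.
    return L - L.mean(axis=0, keepdims=True)


idx = rng.choice(N, size=M, replace=False)      # random candidate subset
target = centre(loglik(x, thetas)).sum(axis=1)  # full-data log-likelihood, shape (S,)
A = centre(loglik(x[idx], thetas))              # subset log-likelihoods, shape (S, M)

# Projected gradient descent for
#   min_w 0.5 * ||A w - target||^2  subject to  w >= 0.
w_uniform = np.full(M, N / M)                   # plain subsampling weights
w = w_uniform.copy()
step = 1.0 / np.linalg.norm(A, 2) ** 2          # 1 / Lipschitz constant of the gradient
for _ in range(5000):
    grad = A.T @ (A @ w - target)
    w = np.maximum(w - step * grad, 0.0)        # project onto the non-negative orthant

err_uniform = np.linalg.norm(A @ w_uniform - target)
err_fitted = np.linalg.norm(A @ w - target)
print(err_uniform, err_fitted)
```

The fitted weights should approximate the full-data log-likelihood at the sampled parameters at least as well as uniform subsampling weights do. Practical constructions also choose which points enter the subset rather than drawing them at random.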

Representative subsets

I think this is intended to be generic? See apricot:

apricot implements submodular optimization for the purpose of summarizing massive data sets into minimally redundant subsets that are still representative of the original data. These subsets are useful for both visualizing the modalities in the data (such as in the two data sets below) and for training accurate machine learning models with just a fraction of the examples and compute.
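For intuition about what submodular selection does, here is a small NumPy sketch of greedy maximisation of a facility-location objective, one common submodular function for this task. It deliberately does not use apricot's API; the function name `greedy_facility_location` and the similarity choice (shifted negative squared Euclidean distance) are assumptions for illustration.

```python
import numpy as np


def greedy_facility_location(X, k):
    """Pick k row indices of X by greedily maximising a facility-location objective.

    Illustrative helper, not part of any library.
    """
    # Pairwise squared distances, turned into non-negative similarities.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    sim = sq.max() - sq

    best = np.zeros(len(X))  # best similarity of each point to the chosen set so far
    chosen = []
    for _ in range(k):
        # Marginal gain of adding candidate j: sum_i max(sim[i, j] - best[i], 0).
        gains = np.maximum(sim - best[:, None], 0.0).sum(axis=0)
        if chosen:
            gains[chosen] = -np.inf  # never pick the same point twice
        j = int(np.argmax(gains))
        chosen.append(j)
        best = np.maximum(best, sim[:, j])
    return chosen


# Two well-separated clusters of 50 points each.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.1, size=(50, 2)),
    rng.normal(5.0, 0.1, size=(50, 2)),
])
subset = greedy_facility_location(X, k=2)
print(subset)
```

With two clusters and a budget of two points, the greedy rule picks one representative from each cluster, since a second point from an already-covered cluster adds almost nothing to the objective. This version builds the full pairwise similarity matrix, so it is only suitable for small data sets.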


