Summary statistics which don't require you to keep all the data, but which allow you to do inference nearly as well. E.g. sufficient statistics in exponential families allow you to do certain kinds of inference perfectly without anything except the summaries. Methods such as variational Bayes summarize the data by maintaining a posterior density (usually a mixture model) as a summary of all the data, at some cost in accuracy. I think of these as *nearly sufficient statistics*, but there are other framings, such as *data summarization*, which I am going to note here for later reference.
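To make the exponential-family case concrete, here is a minimal sketch of my own (not taken from any of the referenced papers): for a Gaussian likelihood with known variance, the count and the sum are sufficient for the mean, so the conjugate posterior computed from those two numbers is exactly the posterior you would get from the full data set.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=10_000)

# Sufficient statistics for a Gaussian likelihood with known variance:
# the count and the sum. We can discard the raw data after computing them.
n, s = data.size, data.sum()

# Conjugate update for the mean under a N(mu0, tau0^2) prior,
# with known observation variance sigma^2 = 1.
mu0, tau0_sq, sigma_sq = 0.0, 10.0, 1.0
post_var = 1.0 / (1.0 / tau0_sq + n / sigma_sq)
post_mean = post_var * (mu0 / tau0_sq + s / sigma_sq)

# The posterior built from (n, s) matches the full-data posterior exactly;
# no information about the mean has been lost.
print(post_mean, post_var)
```

The interesting question, pursued below, is how much of this exactness we can salvage outside the exponential family.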

- Approximate Bayesian Computation
- *inducing sets*, as seen in Gaussian processes
- *coresets*, as seen in Bayesian linear models
- probabilistic deep learning possibly does this
- Bounded Memory Learning considers this from a computational-complexity standpoint: which hypotheses can be learned from data subsets?

TBC.

## Coresets

A Bayesian approach: solve an optimisation problem to minimise the distance between the posterior conditioned on all the data and the posterior conditioned on a weighted subset of it.
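A toy sketch of the flavour of this, loosely in the spirit of the Hilbert-coreset constructions of Campbell and Broderick (all names and the embedding choice here are my own illustration, not their algorithm): embed each data point as the vector of its log-likelihoods at a few random parameter values, then greedily pick a small weighted subset whose embeddings reconstruct the full-data log-likelihood vector.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(1)

# Toy Gaussian location model. Embed each data point as the vector of its
# log-likelihoods at S random parameter values -- a crude finite-dimensional
# stand-in for a Hilbert-space embedding of the log-likelihood functions.
x = rng.normal(1.0, 1.0, size=500)              # data
theta = rng.normal(0.0, 3.0, size=50)           # random evaluation points
L = -0.5 * (x[:, None] - theta[None, :]) ** 2   # L[i, s] = log N(x_i | theta_s, 1)

target = L.sum(axis=0)                          # full-data log-likelihood vector

# Greedy coreset: repeatedly add the point whose embedding best explains the
# current residual, then refit nonnegative weights on the selected subset.
selected = []
for _ in range(10):
    w = np.zeros(len(x))
    if selected:
        w_sel, _ = nnls(L[selected].T, target)
        w[selected] = w_sel
    resid = target - L.T @ w
    scores = L @ resid
    scores[selected] = -np.inf
    selected.append(int(np.argmax(scores)))

w_sel, _ = nnls(L[selected].T, target)
rel_err = np.linalg.norm(L[selected].T @ w_sel - target) / np.linalg.norm(target)
print(f"{len(selected)} of {len(x)} points, relative error {rel_err:.3f}")
```

A handful of weighted points suffices here because the Gaussian log-likelihoods span a low-dimensional function space; the papers below handle the general case with guarantees.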

## Directly approximate log likelihood

See nearly sufficient statistics.
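The PASS-GLM idea of Huggins, Adams, and Broderick is a clean example of this: replace each per-datum log-likelihood with a low-order polynomial, so that the total log-likelihood depends on the data only through a few moment sums. Here is a rough sketch under my own choices (a least-squares quadratic fit on a fixed interval; the paper uses Chebyshev approximations and gives error bounds):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy logistic regression with labels in {-1, +1}.
d, n = 3, 20_000
beta_true = np.array([1.0, -0.5, 0.25])
X = rng.normal(size=(n, d))
p = 1.0 / (1.0 + np.exp(-X @ beta_true))
y = np.where(rng.random(n) < p, 1.0, -1.0)

# Approximate phi(t) = log sigmoid(t), where t = y_i <x_i, beta>, by a
# quadratic fitted by least squares on [-4, 4].
t_grid = np.linspace(-4, 4, 200)
phi = -np.log1p(np.exp(-t_grid))
A = np.vander(t_grid, 3)                   # columns: t^2, t, 1
c2, c1, c0 = np.linalg.lstsq(A, phi, rcond=None)[0]

# Under the quadratic, the log-likelihood depends on the data only through
# s = sum_i y_i x_i and M = sum_i x_i x_i^T (since y_i^2 = 1): approximate
# sufficient statistics, computable in one streaming pass.
s = (y[:, None] * X).sum(axis=0)
M = X.T @ X

# Maximise n*c0 + c1 <s, beta> + c2 beta^T M beta in closed form (c2 < 0).
beta_hat = np.linalg.solve(-2 * c2 * M, c1 * s)
print(beta_hat)
```

The estimate from the two moment sums recovers the direction of the true coefficients well, at a fixed memory cost independent of n; how close it gets in magnitude depends on the quality of the polynomial approximation.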

Agrawal, Raj, Caroline Uhler, and Tamara Broderick. 2018. "Minimal I-MAP MCMC for Scalable Structure Discovery in Causal DAG Models." In *International Conference on Machine Learning*, 89–98. http://proceedings.mlr.press/v80/agrawal18a.html.

Bachem, Olivier, Mario Lucic, and Andreas Krause. 2015. "Coresets for Nonparametric Estimation - the Case of DP-Means." In *International Conference on Machine Learning*, 209–17. http://proceedings.mlr.press/v37/bachem15.html.

———. 2017. "Practical Coreset Constructions for Machine Learning." *arXiv Preprint arXiv:1703.06476*. https://arxiv.org/abs/1703.06476.

Broderick, Tamara, Nicholas Boyd, Andre Wibisono, Ashia C Wilson, and Michael I Jordan. 2013. "Streaming Variational Bayes." In *Advances in Neural Information Processing Systems 26*, edited by C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, 1727–35. Curran Associates, Inc. http://papers.nips.cc/paper/4980-streaming-variational-bayes.pdf.

Campbell, Trevor, and Tamara Broderick. 2017. "Automated Scalable Bayesian Inference via Hilbert Coresets," October. http://arxiv.org/abs/1710.05053.

———. 2018. "Bayesian Coreset Construction via Greedy Iterative Geodesic Ascent." In *International Conference on Machine Learning*, 698–706. http://proceedings.mlr.press/v80/campbell18a.html.

Cortes, Corinna, Vitaly Kuznetsov, Mehryar Mohri, and Scott Yang. 2016. "Structured Prediction Theory Based on Factor Graph Complexity." In *Advances in Neural Information Processing Systems 29*, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 2514–22. Curran Associates, Inc. http://papers.nips.cc/paper/6485-structured-prediction-theory-based-on-factor-graph-complexity.pdf.

Hensman, James, Nicolo Fusi, and Neil D. Lawrence. 2013. "Gaussian Processes for Big Data." In *Uncertainty in Artificial Intelligence*, 282. Citeseer.

Huggins, Jonathan, Ryan P Adams, and Tamara Broderick. 2017. "PASS-GLM: Polynomial Approximate Sufficient Statistics for Scalable Bayesian GLM Inference." In *Advances in Neural Information Processing Systems 30*, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 3611–21. Curran Associates, Inc. http://papers.nips.cc/paper/6952-pass-glm-polynomial-approximate-sufficient-statistics-for-scalable-bayesian-glm-inference.pdf.

Huggins, Jonathan H., Trevor Campbell, and Tamara Broderick. 2016. "Coresets for Scalable Bayesian Logistic Regression," May. http://arxiv.org/abs/1605.06423.

Huggins, Jonathan H., Trevor Campbell, Mikołaj Kasprzak, and Tamara Broderick. 2018a. "Scalable Gaussian Process Inference with Finite-Data Mean and Variance Guarantees," June. http://arxiv.org/abs/1806.10234.

———. 2018b. "Practical Bounds on the Error of Bayesian Posterior Approximations: A Nonasymptotic Approach," September. http://arxiv.org/abs/1809.09505.

Titsias, Michalis K. 2009. "Variational Learning of Inducing Variables in Sparse Gaussian Processes." In *International Conference on Artificial Intelligence and Statistics*, 567–74. http://www.jmlr.org/proceedings/papers/v5/titsias09a/titsias09a.pdf.