Deep sets

invariant and equivariant functions

Neural learning of exchangeable functions, a certain type of symmetry constraint or respecting projectivity. Why might we do this? Well, for example, a learnable statistic of i.i.d observations is an exchangeable function of the data.


Permutation invariant observations

Zaheer et al. (2017) is a short and beautiful paper.

A function \(f: 2^{\mathfrak{X}} \rightarrow \mathcal{Y}\) acting on sets must be permutation invariant to the order of objects in the set, i.e. for any permutation \(\pi: f\left(\left\{x_1, \ldots, x_M\right\}\right)=f\left(\left\{x_{\pi(1)}, \ldots, x_{\pi(M)}\right\}\right)\).

We use functions like this all the time; empirical statistics are of this form this form. We also use permutation equivariant functions:

\(\mathbf{f}: \mathfrak{X}^M \rightarrow \mathcal{Y}^M\) that upon permutation of the input instances permutes the output labels, i.e. for any permutation \(\pi\) : \[ \mathbf{f}\left(\left[x_{\pi(1)}, \ldots, x_{\pi(M)}\right]\right)=\left[f_{\pi(1)}(\mathbf{x}), \ldots, f_{\pi(M)}(\mathbf{x})\right] \]

The main results of Zaheer et al. (2017) are:

Theorem 2: A function \(f(X)\) operating on a set \(X\) having elements from a countable universe, is a valid set function, i.e., invariant to the permutation of instances in \(X\), iff it can be decomposed in the form \(\rho\left(\sum_{x \in X} \phi(x)\right)\), for suitable transformations \(\phi\) and \(\rho\).

This is not proven in general for uncountable universes, but the authors assert it is true if some sane topology is applied to the space.


Next, we analyze the equivariant case when \(\mathfrak{X}=\mathcal{Y}=\mathbb{R}\) and \(\mathbf{f}\) is restricted to be a neural network layer. The standard neural network layer is represented as \(\mathbf{f}_{\Theta}(\mathbf{x})=\boldsymbol{\sigma}(\Theta \mathbf{x})\) where \(\Theta \in \mathbb{R}^{M \times M}\) is the weight vector and \(\sigma: \mathbb{R} \rightarrow \mathbb{R}\) is a nonlinearity such as sigmoid function. The following lemma states the necessary and sufficient conditions for permutation-equivariance in this type of function. Lemma 3: The function \(\mathbf{f}_{\Theta}: \mathbb{R}^M \rightarrow \mathbb{R}^M\) defined above is permutation equivariant iff all the offdiagonal elements of \(\Theta\) are tied together and all the diagonal elements are equal as well. That is, \(\Theta=\lambda \mathbf{I}+\gamma\left(\mathbf{1 1}^{\top}\right) \quad \lambda, \gamma \in \mathbb{R} \quad \mathbf{1}=[1, \ldots, 1]^{\top} \in \mathbb{R}^M \quad \mathbf{I} \in \mathbb{R}^{M \times M}\) is the identity matrix This result can be easily extended to higher dimensions, i.e., \(\mathfrak{X}=\mathbb{R}^d\) when \(\lambda, \gamma\) can be matrices.

Original implementation is here: manzilzaheer/DeepSets; imeplentation in terms of NNs is fairly obvious; we just make \(\phi\) and \(\rho\) into NNs.

See also Murphy et al. (2022) and Sainsbury-Dale, Zammit-Mangion, and Huser (2022).

Great explainer by the authors of Wagstaff et al. (2019), which is an essential update to Zaheer et al. (2017). DeepSets: Modeling Permutation Invariance. Update: That same blog post without broken image links. tl;dr for continuous sets we need to know an upper bound on set size.

Neural statistics

Related results in Edwards and Storkey (2017). Not much cross-citation between these lineages; I wonder about how they overlap.

Neural processes

Cool connection β€” neural process regression leverages deep sets, since the conditionals are requried to be equivariant with exchangeable inputs.


Some attention networks, in particular transformers, I have been informed by my colleagues Tom Blau and Xuesong Wang, are worth investigating for permutation invariance. Indeed Fabian Fuchs et al note that:

Self-attention via {keys, queries, and values} as in the Attention Is All You Need paper (Vaswani et al. 2017) is closely linked to Deep Sets. Self-attention is itself permutation-invariant unless you use positional encoding as often done in language applications. In a way, self-attention β€œgeneralises” the summation operation as it performs a weighted summation of different attention vectors. You can show that when setting all keys and queries to 1.0, you effectively end up with the Deep Sets architecture.


Azizian, WaΓ―ss, and Marc Lelarge. 2021. β€œExpressive Power of Invariant and Equivariant Graph Neural Networks.” arXiv.
Bloem-Reddy, Benjamin, and Yee Whye Teh. 2020. β€œProbabilistic Symmetries and Invariant Neural Networks.” arXiv.
Edwards, Harrison, and Amos Storkey. 2017. β€œTowards a Neural Statistician.” In Proceedings of ICLR.
Holderrieth, Peter, Michael J. Hutchinson, and Yee Whye Teh. 2021. β€œEquivariant Learning of Stochastic Fields: Gaussian Processes and Steerable Conditional Neural Processes.” In Proceedings of the 38th International Conference on Machine Learning, 4297–307. PMLR.
Maron, Haggai, Ethan Fetaya, Nimrod Segol, and Yaron Lipman. 2019. β€œOn the Universality of Invariant Networks.” In Proceedings of the 36th International Conference on Machine Learning, 4363–71. PMLR.
Murphy, Ryan L., Balasubramaniam Srinivasan, Vinayak Rao, and Bruno Ribeiro. 2022. β€œJanossy Pooling: Learning Deep Permutation-Invariant Functions for Variable-Size Inputs.” In.
Navon, Aviv, Aviv Shamsian, Idan Achituve, Ethan Fetaya, Gal Chechik, and Haggai Maron. 2023. β€œEquivariant Architectures for Learning in Deep Weight Spaces.” arXiv.
Pevny, Tomas, and Vojtech Kovarik. 2019. β€œApproximation Capability of Neural Networks on Spaces of Probability Measures and Tree-Structured Domains.” arXiv.
Pevny, Tomas, and Petr Somol. 2017a. β€œDiscriminative Models for Multi-Instance Problems with Tree-Structure.” arXiv.
β€”β€”β€”. 2017b. β€œUsing Neural Network Formalism to Solve Multiple-Instance Problems.” arXiv.
Sainsbury-Dale, Matthew, Andrew Zammit-Mangion, and RaphaΓ«l Huser. 2022. β€œFast Optimal Estimation with Intractable Models Using Permutation-Invariant Neural Networks.” arXiv.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. β€œAttention Is All You Need.” arXiv:1706.03762 [Cs], June.
Wagstaff, Edward, Fabian Fuchs, Martin Engelcke, Ingmar Posner, and Michael A. Osborne. 2019. β€œOn the Limitations of Representing Functions on Sets.” In Proceedings of the 36th International Conference on Machine Learning, 6487–94. PMLR.
Yarotsky, Dmitry. 2022. β€œUniversal Approximations of Invariant Maps by Neural Networks.” Constructive Approximation 55 (1): 407–74.
Zaheer, Manzil, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov, and Alexander J Smola. 2017. β€œDeep Sets.” In Advances in Neural Information Processing Systems. Vol. 30. Curran Associates, Inc.

No comments yet. Why not leave one?

GitHub-flavored Markdown & a sane subset of HTML is supported.