Neural learning of exchangeable functions, a certain type of symmetry constraint or respecting projectivity. Why might we do this? Well, for example, a learnable statistic of i.i.d observations is an exchangeable function of the data.

TBD

## Permutation invariant observations

Zaheer et al. (2017) is a short and beautiful paper.

A function \(f: 2^{\mathfrak{X}} \rightarrow \mathcal{Y}\) acting on sets must be permutation invariant to the order of objects in the set, i.e. for any permutation \(\pi: f\left(\left\{x_1, \ldots, x_M\right\}\right)=f\left(\left\{x_{\pi(1)}, \ldots, x_{\pi(M)}\right\}\right)\).

We use functions like this all the time; empirical statistics are of this form this form.
We also use *permutation equivariant* functions:

\(\mathbf{f}: \mathfrak{X}^M \rightarrow \mathcal{Y}^M\) that upon permutation of the input instances permutes the output labels, i.e. for any permutation \(\pi\) : \[ \mathbf{f}\left(\left[x_{\pi(1)}, \ldots, x_{\pi(M)}\right]\right)=\left[f_{\pi(1)}(\mathbf{x}), \ldots, f_{\pi(M)}(\mathbf{x})\right] \]

The main results of Zaheer et al. (2017) are:

Theorem 2:A function \(f(X)\) operating on a set \(X\) having elements from a countable universe, is a valid set function, i.e., invariant to the permutation of instances in \(X\), iff it can be decomposed in the form \(\rho\left(\sum_{x \in X} \phi(x)\right)\), for suitable transformations \(\phi\) and \(\rho\).

This is not proven in general for uncountable universes, but the authors assert it is true if some sane topology is applied to the space.

Equivariance:

Next, we analyze the equivariant case when \(\mathfrak{X}=\mathcal{Y}=\mathbb{R}\) and \(\mathbf{f}\) is restricted to be a neural network layer. The standard neural network layer is represented as \(\mathbf{f}_{\Theta}(\mathbf{x})=\boldsymbol{\sigma}(\Theta \mathbf{x})\) where \(\Theta \in \mathbb{R}^{M \times M}\) is the weight vector and \(\sigma: \mathbb{R} \rightarrow \mathbb{R}\) is a nonlinearity such as sigmoid function. The following lemma states the necessary and sufficient conditions for permutation-equivariance in this type of function.

Lemma 3:The function \(\mathbf{f}_{\Theta}: \mathbb{R}^M \rightarrow \mathbb{R}^M\) defined above is permutation equivariant iff all the offdiagonal elements of \(\Theta\) are tied together and all the diagonal elements are equal as well. That is, \(\Theta=\lambda \mathbf{I}+\gamma\left(\mathbf{1 1}^{\top}\right) \quad \lambda, \gamma \in \mathbb{R} \quad \mathbf{1}=[1, \ldots, 1]^{\top} \in \mathbb{R}^M \quad \mathbf{I} \in \mathbb{R}^{M \times M}\) is the identity matrix This result can be easily extended to higher dimensions, i.e., \(\mathfrak{X}=\mathbb{R}^d\) when \(\lambda, \gamma\) can be matrices.

Original implementation is here: manzilzaheer/DeepSets; imeplentation in terms of NNs is fairly obvious; we just make \(\phi\) and \(\rho\) into NNs.

See also Murphy et al. (2022) and Sainsbury-Dale, Zammit-Mangion, and Huser (2022).

Great explainer by the authors of Wagstaff et al. (2019), which is an essential update to Zaheer et al. (2017).
DeepSets: Modeling Permutation Invariance.
Update: That same blog post without broken image links.
**tl;dr** for continuous sets we need to know an upper bound on set size.

## Neural statistics

Related results in Edwards and Storkey (2017). Not much cross-citation between these lineages; I wonder about how they overlap.

## Neural processes

Cool connection β neural process regression leverages deep sets, since the conditionals are requried to be equivariant with exchangeable inputs.

## Attention

Some attention networks, in particular transformers, I have been informed by my colleagues Tom Blau and Xuesong Wang, are worth investigating for permutation invariance. Indeed Fabian Fuchs et al note that:

Self-attention via {keys, queries, and values} as in the Attention Is All You Need paper (Vaswani et al. 2017) is closely linked to Deep Sets. Self-attention is itself permutation-invariant unless you use positional encoding as often done in language applications. In a way, self-attention βgeneralisesβ the summation operation as it performs a weighted summation of different attention vectors. You can show that when setting all keys and queries to 1.0, you effectively end up with the Deep Sets architecture.

## References

*Proceedings of ICLR*.

*Proceedings of the 38th International Conference on Machine Learning*, 4297β307. PMLR.

*Proceedings of the 36th International Conference on Machine Learning*, 4363β71. PMLR.

*arXiv:1706.03762 [Cs]*, June.

*Proceedings of the 36th International Conference on Machine Learning*, 6487β94. PMLR.

*Constructive Approximation*55 (1): 407β74.

*Advances in Neural Information Processing Systems*. Vol. 30. Curran Associates, Inc.

## No comments yet. Why not leave one?