Neural mixtures of experts

Switching regression, mixture of experts

2016-03-29 — 2024-06-11

Wherein a consortium of expert networks is combined by a learned gating function, trained to allocate each input to specialists for distinct input regimes, so that model selection is performed per input at inference time.

Bayes
classification
clustering
compsci
density
information
linear algebra
model selection
nonparametric
optimization
particle
probability
regression
sparser than thou
statistics

Mixtures or model combinations in which the gating/mixing function is itself learned, so that different experts can specialise in different regions of the input space.

Placeholder.
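To make the gating idea concrete, here is a minimal sketch of a softmax-gated mixture of linear experts in NumPy, with optional top-k sparse routing in the spirit of Shazeer et al. (2017). The class and parameter names are invented for illustration; this covers only the forward pass, not training or load balancing.

```python
import numpy as np

rng = np.random.default_rng(0)


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


class MixtureOfExperts:
    """Softmax-gated mixture of linear experts (illustrative sketch only)."""

    def __init__(self, d_in, d_out, n_experts):
        # Gating network: one linear map from input to per-expert logits.
        self.W_gate = rng.normal(scale=0.1, size=(d_in, n_experts))
        # Experts: each is a linear map for simplicity.
        self.W_experts = rng.normal(scale=0.1, size=(n_experts, d_in, d_out))

    def __call__(self, x, top_k=None):
        # x: (batch, d_in)
        gates = softmax(x @ self.W_gate)  # (batch, n_experts)
        if top_k is not None:
            # Sparse routing: zero out all but the top-k gate weights per
            # example and renormalise, so only k experts would need to be
            # evaluated per input in a real implementation.
            drop_idx = np.argsort(gates, axis=-1)[:, :-top_k]
            np.put_along_axis(gates, drop_idx, 0.0, axis=-1)
            gates = gates / gates.sum(axis=-1, keepdims=True)
        # Evaluate every expert (dense, for clarity) and mix their outputs
        # by the gate weights.
        expert_out = np.einsum("bi,eio->beo", x, self.W_experts)  # (batch, E, d_out)
        return np.einsum("be,beo->bo", gates, expert_out)  # (batch, d_out)


moe = MixtureOfExperts(d_in=4, d_out=2, n_experts=8)
x = rng.normal(size=(5, 4))
print(moe(x).shape)           # dense mixture: (5, 2)
print(moe(x, top_k=2).shape)  # sparse top-2 routing: (5, 2)
```

In a trained model the gate and experts are fitted jointly (e.g. by gradient descent on the mixture output, or EM in the classical switching-regression setting), and the sparse variant only evaluates the selected experts rather than all of them as this sketch does.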


References

Cai, Jiang, Wang, et al. 2024. “A Survey on Mixture of Experts.” arXiv.org.
Du, Huang, Dai, et al. 2022. “GLaM: Efficient Scaling of Language Models with Mixture-of-Experts.” In Proceedings of the 39th International Conference on Machine Learning.
Eigen, Ranzato, and Sutskever. 2014. “Learning Factored Representations in a Deep Mixture of Experts.”
Hinton, Vinyals, and Dean. 2015. “Distilling the Knowledge in a Neural Network.” arXiv:1503.02531 [Cs, Stat].
Masegosa. 2020. “Learning Under Model Misspecification: Applications to Variational and Ensemble Methods.” In Proceedings of the 34th International Conference on Neural Information Processing Systems. NIPS’20.
Masoudnia, and Ebrahimpour. 2014. “Mixture of Experts: A Literature Survey.” Artificial Intelligence Review.
Shazeer, Mirhoseini, Maziarz, et al. 2017. “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.”
Waterhouse, MacKay, and Robinson. 1995. “Bayesian Methods for Mixtures of Experts.” In Advances in Neural Information Processing Systems.
Zeevi, Meir, and Maiorov. 1998. “Error Bounds for Functional Approximation and Estimation Using Mixtures of Experts.” IEEE Transactions on Information Theory.
Zhang, and Wang. 2022. “Deep Learning Meets Nonparametric Regression: Are Weight-Decayed DNNs Locally Adaptive?”
Zhou, Lei, Liu, et al. 2022. “Mixture-of-Experts with Expert Choice Routing.” Advances in Neural Information Processing Systems.