Neural net attention mechanisms

On brilliance through selective ignorance

2017-12-20 — 2022-08-05

Wherein the structure of transformer stacks and self‑attention layers is described, the role in processing sequential data such as text is examined, and recent optimizations such as FlashAttention are noted.

language
machine learning
neural nets
NLP
networks
Figure 1

Attention, self-attention… What are these things? I am no expert, so see some good blog posts explaining everything:

There is a lot of activity in a particular type of attention network, the transformer, which is a neural network architecture that is very good at processing tokenized data, such as text. The transformer is a stack of attention layers, and the attention mechanism is the key to its success.

1 Graph models of attention

2 Incoming

3 References

Bahdanau, Cho, and Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate.” In.
Bodnar, Bruinsma, Lucic, et al. 2024. Aurora: A Foundation Model of the Atmosphere.”
Cao. 2021. Choose a Transformer: Fourier or Galerkin.” In Advances in Neural Information Processing Systems.
Celikyilmaz, Deng, Li, et al. 2017. Scaffolding Networks for Teaching and Learning to Comprehend.” arXiv:1702.08653 [Cs].
Chen, Chen, Wan, et al. 2021. An Improved Data-Free Surrogate Model for Solving Partial Differential Equations Using Deep Neural Networks.” Scientific Reports.
Choy, Gwak, Savarese, et al. 2016. Universal Correspondence Network.” In Advances in Neural Information Processing Systems 29.
Guan, Wu, Zhao, et al. 2025. Attention Mechanisms Perspective: Exploring LLM Processing of Graph-Structured Data.”
Khatri, Laakkonen, Liu, et al. 2024. On the Anatomy of Attention.”
Kim, Mnih, Schwarz, et al. 2019. Attentive Neural Processes.”
Kuratov, Bulatov, Anokhin, et al. 2024. In Search of Needles in a 11M Haystack: Recurrent Memory Finds What LLMs Miss.”
Laban, Fabbri, Xiong, et al. 2024. Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems.” In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing.
Luong, Pham, and Manning. 2015. Effective Approaches to Attention-Based Neural Machine Translation.”
Modarressi, Deilamsalehy, Dernoncourt, et al. 2025. NoLiMa: Long-Context Evaluation Beyond Literal Matching.”
Ortega, Kunesch, Delétang, et al. 2021. Shaking the Foundations: Delusions in Sequence Models for Interaction and Control.” arXiv:2110.10819 [Cs].
Qin, Zhu, Qin, et al. 2019. Recurrent Attentive Neural Process for Sequential Data.”
Ramsauer, Schäfl, Lehner, et al. 2020. Hopfield Networks Is All You Need.” arXiv:2008.02217 [Cs, Stat].
Vaswani, Shazeer, Parmar, et al. 2017. Attention Is All You Need.” arXiv:1706.03762 [Cs].
Vasylenko, Treviso, and Martins. 2025. Long-Context Generalization with Sparse Attention.”
Veličković, Cucurull, Casanova, et al. 2018. Graph Attention Networks.”
Veličković, Perivolaropoulos, Barbero, et al. 2025. Softmax Is Not Enough (for Sharp Size Generalisation).” In.
Yang, and Hu. 2020. Feature Learning in Infinite-Width Neural Networks.” arXiv:2011.14522 [Cond-Mat].