Transformer networks

The transformer-powered subtitle for this article is “Our most terrifyingly effective weapon against the forces of evil is our ability to laugh at them.”

Well, it’s really terribly simple, […] it works any way you want it to. You see, the computer that runs it is a rather advanced one. In fact it is more powerful than the sum total of all the computers on this planet including—and this is the tricky part—including itself.

— Douglas Adams, Dirk Gently’s Holistic Detective Agency

Transformers are big attention networks with some extra tricks: self-attention, a query-key-value mechanism, and usually a positional encoding as well.
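For my own reference, here is a minimal sketch of the query, key, value mechanism as single-head scaled dot-product self-attention. It uses plain Python lists to stay dependency-free. The projection matrices `w_q`, `w_k`, `w_v`, the helper names and the toy dimensions are all assumptions for illustration, not anyone's production code, and positional encoding is left out.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x has one row per token; w_q, w_k, w_v are hypothetical projection matrices.
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(k[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    # scores[i][j] measures how strongly token i attends to token j
    scores = matmul(q, k_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    # each output row is a weighted average of the value vectors
    return matmul(weights, v), weights

def random_matrix(rows, cols, rng):
    """Gaussian random matrix, standing in for learned weights or embeddings."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n_tokens, d_model, d_head = 4, 8, 3
x = random_matrix(n_tokens, d_model, rng)
w_q, w_k, w_v = (random_matrix(d_model, d_head, rng) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)
```

Each row of `weights` is a probability distribution over the tokens, and `out` has one `d_head`-dimensional vector per token.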

I am no expert. Here are some good blog posts explaining everything, for my reference, but I will not write yet another one. This is a fast-moving area and I am not keeping track of it, so if you are on this page looking for guidance you are already in trouble.

  • Phuong and Hutter (2022)

    Transformers are deep feed-forward artificial neural networks with a (self)attention mechanism. They have been tremendously successful in natural language processing tasks and other domains. Since their inception 5 years ago, many variants have been suggested. Descriptions are usually graphical, verbal, partial, or incremental. Despite their popularity, it seems no pseudocode has ever been published for any variant. […] This report intends to rectify the situation for Transformers. It aims to be a self-contained, complete, precise and compact overview of transformer architectures and formal algorithms (but not results)

These networks are massive (heh) in natural language processing right now.

A key point about such networks seems to be that they can be made extremely large but still remain trainable. This leads to interesting scaling laws.

Power of

Transformers are pretty good at weird stuff, e.g. automata — see Unveiling Transformers with LEGO (Zhang et al. 2022).

How about Bayesian inference? (Müller et al. 2022)

Can they be an engine of intelligence? What do they do in society? etc. Controversial — see the Stochastic Parrots paper (Bender et al. 2021), and the entire internet commentariat from November 2022 onwards.

Image by Anthrupad

As set functions

Transformers are neural set functions (!).
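One way to read that claim: without a positional encoding, self-attention cannot see the order of its inputs. Shuffling the tokens shuffles the outputs the same way (permutation equivariance), and pooling over the outputs gives something that ignores order entirely. The sketch below checks this numerically on a toy example. It assumes identity projections (queries, keys and values are all the raw tokens) purely to keep it short.

```python
import math

def attend(tokens):
    """Self-attention with identity projections (Q = K = V = tokens).

    The identity projections are an assumption made only for brevity.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        # weighted average of all tokens, with softmax weights exps / z
        out.append([sum(e / z * v[j] for e, v in zip(exps, tokens)) for j in range(d)])
    return out

def mean_pool(rows):
    """Average the rows, giving one vector for the whole set."""
    return [sum(col) / len(rows) for col in zip(*rows)]

tokens = [[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0]]
perm = [2, 0, 1]
shuffled = [tokens[i] for i in perm]

out = attend(tokens)
out_shuffled = attend(shuffled)

# Equivariance: row i of the shuffled output matches row perm[i] of the original.
equivariant = all(
    math.isclose(a, b, abs_tol=1e-9)
    for i in range(len(perm))
    for a, b in zip(out_shuffled[i], out[perm[i]])
)
# Invariance after pooling: the pooled vector does not depend on token order.
invariant = all(
    math.isclose(a, b, abs_tol=1e-9)
    for a, b in zip(mean_pool(out), mean_pool(out_shuffled))
)
```

Both `equivariant` and `invariant` should come out `True` here. Adding a position-dependent vector to each token before attending would break the symmetry, which is presumably why positional encodings are usually bolted on.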


State-space, i.e. recurrent transformers, without (classic) attention. Suggestive connection to S4 models.

RWKV is inspired by Apple’s Attention Free Transformer (Zhai et al. 2021). …

How to combine the best of transformers and RNNs? The main drawback of transformer-based models is that it can become challenging to run a model with a context window that is larger than a certain value, as the attention scores are computed simultaneously for the entire sequence.

RNNs natively support very long context lengths—only limited by the context length seen in training, but this can be extended to millions of tokens with careful coding. Currently, there are RWKV models trained on a context length of 8192 (ctx8192) and they are as fast as ctx1024 models and require the same amount of RAM.

The major drawbacks of traditional RNN models and how RWKV is different:

  1. Traditional RNN models are unable to utilize very long contexts (LSTM can only manage ~100 tokens when used as a LM). However, RWKV can utilize thousands of tokens and beyond…
  2. Traditional RNN models cannot be parallelized when training. RWKV is similar to a “linearized GPT” and it trains faster than GPT.

By combining both advantages into a single architecture, the hope is that RWKV can grow to become more than the sum of its parts.
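To see why a “linearized” attention can be run as an RNN, here is a sketch of the causal linear attention of Katharopoulos et al. (2020). This is not RWKV itself, whose update rule differs; it is the simplest example I know of the same trick. The feature map `phi` (elu plus one), the function names and the toy sizes are assumptions for illustration.

```python
import math
import random

def phi(x):
    """Positive feature map elu(x) + 1, applied elementwise."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def causal_linear_attention_parallel(qs, ks, vs):
    """Attention-style form: each position explicitly revisits all earlier ones.

    Positions are independent of each other, so this form can be batched in training.
    """
    outs = []
    for i, q in enumerate(qs):
        fq = phi(q)
        w = [sum(a * b for a, b in zip(fq, phi(ks[j]))) for j in range(i + 1)]
        total = sum(w)
        outs.append([sum(w[j] * vs[j][c] for j in range(i + 1)) / total
                     for c in range(len(vs[0]))])
    return outs

def causal_linear_attention_recurrent(qs, ks, vs):
    """RNN-style form: a fixed-size state (s, z) is updated once per token."""
    d_k, d_v = len(ks[0]), len(vs[0])
    s = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) v^T
    z = [0.0] * d_k                        # running sum of phi(k)
    outs = []
    for q, k, v in zip(qs, ks, vs):
        fk = phi(k)
        for a in range(d_k):
            z[a] += fk[a]
            for c in range(d_v):
                s[a][c] += fk[a] * v[c]
        fq = phi(q)
        denom = sum(fq[a] * z[a] for a in range(d_k))
        outs.append([sum(fq[a] * s[a][c] for a in range(d_k)) / denom
                     for c in range(d_v)])
    return outs

rng = random.Random(0)
n, d_k, d_v = 6, 4, 3
qs = [[rng.gauss(0.0, 1.0) for _ in range(d_k)] for _ in range(n)]
ks = [[rng.gauss(0.0, 1.0) for _ in range(d_k)] for _ in range(n)]
vs = [[rng.gauss(0.0, 1.0) for _ in range(d_v)] for _ in range(n)]

par = causal_linear_attention_parallel(qs, ks, vs)
rec = causal_linear_attention_recurrent(qs, ks, vs)
max_gap = max(abs(a - b) for p, r in zip(par, rec) for a, b in zip(p, r))
```

The two forms agree up to floating-point rounding, so `max_gap` should be tiny. The recurrent form carries a state whose size does not grow with the sequence, which is the property that lets memory use stay flat as the context gets longer.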

Tokens as recurrent state

See also RecurrentGPT (Zhou et al. 2023)

GitHub - aiwaves-cn/RecurrentGPT

RecurrentGPT replaces the vectorized elements (i.e., cell state, hidden state, input, and output) in a Long-short Term Memory RNN (LSTM) with natural language (i.e., paragraphs of texts), and simulates the recurrence mechanism with prompt engineering.

At each timestep t, RecurrentGPT receives a paragraph of text and a brief plan of the next paragraph, which are both generated in step t − 1. It then attends to the long-term memory, which contains the summaries of all previously generated paragraphs and can be stored on hard drives, and relevant paragraphs can be retrieved with semantic search.

RecurrentGPT also maintains a short-term memory that summarizes key information within recent timesteps in natural language and is updated at each time step. RecurrentGPT combines all aforementioned inputs in a prompt and asks the backbone LLM to generate a new paragraph, a short plan for the next paragraph, and updates the long-short term memory by rewriting the short-term memory and appending the summary of the output paragraph to the long-term memory.
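The recurrence described above can be sketched as a loop over a text-valued state. Everything here is hypothetical scaffolding: `stub_llm` stands in for the backbone LLM, `retrieve` replaces semantic search with crude word overlap, and the prompt layout is invented. None of it is taken from the RecurrentGPT code base.

```python
from dataclasses import dataclass, field

@dataclass
class State:
    paragraph: str                                  # paragraph produced at step t - 1
    plan: str                                       # brief plan for the next paragraph
    short_term: str = ""                            # running summary of recent steps
    long_term: list = field(default_factory=list)   # summaries of all past paragraphs

def retrieve(long_term, query, k=2):
    """Stand-in for semantic search: rank stored summaries by shared words."""
    words = set(query.lower().split())
    ranked = sorted(long_term,
                    key=lambda s: len(words & set(s.lower().split())),
                    reverse=True)
    return ranked[:k]

def stub_llm(prompt):
    """Hypothetical backbone LLM; a real one would generate text from the prompt."""
    n = len(prompt.splitlines())
    return {
        "paragraph": f"Paragraph written from a {n}-line prompt.",
        "plan": "Continue the story.",
        "short_term": f"Memory rewritten after a {n}-line prompt.",
        "summary": f"Summary of a paragraph from a {n}-line prompt.",
    }

def step(state, llm=stub_llm):
    """One timestep: build a prompt from the state, query the LLM, update the state."""
    recalled = retrieve(state.long_term, state.plan)
    prompt = "\n".join([
        "Previous paragraph: " + state.paragraph,
        "Plan: " + state.plan,
        "Short-term memory: " + state.short_term,
        "Relevant long-term memory: " + " | ".join(recalled),
    ])
    reply = llm(prompt)
    return State(
        paragraph=reply["paragraph"],
        plan=reply["plan"],
        short_term=reply["short_term"],                 # short-term memory is rewritten
        long_term=state.long_term + [reply["summary"]], # long-term memory is appended to
    )

state = State(paragraph="Once upon a time.", plan="Introduce the hero.")
for _ in range(3):
    state = step(state)
```

After three steps the long-term memory holds three summaries while the short-term memory has simply been overwritten each time, mirroring the LSTM analogy in the quoted description.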


Democratizing the hardware side of large language models seems to be an advertisement for some new hardware, but there is interesting background in there.

HuggingFace distributes, documents, and implements a lot of Transformer/attention NLP models and seems to be the most active neural NLP project. Certainly too active to explain what they are up to in between pumping out all the code.

Embedding vector databases

See embedding vector databases.

LMQL generalizes natural language prompting, making it more expressive while remaining accessible. For this, LMQL builds on top of Python, allowing users to express natural language prompts that also contain code. The resulting queries can be directly executed on language models like OpenAI’s GPT models. Fixed answer templates and intermediate instructions allow the user to steer the LLM’s reasoning process.


Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2015. “Neural Machine Translation by Jointly Learning to Align and Translate.” arXiv.
Bender, Emily M., Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜.” In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–23. Virtual Event Canada: ACM.
Beurer-Kellner, Luca, Marc Fischer, and Martin Vechev. 2022. “Prompting Is Programming: A Query Language For Large Language Models.” arXiv.
Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 2020. “Language Models Are Few-Shot Learners.” arXiv:2005.14165 [Cs], June.
Bubeck, Sébastien, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, et al. 2023. “Sparks of Artificial General Intelligence: Early Experiments with GPT-4.” arXiv.
Cao, Shuhao. 2021. “Choose a Transformer: Fourier or Galerkin.” In Advances in Neural Information Processing Systems, 34:24924–40. Curran Associates, Inc.
Celikyilmaz, Asli, Li Deng, Lihong Li, and Chong Wang. 2017. “Scaffolding Networks for Teaching and Learning to Comprehend.” arXiv:1702.08653 [Cs], February.
Choy, Christopher B, JunYoung Gwak, Silvio Savarese, and Manmohan Chandraker. 2016. “Universal Correspondence Network.” In Advances in Neural Information Processing Systems 29, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 2406–14. Curran Associates, Inc.
Din, Alexander Yom, Taelin Karidi, Leshem Choshen, and Mor Geva. 2023. “Jump to Conclusions: Short-Cutting Transformers With Linear Transformations.”
Ergen, Tolga, Behnam Neyshabur, and Harsh Mehta. 2022. “Convexifying Transformers: Improving Optimization and Understanding of Transformer Networks.” arXiv.
Freeman, Alexandra L J. 2019. “How to Communicate Evidence to Patients.” Drug and Therapeutics Bulletin 57 (8): 119–24.
Huang, Cheng-Zhi Anna, Ashish Vaswani, Jakob Uszkoreit, Ian Simon, Curtis Hawthorne, Noam Shazeer, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu, and Douglas Eck. 2018. “Music Transformer: Generating Music with Long-Term Structure,” September.
Katharopoulos, Angelos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. 2020. “Transformers Are RNNs: Fast Autoregressive Transformers with Linear Attention.” arXiv:2006.16236 [Cs, Stat], August.
Li, Zhuohan, Eric Wallace, Sheng Shen, Kevin Lin, Kurt Keutzer, Dan Klein, and Joseph E. Gonzalez. 2020. “Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers.” arXiv:2002.11794 [Cs], February.
Merrill, William, and Ashish Sabharwal. 2022. “Transformers Implement First-Order Logic with Majority Quantifiers.” arXiv.
Müller, Samuel, Noah Hollmann, Sebastian Pineda Arango, Josif Grabocka, and Frank Hutter. 2022. “Transformers Can Do Bayesian Inference.” arXiv.
Nguyen, Tung, Johannes Brandstetter, Ashish Kapoor, Jayesh K. Gupta, and Aditya Grover. 2023. “ClimaX: A Foundation Model for Weather and Climate.” arXiv.
Ortega, Pedro A., Markus Kunesch, Grégoire Delétang, Tim Genewein, Jordi Grau-Moya, Joel Veness, Jonas Buchli, et al. 2021. “Shaking the Foundations: Delusions in Sequence Models for Interaction and Control.” arXiv:2110.10819 [Cs], October.
Phuong, Mary, and Marcus Hutter. 2022. “Formal Algorithms for Transformers.” arXiv.
Piantadosi, Steven T., and Felix Hill. 2022. “Meaning Without Reference in Large Language Models.” arXiv.
Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. “Language Models Are Unsupervised Multitask Learners,” 24.
Ramsauer, Hubert, Bernhard Schäfl, Johannes Lehner, Philipp Seidl, Michael Widrich, Lukas Gruber, Markus Holzleitner, et al. 2020. “Hopfield Networks Is All You Need.” arXiv:2008.02217 [Cs, Stat], July.
Vardasbi, Ali, Telmo Pessoa Pires, Robin M. Schmidt, and Stephan Peitz. 2023. “State Spaces Aren’t Enough: Machine Translation Needs Attention.” arXiv.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” arXiv:1706.03762 [Cs], June.
Willig, Moritz, Matej Zečević, Devendra Singh Dhami, and Kristian Kersting. 2022. “Can Foundation Models Talk Causality?” arXiv.
Yang, Greg, and Edward J. Hu. 2020. “Feature Learning in Infinite-Width Neural Networks.” arXiv:2011.14522 [Cond-Mat], November.
Zhai, Shuangfei, Walter Talbott, Nitish Srivastava, Chen Huang, Hanlin Goh, Ruixiang Zhang, and Josh Susskind. 2021. “An Attention Free Transformer.” arXiv.
Zhang, Yi, Arturs Backurs, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, and Tal Wagner. 2022. “Unveiling Transformers with LEGO: A Synthetic Reasoning Task.” arXiv.
Zhou, Wangchunshu, Yuchen Eleanor Jiang, Peng Cui, Tiannan Wang, Zhenxin Xiao, Yifan Hou, Ryan Cotterell, and Mrinmaya Sachan. 2023. “RecurrentGPT: Interactive Generation of (Arbitrarily) Long Text.” arXiv.
Zou, Andy, Zifan Wang, J. Zico Kolter, and Matt Fredrikson. 2023. “Universal and Transferable Adversarial Attacks on Aligned Language Models.” arXiv.
