Transformer networks

Baby’s first foundation model

2017-12-20 — 2025-04-01

AI safety

language

machine learning

meta learning

neural nets

NLP

stringology

time series


Transformers are big attention networks with some extra tricks: self-attention, and usually a positional encoding as well.
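For my own reference, here is a minimal NumPy sketch of those two ingredients: single-head scaled dot-product self-attention, fed with inputs that have a sinusoidal positional encoding added. It assumes the formulation in Vaswani et al. (2017); the function names, the toy sizes and the random weights are all made up for illustration, and real transformers add multiple heads, residual connections, layer normalisation and feed-forward layers on top.

```python
import numpy as np


def sinusoidal_positions(seq_len, d_model):
    """Sinusoidal positional encoding; assumes d_model is even."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]   # even feature indices 0, 2, 4, ...
    angles = pos / (10000.0 ** (dims / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even features get sines
    pe[:, 1::2] = np.cos(angles)               # odd features get cosines
    return pe


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over the rows of x."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])    # (seq_len, seq_len) similarities
    weights = softmax(scores, axis=-1)         # each row is a distribution
    return weights @ v, weights


# Hypothetical toy sizes and random weights, purely for illustration.
rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
tokens = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (
    rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(3)
)

x = tokens + sinusoidal_positions(seq_len, d_model)
out, weights = self_attention(x, w_q, w_k, w_v)
```

Every output row is a weighted average of the value vectors, with the weights computed from the inputs themselves, which is all that "self" means here.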

I am no expert. Here are some good blog posts explaining everything, for my reference, but I will not write yet another one. This is a fast-moving area and I am not keeping track of it, so if you are on this page looking for guidance you are already in trouble.

These networks are massive (heh) in natural language processing and a contender in the race to achieve generally superhuman capability.

A (the?) key feature of such networks seems to be that they can be made extremely large, and extremely capable, but still remain trainable. This leads to interesting scaling laws. The scaling dynamics have led to fascinating pricing and upended the bitter lesson dynamics.

1 Introductions

So many.

TODO: rank in terms of lay-person-friendliness.

You Could’ve Invented Transformers, by Gwern · Gwern.net
“Attention”, “Transformers”, in Neural Network “Large Language Models”: I quite like Cosma’s characteristically idiosyncratic way of learning and re-deriving things. I found his contrarian statistician’s take on the transformer architecture to be enlightening.
[2502.17814] An Overview of Large Language Models for Statisticians
3b1b: But what is a GPT? Visual intro to transformers
Compact precise definition of a transformer function – foreXiv
Understanding einsum for Deep learning: implement a transformer with multi-head self-attention from scratch | AI Summer
Jay Alammar’s Illustrated Transformer series is good.
Lilian Weng, Large Transformer Model Inference Optimization
Lilian Weng, The Transformer Family Version 2.0
nostalgebraist, An exciting new paper on neural language models
1A - Scaled Dot Product Attention explained
John Thickstun, The Transformer Model in Equations
Large language models, explained with a minimum of math and jargon
[1hr Talk] Intro to Large Language Models
Jeremy Howard on X: Here, in full directly on Twitter, is “A Hackers’ Guide to Language Models”. This 90 minute tutorial is designed to be the one place I point coders at when they ask “hey, tell me everything I need to know about LLMs!” It covers both OpenAI models and open source ones in depth.
Yannic Kilcher’s paper walkthrough is a good one.
Xavier Amatriain, Transformer models: an introduction and catalog — 2023 Edition
Noam Shazeer’s Shape Suffixes post implicitly makes the case that transformers are simply a confusing way of smashing tensors together.

2 Power of

Transformers are pretty good at weird stuff, e.g. automata — see Unveiling Transformers with LEGO (Y. Zhang et al. 2022).

How about Bayesian inference? (Müller et al. 2021)

Can they be an engine of intelligence? Controversial — see the Stochastic Parrots paper (Bender et al. 2021), and the entire internet commentariat from November 2022 onwards.

What do they do in society? Who knows, but see AI democratization and AI economics.

3 As set functions

Transformers are neural set functions (!).
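One way to read that claim, which is my assumption rather than anything spelled out here: without a positional encoding, a self-attention layer is permutation-equivariant, so shuffling the input rows just shuffles the output rows, and pooling over the outputs gives a permutation-invariant function of the input set. A small NumPy check, with made-up sizes and random weights:

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention with no positional information."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    return softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v


# Hypothetical toy problem: a "set" of 6 elements with 4 features each.
rng = np.random.default_rng(1)
n_items, d = 6, 4
x = rng.normal(size=(n_items, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(n_items)
out = self_attention(x, w_q, w_k, w_v)
out_shuffled = self_attention(x[perm], w_q, w_k, w_v)

# Equivariance: permuting the inputs permutes the outputs the same way.
is_equivariant = np.allclose(out_shuffled, out[perm])

# Invariance after pooling: the mean over elements ignores the ordering.
is_invariant = np.allclose(out_shuffled.mean(axis=0), out.mean(axis=0))
```

Adding a positional encoding to x before the attention layer breaks this symmetry, which is exactly why sequence models need one.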

4 As recurrent state

See Transformers and RNNs

5 For forecasting of non-linguistic material

See Transformers and numerical time series.

6 Practicalities

For you and me, see AI democratization.

7 Embedding vector databases

See embedding vector databases.

8 Incoming

Transformers.js
LMQL: Programming Large Language Models: “LMQL is a programming language for language model interaction.” (Beurer-Kellner, Fischer, and Vechev 2022)

LMQL generalises natural language prompting, making it more expressive while remaining accessible. For this, LMQL builds on top of Python, allowing users to express natural language prompts that also contain code. The resulting queries can be directly executed on language models like OpenAI’s GPT models. Fixed answer templates and intermediate instructions allow the user to steer the LLM’s reasoning process.
How does in-context learning work? A framework for understanding the differences from traditional supervised learning

TL;DR — In-context learning is a mysterious emergent behaviour in large language models (LMs) where the LM performs a task just by conditioning on input-output examples, without optimising any parameters. In this post, we provide a Bayesian inference framework for understanding in-context learning as “locating” latent concepts the LM has acquired from pretraining data. This suggests that all components of the prompt (inputs, outputs, formatting, and the input-output mapping) can provide information for inferring the latent concept. We connect this framework to empirical evidence where in-context learning still works when provided training examples with random outputs. While output randomisation cripples traditional supervised learning algorithms, it only removes one source of information for Bayesian inference (the input-output mapping).
What Are the Different Approaches for Detecting Content Generated by LLMs Such As ChatGPT? And How Do They Work and Differ?
Large Language Models as General Pattern Machines

We observe that pre-trained large language models (LLMs) are capable of autoregressively completing complex token sequences—from arbitrary ones procedurally generated by probabilistic context-free grammars (PCFG), to more rich spatial patterns found in the Abstract Reasoning Corpus (ARC), a general AI benchmark, prompted in the style of ASCII art. Surprisingly, pattern completion proficiency can be partially retained even when the sequences are expressed using tokens randomly sampled from the vocabulary. These results suggest that without any additional training, LLMs can serve as general sequence modellers, driven by in-context learning. In this work, we investigate how these zero-shot capabilities may be applied to problems in robotics—from extrapolating sequences of numbers that represent states over time to complete simple motions, to least-to-most prompting of reward-conditioned trajectories that can discover and represent closed-loop policies (e.g., a stabilising controller for CartPole). While difficult to deploy today for real systems due to latency, context size limitations, and compute costs, the approach of using LLMs to drive low-level control may provide an exciting glimpse into how the patterns among words could be transferred to actions.
karpathy/nanoGPT: The simplest, fastest repository for training/finetuning medium-sized GPTs.
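
The Bayesian reading of in-context learning quoted above (the prompt's examples "locate" a latent concept, and predictions average over the posterior) can be made concrete with a toy model. This is only a sketch under heavy assumptions: the three candidate concepts, the four-token vocabulary, the noise level and the uniform prior are all invented for illustration, and no actual language model is involved.

```python
import numpy as np

VOCAB = 4    # hypothetical vocabulary: the integers 0..3
NOISE = 0.1  # assumed chance that an output ignores the concept

# Hypothetical latent concepts, each a mapping from input token to output token.
CONCEPTS = {
    "identity": lambda x: x,
    "successor": lambda x: (x + 1) % VOCAB,
    "constant_zero": lambda x: 0,
}


def likelihood(y, x, name):
    """p(y | x, concept): mostly follow the concept, else a uniform error."""
    return 1 - NOISE if CONCEPTS[name](x) == y else NOISE / (VOCAB - 1)


def posterior(examples):
    """p(concept | prompt examples) under a uniform prior over concepts."""
    names = list(CONCEPTS)
    log_post = np.zeros(len(names))
    for x, y in examples:
        log_post += np.log([likelihood(y, x, n) for n in names])
    post = np.exp(log_post - log_post.max())
    return dict(zip(names, post / post.sum()))


def predictive(x_query, examples):
    """p(y | x_query, prompt): likelihoods averaged over the concept posterior."""
    post = posterior(examples)
    return np.array([
        sum(post[n] * likelihood(y, x_query, n) for n in post)
        for y in range(VOCAB)
    ])


prompt = [(1, 2), (3, 0), (2, 3)]  # input-output pairs given "in context"
post = posterior(prompt)
pred = predictive(0, prompt)
```

With this prompt nearly all the posterior mass lands on the successor concept, so the most probable completion for the query 0 is 1, and no parameters were optimised along the way.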

9 References

Bahdanau, Cho, and Bengio. 2015. “Neural Machine Translation by Jointly Learning to Align and Translate.”

Bender, Gebru, McMillan-Major, et al. 2021. “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜.” In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency.

Beurer-Kellner, Fischer, and Vechev. 2022. “Prompting Is Programming: A Query Language For Large Language Models.”

Brown, Mann, Ryder, et al. 2020. “Language Models Are Few-Shot Learners.” arXiv:2005.14165 [Cs].

Bubeck, Chandrasekaran, Eldan, et al. 2023. “Sparks of Artificial General Intelligence: Early Experiments with GPT-4.”

Cao. 2021. “Choose a Transformer: Fourier or Galerkin.” In Advances in Neural Information Processing Systems.

Celikyilmaz, Deng, Li, et al. 2017. “Scaffolding Networks for Teaching and Learning to Comprehend.” arXiv:1702.08653 [Cs].

Choy, Gwak, Savarese, et al. 2016. “Universal Correspondence Network.” In Advances in Neural Information Processing Systems 29.

Din, Karidi, Choshen, et al. 2023. “Jump to Conclusions: Short-Cutting Transformers With Linear Transformations.”

Dziri, Lu, Sclar, et al. 2023. “Faith and Fate: Limits of Transformers on Compositionality.”

Ergen, Neyshabur, and Mehta. 2022. “Convexifying Transformers: Improving Optimization and Understanding of Transformer Networks.”

Freeman. 2019. “How to Communicate Evidence to Patients.” Drug and Therapeutics Bulletin.

Gloeckler, Deistler, Weilbach, et al. 2024. “All-in-One Simulation-Based Inference.”

Huang, Vaswani, Uszkoreit, et al. 2018. “Music Transformer: Generating Music with Long-Term Structure.”

Katharopoulos, Vyas, Pappas, et al. 2020. “Transformers Are RNNs: Fast Autoregressive Transformers with Linear Attention.” arXiv:2006.16236 [Cs, Stat].

Kleinberg and Mullainathan. 2024. “Language Generation in the Limit.”

Korbak, Perez, and Buckley. 2022. “RL with KL Penalties Is Better Viewed as Bayesian Inference.”

Li, Wallace, Shen, et al. 2020. “Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers.” arXiv:2002.11794 [Cs].

Merrill and Sabharwal. 2022. “Transformers Implement First-Order Logic with Majority Quantifiers.”

Müller, Hollmann, Arango, et al. 2021. “Transformers Can Do Bayesian Inference.”

Nguyen, Brandstetter, Kapoor, et al. 2023. “ClimaX: A Foundation Model for Weather and Climate.”

Ortega, Kunesch, Delétang, et al. 2021. “Shaking the Foundations: Delusions in Sequence Models for Interaction and Control.” arXiv:2110.10819 [Cs].

Peng, Narayanan, and Papadimitriou. 2024. “On Limitations of the Transformer Architecture.”

Phuong and Hutter. 2022. “Formal Algorithms for Transformers.”

Piantadosi and Hill. 2022. “Meaning Without Reference in Large Language Models.”

Radford, Wu, Child, et al. 2019. “Language Models Are Unsupervised Multitask Learners.”

Rafailov, Sharma, Mitchell, et al. 2023. “Direct Preference Optimization: Your Language Model Is Secretly a Reward Model.”

Ramsauer, Schäfl, Lehner, et al. 2020. “Hopfield Networks Is All You Need.” arXiv:2008.02217 [Cs, Stat].

Serrano, Brumbaugh, and Smith. 2023. “Language Models: A Guide for the Perplexed.”

Shai, Marzen, Teixeira, et al. 2025. “Transformers Represent Belief State Geometry in Their Residual Stream.”

Tan, Merrill, Gupta, et al. 2024. “Are Language Models Actually Useful for Time Series Forecasting?”

Vardasbi, Pires, Schmidt, et al. 2023. “State Spaces Aren’t Enough: Machine Translation Needs Attention.”

Vaswani, Shazeer, Parmar, et al. 2017. “Attention Is All You Need.” arXiv:1706.03762 [Cs].

Wang, Gangavarapu, Yan, et al. 2024. “MambaByte: Token-Free Selective State Space Model.”

Willig, Zečević, Dhami, et al. 2022. “Can Foundation Models Talk Causality?”

Wu, Tan, Wang, et al. 2024. “Beyond Language Models: Byte Models Are Digital World Simulators.”

Yang and Hu. 2020. “Feature Learning in Infinite-Width Neural Networks.” arXiv:2011.14522 [Cond-Mat].

Yu, Xu, Weston, et al. 2024. “Distilling System 2 into System 1.”

Zekri, Odonnat, Benechehab, et al. 2025. “Large Language Models as Markov Chains.”

Zhai, Talbott, Srivastava, et al. 2021. “An Attention Free Transformer.”

Zhang, Yi, Backurs, Bubeck, et al. 2022. “Unveiling Transformers with LEGO: A Synthetic Reasoning Task.”

Zhang, Lunjun, Hosseini, Bansal, et al. 2024. “Generative Verifiers: Reward Modeling as Next-Token Prediction.”

Zhang, Edwin, Zhu, Saphra, et al. 2024. “Transcendence: Generative Models Can Outperform The Experts That Train Them.”

Zhao, Brekelmans, Makhzani, et al. 2024. “Probabilistic Inference in Language Models via Twisted Sequential Monte Carlo.” In Proceedings of the 41st International Conference on Machine Learning.

Zhou, Jiang, Cui, et al. 2023. “RecurrentGPT: Interactive Generation of (Arbitrarily) Long Text.”

Zou, Wang, Kolter, et al. 2023. “Universal and Transferable Adversarial Attacks on Aligned Language Models.”