Transformer networks

The transformer-powered subtitle recommendation for this article was “Our most terrifyingly effective weapon against the forces of evil is our ability to laugh at them.”

Transformers are big attention networks with some extra tricks. I am no expert. Here are some good blog posts explaining everything, for my reference, but I will not write yet another one. This is a fast-moving area and I am not keeping track of it, so if you are on this page looking for guidance you are already in trouble.
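For orientation only, here is a minimal sketch of the "attention" part: plain scaled dot-product attention in pure Python. The function names (`softmax`, `attention`) and the toy inputs are my own illustration, not any library's API, and the "extra tricks" (learned projections, multiple heads, masking, positional encodings, layer norm) are deliberately left out.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors.

    Each output row is a weighted average of the value vectors, with
    weights softmax(q . k / sqrt(d)) taken over all keys.
    """
    d = len(keys[0])  # key/query dimension, used for the scaling factor
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in keys
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Toy self-attention: three 2-dimensional "tokens" attend to each other.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(tokens, tokens, tokens)
```

Every output row is a convex combination of the input rows, so a token's new representation is a mixture of all the tokens it attends to. The nested loops make the quadratic cost in sequence length easy to see.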

These networks are massive (heh) in natural language processing right now.

A key point about these networks seems to be that they can be made extremely large but still remain trainable. This leads to interesting scaling laws.
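As a reminder of what such a law looks like: the empirical fits reported for language models are usually power laws. A sketch of the generic form, in my own notation ($N$ for parameter count, $L$ for test loss, and fitted constants $N_c$ and $\alpha$, none of which come from the text above):

$$
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha}, \qquad \alpha > 0
$$

On a log-log plot this is a straight line with slope $-\alpha$, which is why these results are usually shown that way. Analogous fits are made against dataset size and training compute.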

For a good paper walkthrough, see Yannic Kilcher’s.

Power of

Transformers are pretty good at weird stuff, e.g. automata — see Unveiling Transformers with LEGO (Zhang et al. 2022).

How about Bayesian inference? (Müller et al. 2022)


Democratizing the hardware side of large language models seems to be an advertisement for some new hardware, but there is interesting background in there.

HuggingFace distributes, documents, and implements a lot of Transformer/attention NLP models, and seems to be the most active neural NLP project. Certainly too active to explain what they are up to in between pumping out all the code.

The library currently contains PyTorch and TensorFlow implementations, pre-trained model weights, usage scripts and conversion utilities for the following models:

  1. BERT (from Google) released with the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.
  2. GPT (from OpenAI) released with the paper Improving Language Understanding by Generative Pre-Training by Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever.
  3. GPT-2 (from OpenAI) released with the paper Language Models are Unsupervised Multitask Learners by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever.
  4. Transformer-XL (from Google/CMU) released with the paper Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context by Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov.
  5. XLNet (from Google/CMU) released with the paper XLNet: Generalized Autoregressive Pretraining for Language Understanding by Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le.
  6. XLM (from Facebook) released together with the paper Cross-lingual Language Model Pretraining by Guillaume Lample and Alexis Conneau.
  7. RoBERTa (from Facebook), released together with the paper RoBERTa: A Robustly Optimized BERT Pretraining Approach by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov.
  8. DistilBERT (from HuggingFace) released together with the paper DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter by Victor Sanh, Lysandre Debut, and Thomas Wolf. The same method has been applied to compress GPT2 into DistilGPT2.
  9. [very long list excised]


The GPT-Neo project (from EleutherAI) describes itself as follows:

“GPT-Neo is the code name for a series of transformer-based language models loosely styled around the GPT architecture that we plan to train and open source. Our primary goal is to replicate a GPT-3 sized model and open source it to the public, for free.

Along the way we will be running experiments with alternative architectures and attention types, releasing any intermediate models, and writing up any findings on our blog.”

It is unclear if they will release the actual weights, but you can use a miniature GPT-alike at contentyze.

UPDATE: Is GPT-J-6B (“6B JAX-Based Transformer”) in the same family? That one seems to be open and available. For background on it and the other open options, see Alberto Romero, Can’t Access GPT-3? Here’s GPT-J, Its Open-Source Cousin.

This guide to pruning multi-head attention networks should probably go somewhere useful if I actually end up doing NLP like all the recruiters seem to want.


Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2015. “Neural Machine Translation by Jointly Learning to Align and Translate.” In arXiv:1409.0473 [Cs, Stat].
Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 2020. “Language Models Are Few-Shot Learners.” arXiv:2005.14165 [Cs], June.
Celikyilmaz, Asli, Li Deng, Lihong Li, and Chong Wang. 2017. “Scaffolding Networks for Teaching and Learning to Comprehend.” arXiv:1702.08653 [Cs], February.
Choy, Christopher B, JunYoung Gwak, Silvio Savarese, and Manmohan Chandraker. 2016. “Universal Correspondence Network.” In Advances in Neural Information Processing Systems 29, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 2406–14. Curran Associates, Inc.
Freeman, Alexandra L J. 2019. “How to Communicate Evidence to Patients.” Drug and Therapeutics Bulletin 57 (8): 119–24.
Huang, Cheng-Zhi Anna, Ashish Vaswani, Jakob Uszkoreit, Ian Simon, Curtis Hawthorne, Noam Shazeer, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu, and Douglas Eck. 2018. “Music Transformer: Generating Music with Long-Term Structure,” September.
Katharopoulos, Angelos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. 2020. “Transformers Are RNNs: Fast Autoregressive Transformers with Linear Attention.” arXiv:2006.16236 [Cs, Stat], August.
Li, Zhuohan, Eric Wallace, Sheng Shen, Kevin Lin, Kurt Keutzer, Dan Klein, and Joseph E. Gonzalez. 2020. “Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers.” arXiv:2002.11794 [Cs], February.
Merrill, William, and Ashish Sabharwal. 2022. “Transformers Implement First-Order Logic with Majority Quantifiers.” arXiv.
Müller, Samuel, Noah Hollmann, Sebastian Pineda Arango, Josif Grabocka, and Frank Hutter. 2022. “Transformers Can Do Bayesian Inference.” arXiv.
Ortega, Pedro A., Markus Kunesch, Grégoire Delétang, Tim Genewein, Jordi Grau-Moya, Joel Veness, Jonas Buchli, et al. 2021. “Shaking the Foundations: Delusions in Sequence Models for Interaction and Control.” arXiv:2110.10819 [Cs], October.
Phuong, Mary, and Marcus Hutter. 2022. “Formal Algorithms for Transformers.” arXiv.
Piantadosi, Steven T., and Felix Hill. 2022. “Meaning Without Reference in Large Language Models.” arXiv.
Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. “Language Models Are Unsupervised Multitask Learners,” 24.
Ramsauer, Hubert, Bernhard Schäfl, Johannes Lehner, Philipp Seidl, Michael Widrich, Lukas Gruber, Markus Holzleitner, et al. 2020. “Hopfield Networks Is All You Need.” arXiv:2008.02217 [Cs, Stat], July.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” arXiv:1706.03762 [Cs], June.
Yang, Greg, and Edward J. Hu. 2020. “Feature Learning in Infinite-Width Neural Networks.” arXiv:2011.14522 [Cond-Mat], November.
Zhang, Yi, Arturs Backurs, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, and Tal Wagner. 2022. “Unveiling Transformers with LEGO: A Synthetic Reasoning Task.” arXiv.
