Operationalising the bitter lessons in compute and cleverness
Amortising the cost of being smart
2022-01-14 — 2026-04-21
Wherein existing formalisms — among them rational metareasoning and resource rationality — are surveyed as partial accounts of how training-time and inference-time compute may be substituted.
What to compute, and when, to make inferences about the world most efficiently.
A lot of ML problems are, in hindsight, implicitly about when to spend our compute budget. Scaling curves tell us we can spend a predictably large amount of compute at training time and amortise it over cheap inference. Sutton's folk-wisdom bitter lesson says that investing in compute tends to outperform investing in clever algorithms. The economics of AI/labour substitution asks in what sense compute is a kind of labour. I think it's useful to group the normative question of doing this well under the heading of when to compute.
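To fix intuitions, a toy accounting identity (the symbols are mine, not from any scaling-law paper): a model trained once at cost $C_{\text{train}}$ and queried $N$ times at marginal cost $C_{\text{inf}}$ has per-query cost

```latex
% Toy amortisation accounting; all symbols illustrative.
\[
  c(N) \;=\; \frac{C_{\text{train}}}{N} + C_{\text{inf}},
  \qquad
  c(N) \to C_{\text{inf}} \ \text{as}\ N \to \infty .
\]
```

Training-time cleverness pays exactly when the capability it buys is consumed often enough to drive the first term toward zero.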
There are, it turns out, several existing research programmes that formalise this intuition (more than I initially realised, AI-research-assistants-be-praised). I do not think any of them yet adds up to the theory I want, but they get further than "some unvarnished thoughts," so I should acknowledge them before going further.
1 Intelligence as resource scarcity
Pei Wang’s definition of intelligence in his NARS programme (Wang 2022, 1999) proposes that intelligence is the ability of an information-processing system to adapt to its environment while working with insufficient knowledge and resources. The “Assumption of Insufficient Knowledge and Resources” (AIKR) is foundational to his notion of intelligence. If we had sufficient knowledge and compute, we wouldn’t need intelligence — we’d just look up the answer. Intelligence is the phenomenon that arises from doing inference under resource scarcity.
Compare and contrast with, for example, Hutter's AIXI, which defines an ideal agent: one that considers every computable hypothesis, weighted by Kolmogorov complexity, and picks the action maximising expected future reward. It is a beautiful formalisation of what intelligence would be if resources were unlimited, and it is incomputable by construction. The practical response has been to approximate: AIXItl bounds the computation time and program length, trading optimality for tractability. But the approximation is bolted on after the fact; the theory itself lives in a world without scarcity.
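Schematically, and glossing over the interaction-history bookkeeping in Hutter's actual definition: AIXI mixes over all programs $p$ for a universal machine $U$, weighted by description length, and acts to maximise expected return under that mixture,

```latex
% Simplified sketch of AIXI; the full definition carries explicit
% observation/reward histories that are suppressed here.
\[
  \xi(x) \;=\; \sum_{p \,:\, U(p) = x\ast} 2^{-\ell(p)},
  \qquad
  a_t \;=\; \operatorname*{arg\,max}_{a} \;
  \mathbb{E}_{\xi}\!\Big[\textstyle\sum_{k=t}^{m} r_k \,\Big|\, \text{history},\, a\Big].
\]
```

Evaluating $\xi$ runs straight into the halting problem, which is where the incomputability bites; AIXItl's bounds on $\ell(p)$ and runtime are the bolt-on.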
Wang starts from the other end. AIKR says: we never have enough knowledge, we never have enough time, and the problems arrive without warning. Intelligence isn’t what remains after we bound an ideal agent; it’s what emerges because the agent was always bounded. An AIXI-derived agent degrades from perfection. An AIKR agent is designed from the ground up to satisfice — distributing its time–space budget across competing tasks according to their relative priority, where no solution is ever final.1
AIKR is still too nebulous for my tastes. It tells us intelligence is the resource-allocation problem, but doesn’t give us the production functions or substitution curves between, say, training compute and inference compute. For that we need something more economic.
2 Rational metareasoning and resource rationality
Here are some things I hadn’t heard about until 2 hours ago!
Stuart Russell and Eric Wefald's rational metareasoning (Russell and Wefald 1991) models each computation step as a decision in its own right: perform it only if the expected improvement it brings to the eventual object-level decision exceeds its cost. Intelligence, in this framing, is choosing which inferences to run, literally "when to compute" as a decision theory. Hay, Russell, Tolpin and Shimony (Hay et al. 2012) later operationalised this as a meta-level MDP, making the framework tractable enough to implement, apparently. I would want to get my hands dirty with this before I felt comfortable claiming to understand it.
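A toy rendering of the flavour of that loop, under assumptions that are entirely mine (two candidate actions, Gaussian evaluation noise, a fixed price per simulation, a myopic value-of-computation rule); this is a sketch, not Russell and Wefald's algorithm nor the Hay et al. meta-level MDP:

```python
"""Toy rational metareasoning: is another simulation worth its cost?"""
import numpy as np

rng = np.random.default_rng(0)
true_values = np.array([1.0, 1.2])  # unknown to the agent
noise_sd = 1.0                      # sd of one simulated evaluation
cost_per_sim = 0.005                # invented price of one computation

# Seed each action with a couple of noisy evaluations.
samples = [list(true_values[a] + noise_sd * rng.standard_normal(2)) for a in (0, 1)]

while True:
    means = np.array([np.mean(s) for s in samples])
    sems = np.array([noise_sd / np.sqrt(len(s)) for s in samples])
    best = int(means.argmax())
    runner = 1 - best
    # Myopic value of computation: how much we expect the chosen action's
    # value to improve if another look at the runner-up flips the ranking.
    draws = means[runner] + sems[runner] * rng.standard_normal(10_000)
    voc = float(np.mean(np.maximum(draws - means[best], 0.0)))
    if voc < cost_per_sim:          # thinking stopped paying for itself
        break
    samples[runner].append(float(true_values[runner] + noise_sd * rng.standard_normal()))

n_sims = sum(len(s) for s in samples)
print(f"committed to action {best} after {n_sims} simulations")
```

The meta-level decision ("simulate again, or commit?") is taken by the same expected-value calculus as the object-level one, which is the whole point of the framework.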
Griffiths, Lieder and Icard's resource-rational analysis (Lieder and Griffiths 2020) takes a different angle, from cognitive science: human cognitive biases aren't irrational; they're optimal given finite compute. The brain is doing the best it can with the budget it has. This turns the bitter lesson from an empirical observation into a normative theory: we should expect any intelligent system to use heuristics, caching, and shortcuts, because those are the resource-rational thing to do. Zilberstein's anytime algorithms (Zilberstein 1996) look like one way to operationalise all this: algorithms that return progressively better answers the more time we give them, with the agent deciding when to stop.
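And a toy anytime algorithm with a monitor attached, again with invented constants: the object level estimates π by Monte Carlo; the meta level stops when the marginal accuracy bought by the next batch falls below its (made-up) time price.

```python
"""Toy anytime algorithm with a stopping monitor (sketch, assumptions mine)."""
import numpy as np

rng = np.random.default_rng(1)
batch = 10_000
time_cost_per_batch = 2e-5   # invented exchange rate: time vs. accuracy

inside = 0
n = 0
while True:
    pts = rng.random((batch, 2))
    inside += int(np.sum(pts[:, 0] ** 2 + pts[:, 1] ** 2 < 1.0))
    n += batch
    est = 4 * inside / n
    p = est / 4                                   # hit probability
    se_now = 4 * np.sqrt(p * (1 - p) / n)
    se_next = 4 * np.sqrt(p * (1 - p) / (n + batch))
    if se_now - se_next < time_cost_per_batch:    # marginal value < marginal cost
        break

print(f"pi ≈ {est:.5f} after {n} samples (stopped by the monitor)")
```

The algorithm is interruptible at every batch boundary, and the quality of the answer is a known, improving function of elapsed compute, which is what makes the stopping decision tractable.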
3 At scale
The above ideas don’t directly address the economic structure I care about — the substitution between training and inference, the amortisation of a foundation model across millions of users, the trade-off between data acquisition cost and model complexity. That seems to require something genuinely economic, not just decision-theoretic.
How much can one kind of computation substitute for another? Answering that might require something "deeper", like a fundamental theory of intelligence, but pure economics looks to me like an easier nut to crack. Foundation models, in this light, are an ingenious amortisation strategy: maybe an interesting financial instrument denominated in the currency of "cognition," or maybe just a classic production cost curve with unusual returns to scale.
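One off-the-shelf way to pose the substitution question, offered as a placeholder rather than a claim about the true technology: treat training and inference compute as inputs to a CES production function for delivered capability $Q$, and make the elasticity of substitution the empirical target,

```latex
% Toy CES framing; Q, A, alpha, rho are placeholder symbols, not estimates.
\[
  Q \;=\; A \left( \alpha\, C_{\text{train}}^{\rho}
        + (1-\alpha)\, C_{\text{inf}}^{\rho} \right)^{1/\rho},
  \qquad
  \sigma \;=\; \frac{1}{1-\rho}.
\]
```

Scaling-law experiments and inference-time-scaling results would then be point measurements on this surface; whether the surface is anything like CES is exactly the open question.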
What I think is still missing: a theory that connects Wang's AIKR (intelligence is resource scarcity), Russell's metareasoning (each computation is a decision), and the actual microeconomics of ML systems (training vs. inference, data vs. compute, memorisation vs. extrapolation). I would even like it to include the computation that happens inside the human skull. For now, some more specific notes on the pieces.
4 Amortisation
Amortisation is what Bayesians call the trade-off involved in learning inferential shortcuts: spend compute up front to learn a map from observations to (approximate) posteriors, so that each new inference is cheap. There is a whole parallel literature on this in probabilistic NNs and variational inference.
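In variational terms the trade-off can be stated in one line (standard in the VAE literature; notation mine): per-instance inference optimises fresh variational parameters $\phi_i$ for every datum, while amortised inference trains a single network $q_\phi(z \mid x)$ and reuses it,

```latex
% Per-instance VI (one optimisation per datum) vs. amortised VI (one network).
\[
  \phi_i^{\ast} = \operatorname*{arg\,max}_{\phi_i}
      \mathrm{ELBO}\big(q_{\phi_i}(z),\, x_i\big)
  \quad\text{vs.}\quad
  \phi^{\ast} = \operatorname*{arg\,max}_{\phi}
      \sum_i \mathrm{ELBO}\big(q_{\phi}(z \mid x_i),\, x_i\big).
\]
```

The shortfall of the shared $\phi^{\ast}$ relative to each bespoke $\phi_i^{\ast}$ is the amortisation gap; the saving is that inference becomes a forward pass.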
It doesn't presume Bayes, though. Case in point: RL. We can spend a lot of compute at training time to learn a policy that is very cheap to execute at inference time. In many RL algorithms, notably ones trained against known physical models, this is a naked attempt to speed up a thing we could already calculate. We could in principle compute the optimal action at each time step by solving a dynamic programming problem, but that is often computationally intractable. Instead, we learn a policy that approximates the optimal action, amortising the cost of dynamic programming into a single forward pass.
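A minimal sketch of that move, with everything invented for illustration (a one-dimensional gridworld, a linear policy): solve the MDP exactly by value iteration, then distil the optimal actions into a classifier whose evaluation is one dot product.

```python
"""Amortising dynamic programming into a cheap policy (toy sketch)."""
import numpy as np

n_states, gamma = 21, 0.95
goal = n_states // 2                     # goal in the middle of the line
actions = (-1, +1)                       # step left / step right

# Expensive phase: exact dynamic programming (value iteration).
V = np.zeros(n_states)
for _ in range(500):
    Q = np.empty((n_states, 2))
    for s in range(n_states):
        for i, a in enumerate(actions):
            s2 = min(max(s + a, 0), n_states - 1)
            Q[s, i] = (1.0 if s2 == goal else 0.0) + gamma * V[s2]
    V = Q.max(axis=1)
optimal_action = Q.argmax(axis=1)        # 0 = left, 1 = right

# Cheap phase: distil the DP solution into a linear policy
# (logistic regression fit by plain gradient ascent).
x = (np.arange(n_states) / (n_states - 1)).reshape(-1, 1)
X = np.hstack([x, np.ones_like(x)])      # feature + bias
w = np.zeros(2)
for _ in range(5000):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    w += 0.5 * X.T @ (optimal_action - p) / n_states

policy = lambda s: int(X[s] @ w > 0)     # one dot product per decision
agree = sum(policy(s) == optimal_action[s] for s in range(n_states))
print(f"distilled policy matches dynamic programming on {agree}/{n_states} states")
```

The gridworld is small enough that the DP solve is trivial; the point is the shape of the pipeline, where all the expensive search happens once, offline, and the deployed artefact is a constant-time function.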
RLHF adds a second compute budget on top of pretraining, and Toby Ord argued that RL training was nearing its effective limit: “we may have lost the ability to effectively turn more compute into more intelligence.” That doesn't seem to have tanked progress, though, so I suspect something else is going on.
TODO: training vs. inference substitution more generally; memorisation as a form of amortisation.
5 The bitter lesson and its discontents
Sutton’s famous bitter lesson set the terminology:
The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin.
We make this point a lot — for example, On the futility of trying to be clever (the bitter lesson redux).
Alternative phrasing: even the best human minds aren't very good at hand-designing intelligence; the quickest path to intelligence prioritises replacing that human bottleneck with compute.
Things work better when they can scale up. We should deploy the compute we have effectively.
Orhan, for instance, writes:

I do think that there are many interesting research directions consistent with the philosophy of the bitter lesson that can have more meaningful, longer-term impact than small algorithmic or architectural tweaks. I just want to wrap up this post by giving a couple of examples below:
Probing the limits of model capabilities as a function of training data size: can we get to something close to human-level machine translation by leveraging everything multi-lingual on the web (I think we’ve learned that the answer to this is basically yes)? Can we get to something similar to human-level language understanding by scaling up the GPT-3 approach a couple of orders of magnitude (I think the answer to this is probably we don’t know yet)? Of course, smaller scale, less ambitious versions of these questions are also incredibly interesting and important.
Finding out what we can learn from different kinds of data and how what we learn differs as a function of this: e.g. learning from raw video data vs. learning from multi-modal data received by an embodied agent interacting with the world; learning from pure text vs. learning from text + images or text + video.
Coming up with new model architectures or training methods that can leverage data and compute more efficiently, e.g., more efficient transformers, residual networks, batch normalization, self-supervised learning algorithms that can scale to large (ideally unlimited) data (e.g. likelihood-based generative pre-training, contrastive learning).
I think Orhan's statement is easier to engage with than Sutton's because he gives concrete examples. His argument is tight.
Orhan's examples suggest that when data are plentiful, we should use compute-heavy infrastructure to extract as much information from them as possible. And we're getting very good at it: algorithmic progress has yielded more gains than classical hardware efficiency (Grace 2013). The AI and efficiency analysis (Hernandez and Brown 2020) finds that since 2012, the compute needed to train a model to AlexNet-level ImageNet performance has halved every 16 months, a 44× improvement by 2019, where Moore's Law would give only about 11× over the same period.
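Back-of-envelope, to reconcile those figures (2012 to 2019 is 84 months):

```latex
% 16-month halving vs. Moore's-Law 24-month doubling, over 84 months.
\[
  2^{84/16} = 2^{5.25} \approx 38\times
  \quad\text{(Hernandez and Brown measure } 44\times\text{)},
  \qquad
  2^{84/24} = 2^{3.5} \approx 11\times .
\]
```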
This doesn’t directly apply everywhere. For example, in my last hydrology project, a single data point cost AUD 700,000 because drilling a thousand-metre well cost that much. Telling me to collect a billion more data points instead of being clever isn’t useful; collapsing the global economy to collect them wouldn’t be clever either. There might be ways to get more data, such as pretraining a geospatial foundation model on hydrology-adjacent tasks, but that doesn’t look trivial.
What we generally want isn’t a homily saying being clever is a waste of time, but a trade-off curve quantifying how clever to bother being. That’s probably the best lesson.
6 Bitter lessons in career strategy
I get the most from the limited compute in my skull by figuring out how to use the larger compute on my GPU.
Career version:
…a lot of the good ideas that did not require a massive compute budget have already been published by smart people who did not have GPUs, so we need to leverage our technological advantage relative to those ancestors if we want to get cited.
7 Incoming
Will the Need to Retrain AI Models from Scratch Block a Software Intelligence Explosion?
IsoFLOP curves of large language models are extremely flat – Severely Theoretical
Compute Goes Brrr: Revisiting Sutton’s Bitter Lesson for Artificial Intelligence
Gwern
Trading off compute against humans — partially in economics of LLMs
Epistemic bottlenecks — information transmission and compression
Thermodynamics of computation: material costs, statistical mechanics of statistics
8 References
Footnotes
Wang explicitly acknowledges the lineage to Simon's bounded rationality but argues AIKR is more concrete and more restrictive.

