Operationalising the bitter lessons in compute and cleverness

Amortizing the cost of being smart

2022-01-14 — 2026-04-21

Wherein existing formalisms — among them rational metareasoning and resource rationality — are surveyed as partial accounts of how training-time and inference-time compute may be substituted.

bounded compute
functional analysis
machine learning
model selection
optimization
statmech
when to compute
Figure 1: What to compute, and when, to make inferences about the world most efficiently.

A lot of ML problems are, in hindsight, implicitly about when to spend our compute budget. Scaling curves tell us we can spend a predictably large amount of compute at training time and amortise it over cheap inference. Sutton’s bitter lesson, now folk wisdom, says that investing in compute tends to outperform investing in clever algorithms. The economics of AI/labour substitution asks in what sense compute is a kind of labour. I find it useful to group the normative question of doing all this well under the heading of when to compute.

There are, it turns out, several existing research programmes that formalise this intuition — more than I initially realised, AI-research-assistants-be-praised. I do not think any of them yet composes into the theory I want, but they get further than “some unvarnished thoughts,” so I should acknowledge them before going further.

1 Intelligence as resource scarcity

Pei Wang’s definition of intelligence in his NARS programme (Wang 2022, 1999) proposes that intelligence is the ability of an information-processing system to adapt to its environment while working with insufficient knowledge and resources. The “Assumption of Insufficient Knowledge and Resources” (AIKR) is foundational to his notion of intelligence. If we had sufficient knowledge and compute, we wouldn’t need intelligence — we’d just look up the answer. Intelligence is the phenomenon that arises from doing inference under resource scarcity.

Compare and contrast with, e.g. AIXI. Hutter’s AIXI defines an ideal agent: one that considers every computable hypothesis, weighted by Kolmogorov complexity, and picks the action maximising expected future reward. It’s a beautiful formalisation of what intelligence would be if resources were unlimited — and it’s incomputable, by construction. The practical response has been to approximate: AIXItl bounds the computation time and hypothesis depth, trading optimality for tractability. But the approximation is bolted on after the fact; the theory itself lives in a world without scarcity.

Wang starts from the other end. AIKR says: we never have enough knowledge, we never have enough time, and the problems arrive without warning. Intelligence isn’t what remains after we bound an ideal agent; it’s what emerges because the agent was always bounded. An AIXI-derived agent degrades from perfection. An AIKR agent is designed from the ground up to satisfice — distributing its time–space budget across competing tasks according to their relative priority, where no solution is ever final.1

AIKR is still too nebulous for my tastes. It tells us intelligence is the resource-allocation problem, but doesn’t give us the production functions or substitution curves between, say, training compute and inference compute. For that we need something more economic.

2 Rational metareasoning and resource rationality

Here are some things I hadn’t heard about until 2 hours ago!

Stuart Russell and Eric Wefald’s rational metareasoning (Russell and Wefald 1991) models each computation step as a decision: the expected value of performing that computation, minus its cost. Intelligence, in this framing, is choosing which inferences to run — literally “when to compute” as a decision theory. Hay, Russell, Tolpin & Shimony (Hay et al. 2012) later operationalised this as a meta-level MDP, making the framework tractable enough to implement, apparently. I would want to get my hands dirty with this before I felt comfortable claiming to understand it.
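The rule itself is simple enough to sketch. Below is a toy meta-controller of my own construction (a simplification, not Hay et al.’s meta-level MDP): it proxies the value of one more computation by the expected reduction in the standard error of a running Monte Carlo estimate, and keeps computing only while that value exceeds the cost per step.

```python
import statistics

def voc_stop(simulate, cost, max_steps=1000):
    """Compute only while the expected value of computation (VOC)
    exceeds its cost, in the spirit of Russell & Wefald (1991).

    `simulate()` returns one noisy sample of the quantity we want to
    estimate; the value of another sample is proxied by the expected
    reduction in the standard error of the running mean."""
    samples = [simulate(), simulate()]
    while len(samples) < max_steps:
        n = len(samples)
        sd = statistics.stdev(samples)
        voc = sd / n ** 0.5 - sd / (n + 1) ** 0.5  # marginal error reduction
        if voc < cost:  # another computation is no longer worth its price
            break
        samples.append(simulate())
    return statistics.fmean(samples), len(samples)
```

With a noiseless simulator the controller stops immediately; with a noisy one it runs until the marginal error reduction falls below `cost`.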

Griffiths, Lieder & Icard’s resource-rational analysis (Lieder and Griffiths 2020) takes a different angle from cognitive science: human cognitive biases aren’t irrational; they’re optimal given finite compute. The brain is doing the best it can with the budget it has. This turns the bitter lesson from an empirical observation into a normative theory: we should expect any intelligent system to use heuristics, caching, and shortcuts, because those are the resource-rational things to do. Zilberstein’s anytime algorithms (Zilberstein 1996) look like one operationalisation of this: algorithms that return progressively better answers as we give them more time, with the agent deciding when to stop.
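A minimal anytime algorithm, as a toy example of my own (not from Zilberstein’s paper): the estimator yields a usable answer after every step, each no worse than the last, and a meta-level controller decides how much time to buy.

```python
def anytime_pi():
    """Anytime estimator of pi via the Leibniz series: interrupt it at
    any point and the most recent yield is a valid (if rough) answer."""
    total, k, sign = 0.0, 0, 1.0
    while True:
        total += sign / (2 * k + 1)
        sign, k = -sign, k + 1
        yield 4 * total

def run_with_budget(generator, budget_steps):
    """Meta-level control in its crudest form: spend a fixed compute
    budget, then return whatever answer is on the table."""
    answer = None
    for _, answer in zip(range(budget_steps), generator):
        pass
    return answer
```

A smarter controller would monitor the performance profile and stop when the marginal improvement no longer justifies the clock time; a fixed budget is the degenerate case.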

3 At scale

The above ideas don’t directly address the economic structure I care about — the substitution between training and inference, the amortisation of a foundation model across millions of users, the trade-off between data acquisition cost and model complexity. That seems to require something genuinely economic, not just decision-theoretic.

How much can one kind of computation substitute for another? It might be something “deeper” like a fundamental theory of intelligence, but pure economics looks to me like an easier nut to crack. Foundation models, in this light, are an ingenious amortisation strategy — maybe an interesting financial instrument denominated in the currency of “cognition,” or maybe just a classic production cost curve with unusual returns to scale.
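The production-cost reading is easy to sketch with invented numbers (every figure below is hypothetical, chosen only to show the shape of the curve): a foundation model’s training bill is amortised across queries, so its cost advantage over a cheap-to-train but expensive-to-run alternative only appears past a break-even query volume.

```python
def amortised_cost(train_cost, per_query_cost, n_queries):
    """Average cost per query once training is spread over all queries."""
    return train_cost / n_queries + per_query_cost

# Hypothetical cost structures, in arbitrary currency units.
foundation = dict(train_cost=1e7, per_query_cost=0.01)  # train big, infer cheap
bespoke    = dict(train_cost=1e4, per_query_cost=0.50)  # train small, search hard

def break_even(a, b):
    """Query volume at which a's extra training spend pays for itself."""
    return (a["train_cost"] - b["train_cost"]) / (
        b["per_query_cost"] - a["per_query_cost"]
    )
```

With these made-up numbers the break-even sits around twenty million queries; below that volume, being “clever” per query is the economical choice.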

What I think is still missing: a theory that connects Wang’s AIKR (intelligence is resource scarcity), Russell’s metareasoning (each computation is a decision), and the actual microeconomics of ML systems (training vs. inference, data vs. compute, memorisation vs. extrapolation). I would like it even to include the computation that happens within the human skull. For now, some more specific notes on the pieces.

4 Amortisation

Amortisation is the Bayesians’ name for the trade-off involved in learning inferential shortcuts: spend training compute once on a model that maps observations to approximate posteriors, so that each subsequent inference is a cheap forward pass. There is a whole parallel literature on this in probabilistic NNs and variational inference.

It doesn’t presume Bayes, though. Case in point: RL. We can spend a lot of compute at training time to learn a policy that is very cheap to execute at inference time. In many RL algorithms, notably ones trained against known physical models, this is a naked attempt to speed up something we could already calculate: in principle we could find the optimal action at each time step by solving a dynamic programming problem, but that is often computationally intractable. Instead, we learn a policy that approximates the optimal action, amortising the cost of dynamic programming into a single forward pass.
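A toy version of this, of my own construction: value iteration on a ten-state chain MDP pays the dynamic-programming cost once, and a lookup-table policy then amortises it into O(1) work per time step.

```python
def value_iteration(n=10, goal=9, gamma=0.95, tol=1e-9):
    """The expensive planning step: solve a chain MDP (states 0..n-1,
    actions {-1, +1}, reward 1 for landing on `goal`) by dynamic programming."""
    step = lambda s, a: min(n - 1, max(0, s + a))  # deterministic, clipped moves
    V = [0.0] * n
    while True:
        delta = 0.0
        for s in range(n):
            best = max(
                (1.0 if step(s, a) == goal else 0.0) + gamma * V[step(s, a)]
                for a in (-1, 1)
            )
            delta, V[s] = max(delta, abs(best - V[s])), best
        if delta < tol:
            return V

def extract_policy(V, n=10, goal=9, gamma=0.95):
    """The amortisation step: bake the solution into a lookup table, so
    acting no longer requires re-solving the MDP."""
    step = lambda s, a: min(n - 1, max(0, s + a))
    return {
        s: max(
            (-1, 1),
            key=lambda a: (1.0 if step(s, a) == goal else 0.0)
            + gamma * V[step(s, a)],
        )
        for s in range(n)
    }
```

Every state’s optimal action here is to move right; the point is that the table answers in constant time what the dynamic program solved once.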

RLHF adds a second compute budget on top of pretraining, and Toby Ord has argued that RL training was nearing its effective limit: “we may have lost the ability to effectively turn more compute into more intelligence.” That doesn’t seem to have tanked progress, though, so I suspect something else is going on.

TODO: training vs. inference substitution more generally; memorisation as a form of amortisation.

5 The bitter lesson and its discontents

Sutton’s famous bitter lesson set the terminology:

The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin.

We make this point a lot — for example, On the futility of trying to be clever (the bitter lesson redux).

Alternative phrasing: Even the best human minds aren’t very good; the quickest path to intelligence prioritises replacing that human bottleneck.

Things work better when they can scale up. We should deploy the compute we have effectively.

Figure 2: Via Gwern; can no longer find original link because Twitter is being weird

Emin Orhan argues:

I do think that there are many interesting research directions consistent with the philosophy of the bitter lesson that can have more meaningful, longer-term impact than small algorithmic or architectural tweaks. I just want to wrap up this post by giving a couple of examples below:

  1. Probing the limits of model capabilities as a function of training data size: can we get to something close to human-level machine translation by leveraging everything multi-lingual on the web (I think we’ve learned that the answer to this is basically yes)? Can we get to something similar to human-level language understanding by scaling up the GPT-3 approach a couple of orders of magnitude (I think the answer to this is probably we don’t know yet)? Of course, smaller scale, less ambitious versions of these questions are also incredibly interesting and important.

  2. Finding out what we can learn from different kinds of data and how what we learn differs as a function of this: e.g. learning from raw video data vs. learning from multi-modal data received by an embodied agent interacting with the world; learning from pure text vs. learning from text + images or text + video.

  3. Coming up with new model architectures or training methods that can leverage data and compute more efficiently, e.g., more efficient transformers, residual networks, batch normalization, self-supervised learning algorithms that can scale to large (ideally unlimited) data (e.g. likelihood-based generative pre-training, contrastive learning).

I think Orhan’s statement is easier to engage with than Sutton’s because he gives concrete examples; there is something specific to push against.

Orhan’s examples suggest that when data are plentiful, we should use compute-heavy infrastructure to extract as much information as possible. And we’re getting very good at it — algorithmic progress has yielded more gains than classical hardware efficiency (Grace 2013). The AI and efficiency analysis (Hernandez and Brown 2020) finds that since 2012, the compute needed to train to AlexNet-level ImageNet performance has halved every 16 months (Moore’s Law would give only 11× over the same period).
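The two rates are worth a back-of-envelope check over the 2012–2019 window the paper covers (I assume a 24-month doubling period for Moore’s law):

```python
months = 7 * 12  # 2012 -> 2019, the window Hernandez & Brown analyse

# Algorithmic efficiency: the compute needed for AlexNet-level performance
# halves every 16 months, i.e. effective efficiency doubles every 16 months.
algorithmic_gain = 2 ** (months / 16)  # ~38x; the paper reports 44x

# Moore's law baseline: doubling every 24 months.
moores_gain = 2 ** (months / 24)  # ~11x, matching the figure quoted above
```

The constant-rate extrapolation (~38x) undershoots the paper’s measured 44x slightly, but the order of magnitude, and the gap to the hardware baseline, both check out.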

This doesn’t directly apply everywhere. For example, in my last hydrology project, a single data point cost AUD 700,000 because drilling a thousand-metre well cost that much. Telling me to collect a billion more data points instead of being clever isn’t useful; collapsing the global economy to collect them wouldn’t be clever either. There might be ways to get more data, such as pretraining a geospatial foundation model on hydrology-adjacent tasks, but that doesn’t look trivial.

What we generally want isn’t a homily saying being clever is a waste of time, but a trade-off curve quantifying how clever to bother being. That’s probably the best lesson.

6 Bitter lessons in career strategy

I get the most from the limited compute in my skull by figuring out how to use the larger compute on my GPU.

Career version:

…a lot of the good ideas that did not require a massive compute budget have already been published by smart people who did not have GPUs, so we need to leverage our technological advantage relative to those ancestors if we want to get cited.

7 Incoming

8 References

Beaulieu, Frati, Miconi, et al. 2020. “Learning to Continually Learn.”
Cancho, and Solé. 2003. “Least Effort and the Origins of Scaling in Human Language.” Proceedings of the National Academy of Sciences.
Clune. 2020. “AI-GAs: AI-Generating Algorithms, an Alternate Paradigm for Producing General Artificial Intelligence.”
Cully, Clune, Tarapore, et al. 2015. “Robots that can adapt like animals.” Nature.
Dasgupta, Schulz, Tenenbaum, et al. 2020. “A Theory of Learning to Infer.” Psychological Review.
Falandays, Kaaronen, Moser, et al. 2022. “All Intelligence Is Collective Intelligence.”
Finn, Abbeel, and Levine. 2017. “Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks.” In Proceedings of the 34th International Conference on Machine Learning.
Ganguly, Jain, and Watchareeruetai. 2023. “Amortized Variational Inference: A Systematic Review.” Journal of Artificial Intelligence Research.
Gershman. n.d. “Amortized Inference in Probabilistic Reasoning.”
Grace. 2013. “Algorithmic Progress in Six Domains.”
Hay, Russell, Tolpin, et al. 2012. “Selecting Computations: Theory and Applications.” In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence.
Hernandez, and Brown. 2020. “Measuring the Algorithmic Efficiency of Neural Networks.”
Hoffman, and Prakash. 2014. “Objects of consciousness.” Frontiers in Psychology.
Hooker. 2020. “The Hardware Lottery.” arXiv:2009.06489 [cs].
Hu, Jain, Elmoznino, et al. 2023. “Amortizing Intractable Inference in Large Language Models.”
Ionescu, Frisch, Farghly, et al. 2025. “Cognitive Infrastructures.” Antikythera Digital Journal.
Lang, Fisher, Mora, et al. 2014. “Thermodynamics of Statistical Inference by Cells.” Physical Review Letters.
Levin. 2024. “Artificial Intelligences: A Bridge Toward Diverse Intelligence and Humanity’s Future.” Advanced Intelligent Systems.
Levine, Chater, Tenenbaum, et al. 2024. “Resource-Rational Contractualism: A Triple Theory of Moral Cognition.” Behavioral and Brain Sciences.
Lieder, and Griffiths. 2020. “Resource-Rational Analysis: Understanding Human Cognition as the Optimal Use of Limited Computational Resources.” Behavioral and Brain Sciences.
Lowe, Foerster, Boureau, et al. 2019. “On the Pitfalls of Measuring Emergent Communication.”
Margossian, and Blei. 2024. “Amortized Variational Inference: When and Why?”
Marsland, and England. 2018. “Limits of Predictions in Thermodynamic Systems: A Review.” Reports on Progress in Physics.
O’Connor. 2017. “Evolving to Generalize: Trading Precision for Speed.” British Journal for the Philosophy of Science.
Ortega, and Braun. 2013. “Thermodynamics as a Theory of Decision-Making with Information-Processing Costs.” Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences.
Perunov, Marsland, and England. 2016. “Statistical Physics of Adaptation.” Physical Review X.
Petersson, Folia, and Hagoort. 2012. “What Artificial Grammar Learning Reveals about the Neurobiology of Syntax.” Brain and Language.
Resnick, Gupta, Foerster, et al. 2020. “Capacity, Bandwidth, and Compositionality in Emergent Language Learning.”
Ringstrom. 2023. “Reward Is Not Necessary: How to Create a Modular & Compositional Self-Preserving Agent for Life-Long Learning.”
Russell, and Wefald. 1991. “Principles of Metareasoning.” Artificial Intelligence.
Schneider, and Kay. 1994. “Life as a Manifestation of the Second Law of Thermodynamics.” Mathematical and Computer Modelling.
Shin, Price, Wolpert, et al. 2020. “Scale and Information-Processing Thresholds in Holocene Social Evolution.” Nature Communications.
Spufford. 2012. Red Plenty.
Steyvers, and Tenenbaum. 2005. “The Large-Scale Structure of Semantic Networks: Statistical Analyses and a Model of Semantic Growth.” Cognitive Science.
Still, Sivak, Bell, et al. 2012. “Thermodynamics of Prediction.” Physical Review Letters.
Stinchcombe. 1990. Information and Organizations.
Togelius, and Yannakakis. 2023. “Choose Your Weapon: Survival Strategies for Depressed AI Academics.”
Wang. 1999. “On the Working Definition of Intelligence.”
———. 2022. “Intelligence: From Definition to Design.” In Proceedings of the Third International Workshop on Self-Supervised Learning.
Weisbuch, Deffuant, Amblard, et al. 2002. “Meet, Discuss, and Segregate!” Complexity.
Wolpert, David H. 2008. “Physical Limits of Inference.” Physica D: Nonlinear Phenomena.
———. 2018. “Theories of Knowledge and Theories of Everything.” In The Map and the Territory: Exploring the Foundations of Science, Thought and Reality. The Frontiers Collection.
Wolpert, David H., and Harper. 2025. “The Computational Power of a Human Society: A New Model of Social Evolution.”
Wolpert, David H., and Korbel. 2026. “What Does It Mean for a System to Compute?” Journal of Physics: Complexity.
Xu, Zhao, Song, et al. 2019. “A Theory of Usable Information Under Computational Constraints.”
Zammit-Mangion, Sainsbury-Dale, and Huser. 2024. “Neural Methods for Amortized Inference.”
Zhao, Li, Zhang, et al. 2025. “Curious Causality-Seeking Agents Learn Meta Causal World.”
Zilberstein. 1996. “Using Anytime Algorithms in Intelligent Systems.” AI Magazine.

Footnotes

  1. Wang explicitly acknowledges the lineage to Simon’s bounded rationality but argues AIKR is more concrete and more restrictive.↩︎