Operationalising the bitter lessons in compute and cleverness
Amortising the cost of being smart
2022-01-14 — 2026-04-21
Wherein existing formalisms — among them rational metareasoning and resource rationality — are surveyed as partial accounts of how training-time and inference-time compute may be substituted.
What to compute, and when, to make inferences about the world most efficiently.
A lot of ML problems are, in hindsight, implicitly about when to spend our compute budget. Scaling curves tell us we can spend a predictably large amount of compute at training time and amortise it over cheap inference. Sutton's folk-wisdom bitter lesson says that investing in compute tends to outperform investing in clever algorithms. The economics of AI/labour substitution asks in what sense compute is a kind of labour. I think it's useful to group the normative question of doing this well under the heading of when to compute.
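To fix intuitions, a toy accounting identity (the symbols are mine, not from any scaling-law paper): a model trained once at cost $C_{\text{train}}$ and queried $N$ times at marginal cost $C_{\text{inf}}$ has per-query cost

```latex
% Toy amortisation accounting; all symbols illustrative.
\[
  c(N) \;=\; \frac{C_{\text{train}}}{N} + C_{\text{inf}},
  \qquad
  c(N) \to C_{\text{inf}} \ \text{as}\ N \to \infty .
\]
```

Training-time cleverness pays exactly when the capability it buys is consumed often enough to drive the first term toward zero.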
There are, it turns out, several existing research programmes that formalise this intuition (more than I initially realised, AI-research-assistants-be-praised). I do not think any of them yet adds up to the theory I want, but they get further than "some unvarnished thoughts," so I should acknowledge them before going further.
1 Intelligence as resource scarcity
Pei Wang’s definition of intelligence in his NARS programme (Wang 2022, 1999) proposes that intelligence is the ability of an information-processing system to adapt to its environment while working with insufficient knowledge and resources. The “Assumption of Insufficient Knowledge and Resources” (AIKR) is foundational to his notion of intelligence. If we had sufficient knowledge and compute, we wouldn’t need intelligence — we’d just look up the answer. Intelligence is the phenomenon that arises from doing inference under resource scarcity.
Compare and contrast with, for example, Hutter's AIXI, which defines an ideal agent: one that considers every computable hypothesis, weighted by Kolmogorov complexity, and picks the action maximising expected future reward. It is a beautiful formalisation of what intelligence would be if resources were unlimited, and it is incomputable by construction. The practical response has been to approximate: AIXItl bounds the computation time and program length, trading optimality for tractability. But the approximation is bolted on after the fact; the theory itself lives in a world without scarcity.
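Schematically, and glossing over the interaction-history bookkeeping in Hutter's actual definition: AIXI mixes over all programs $p$ for a universal machine $U$, weighted by description length, and acts to maximise expected return under that mixture,

```latex
% Simplified sketch of AIXI; the full definition carries explicit
% observation/reward histories that are suppressed here.
\[
  \xi(x) \;=\; \sum_{p \,:\, U(p) = x\ast} 2^{-\ell(p)},
  \qquad
  a_t \;=\; \operatorname*{arg\,max}_{a} \;
  \mathbb{E}_{\xi}\!\Big[\textstyle\sum_{k=t}^{m} r_k \,\Big|\, \text{history},\, a\Big].
\]
```

Evaluating $\xi$ runs straight into the halting problem, which is where the incomputability bites; AIXItl's bounds on $\ell(p)$ and runtime are the bolt-on.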
Wang starts from the other end. AIKR says: we never have enough knowledge, we never have enough time, and the problems arrive without warning. Intelligence isn’t what remains after we bound an ideal agent; it’s what emerges because the agent was always bounded. An AIXI-derived agent degrades from perfection. An AIKR agent is designed from the ground up to satisfice — distributing its time–space budget across competing tasks according to their relative priority, where no solution is ever final.1
AIKR is still too nebulous for my tastes. It tells us intelligence is the resource-allocation problem, but doesn’t give us the production functions or substitution curves between, say, training compute and inference compute. For that we need something more economic.
2 Rational metareasoning and resource rationality
Here are some things I hadn’t heard about until 2 hours ago!
Stuart Russell and Eric Wefald's rational metareasoning (Russell and Wefald 1991) models each computation step as a decision in its own right: perform it only if the expected improvement it brings to the eventual object-level decision exceeds its cost. Intelligence, in this framing, is choosing which inferences to run, literally "when to compute" as a decision theory. Hay, Russell, Tolpin and Shimony (Hay et al. 2012) later operationalised this as a meta-level MDP, making the framework tractable enough to implement, apparently. I would want to get my hands dirty with this before I felt comfortable claiming to understand it.
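A toy rendering of the flavour of that loop, under assumptions that are entirely mine (two candidate actions, Gaussian evaluation noise, a fixed price per simulation, a myopic value-of-computation rule); this is a sketch, not Russell and Wefald's algorithm nor the Hay et al. meta-level MDP:

```python
"""Toy rational metareasoning: is another simulation worth its cost?"""
import numpy as np

rng = np.random.default_rng(0)
true_values = np.array([1.0, 1.2])  # unknown to the agent
noise_sd = 1.0                      # sd of one simulated evaluation
cost_per_sim = 0.005                # invented price of one computation

# Seed each action with a couple of noisy evaluations.
samples = [list(true_values[a] + noise_sd * rng.standard_normal(2)) for a in (0, 1)]

while True:
    means = np.array([np.mean(s) for s in samples])
    sems = np.array([noise_sd / np.sqrt(len(s)) for s in samples])
    best = int(means.argmax())
    runner = 1 - best
    # Myopic value of computation: how much we expect the chosen action's
    # value to improve if another look at the runner-up flips the ranking.
    draws = means[runner] + sems[runner] * rng.standard_normal(10_000)
    voc = float(np.mean(np.maximum(draws - means[best], 0.0)))
    if voc < cost_per_sim:          # thinking stopped paying for itself
        break
    samples[runner].append(float(true_values[runner] + noise_sd * rng.standard_normal()))

n_sims = sum(len(s) for s in samples)
print(f"committed to action {best} after {n_sims} simulations")
```

The meta-level decision ("simulate again, or commit?") is taken by the same expected-value calculus as the object-level one, which is the whole point of the framework.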
Griffiths, Lieder and Icard's resource-rational analysis (Lieder and Griffiths 2020) takes a different angle, from cognitive science: human cognitive biases aren't irrational; they're optimal given finite compute. The brain is doing the best it can with the budget it has. This turns the bitter lesson from an empirical observation into a normative theory: we should expect any intelligent system to use heuristics, caching, and shortcuts, because those are the resource-rational thing to do. Zilberstein's anytime algorithms (Zilberstein 1996) look like one way to operationalise all this: algorithms that return progressively better answers the more time we give them, with the agent deciding when to stop.
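And a toy anytime algorithm with a monitor attached, again with invented constants: the object level estimates π by Monte Carlo; the meta level stops when the marginal accuracy bought by the next batch falls below its (made-up) time price.

```python
"""Toy anytime algorithm with a stopping monitor (sketch, assumptions mine)."""
import numpy as np

rng = np.random.default_rng(1)
batch = 10_000
time_cost_per_batch = 2e-5   # invented exchange rate: time vs. accuracy

inside = 0
n = 0
while True:
    pts = rng.random((batch, 2))
    inside += int(np.sum(pts[:, 0] ** 2 + pts[:, 1] ** 2 < 1.0))
    n += batch
    est = 4 * inside / n
    p = est / 4                                   # hit probability
    se_now = 4 * np.sqrt(p * (1 - p) / n)
    se_next = 4 * np.sqrt(p * (1 - p) / (n + batch))
    if se_now - se_next < time_cost_per_batch:    # marginal value < marginal cost
        break

print(f"pi ≈ {est:.5f} after {n} samples (stopped by the monitor)")
```

The algorithm is interruptible at every batch boundary, and the quality of the answer is a known, improving function of elapsed compute, which is what makes the stopping decision tractable.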
3 At scale
The above ideas don’t directly address the economic structure I care about — the substitution between training and inference, the amortisation of a foundation model across millions of users, the trade-off between data acquisition cost and model complexity. That seems to require something genuinely economic, not just decision-theoretic.
How much can one kind of computation substitute for another? Answering that might require something "deeper", like a fundamental theory of intelligence, but pure economics looks to me like an easier nut to crack. Foundation models, in this light, are an ingenious amortisation strategy: maybe an interesting financial instrument denominated in the currency of "cognition," or maybe just a classic production cost curve with unusual returns to scale.
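One off-the-shelf way to pose the substitution question, offered as a placeholder rather than a claim about the true technology: treat training and inference compute as inputs to a CES production function for delivered capability $Q$, and make the elasticity of substitution the empirical target,

```latex
% Toy CES framing; Q, A, alpha, rho are placeholder symbols, not estimates.
\[
  Q \;=\; A \left( \alpha\, C_{\text{train}}^{\rho}
        + (1-\alpha)\, C_{\text{inf}}^{\rho} \right)^{1/\rho},
  \qquad
  \sigma \;=\; \frac{1}{1-\rho}.
\]
```

Scaling-law experiments and inference-time-scaling results would then be point measurements on this surface; whether the surface is anything like CES is exactly the open question.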
What I think is still missing: a theory that connects Wang's AIKR (intelligence is resource scarcity), Russell's metareasoning (each computation is a decision), and the actual microeconomics of ML systems (training vs. inference, data vs. compute, memorisation vs. extrapolation). I would even like it to include the computation that happens inside the human skull. For now, some more specific notes on the pieces.
4 Amortisation
Amortisation is what Bayesians call the trade-off involved in learning inferential shortcuts: spend compute up front to learn a map from observations to (approximate) posteriors, so that each new inference is cheap. There is a whole parallel literature on this in probabilistic NNs and variational inference.
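In variational terms the trade-off can be stated in one line (standard in the VAE literature; notation mine): per-instance inference optimises fresh variational parameters $\phi_i$ for every datum, while amortised inference trains a single network $q_\phi(z \mid x)$ and reuses it,

```latex
% Per-instance VI (one optimisation per datum) vs. amortised VI (one network).
\[
  \phi_i^{\ast} = \operatorname*{arg\,max}_{\phi_i}
      \mathrm{ELBO}\big(q_{\phi_i}(z),\, x_i\big)
  \quad\text{vs.}\quad
  \phi^{\ast} = \operatorname*{arg\,max}_{\phi}
      \sum_i \mathrm{ELBO}\big(q_{\phi}(z \mid x_i),\, x_i\big).
\]
```

The shortfall of the shared $\phi^{\ast}$ relative to each bespoke $\phi_i^{\ast}$ is the amortisation gap; the saving is that inference becomes a forward pass.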
It doesn't presume Bayes, though. Case in point: RL. We can spend a lot of compute at training time to learn a policy that is very cheap to execute at inference time. In many RL algorithms, notably ones trained against known physical models, this is a naked attempt to speed up a thing we could already calculate. We could in principle compute the optimal action at each time step by solving a dynamic programming problem, but that is often computationally intractable. Instead, we learn a policy that approximates the optimal action, amortising the cost of dynamic programming into a single forward pass.
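A minimal sketch of that move, with everything invented for illustration (a one-dimensional gridworld, a linear policy): solve the MDP exactly by value iteration, then distil the optimal actions into a classifier whose evaluation is one dot product.

```python
"""Amortising dynamic programming into a cheap policy (toy sketch)."""
import numpy as np

n_states, gamma = 21, 0.95
goal = n_states // 2                     # goal in the middle of the line
actions = (-1, +1)                       # step left / step right

# Expensive phase: exact dynamic programming (value iteration).
V = np.zeros(n_states)
for _ in range(500):
    Q = np.empty((n_states, 2))
    for s in range(n_states):
        for i, a in enumerate(actions):
            s2 = min(max(s + a, 0), n_states - 1)
            Q[s, i] = (1.0 if s2 == goal else 0.0) + gamma * V[s2]
    V = Q.max(axis=1)
optimal_action = Q.argmax(axis=1)        # 0 = left, 1 = right

# Cheap phase: distil the DP solution into a linear policy
# (logistic regression fit by plain gradient ascent).
x = (np.arange(n_states) / (n_states - 1)).reshape(-1, 1)
X = np.hstack([x, np.ones_like(x)])      # feature + bias
w = np.zeros(2)
for _ in range(5000):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    w += 0.5 * X.T @ (optimal_action - p) / n_states

policy = lambda s: int(X[s] @ w > 0)     # one dot product per decision
agree = sum(policy(s) == optimal_action[s] for s in range(n_states))
print(f"distilled policy matches dynamic programming on {agree}/{n_states} states")
```

The gridworld is small enough that the DP solve is trivial; the point is the shape of the pipeline, where all the expensive search happens once, offline, and the deployed artefact is a constant-time function.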
RLHF adds a second compute budget on top of pretraining, and Toby Ord argued that RL training was nearing its effective limit: “we may have lost the ability to effectively turn more compute into more intelligence.” That doesn't seem to have tanked progress, though, so I suspect something else is going on.
TODO: training vs. inference substitution more generally; memorisation as a form of amortisation.
5 The bitter lesson and its discontents
Sutton’s famous bitter lesson set the terminology:
The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin.
We make this point a lot — for example, On the futility of trying to be clever (the bitter lesson redux).
Alternative phrasing: even the best human minds aren't very good at hand-designing intelligence; the quickest path to intelligence prioritises replacing that human bottleneck with compute.
Things work better when they can scale up. We should deploy the compute we have effectively.
Orhan, for instance, writes:

I do think that there are many interesting research directions consistent with the philosophy of the bitter lesson that can have more meaningful, longer-term impact than small algorithmic or architectural tweaks. I just want to wrap up this post by giving a couple of examples below:
Probing the limits of model capabilities as a function of training data size: can we get to something close to human-level machine translation by leveraging everything multi-lingual on the web (I think we’ve learned that the answer to this is basically yes)? Can we get to something similar to human-level language understanding by scaling up the GPT-3 approach a couple of orders of magnitude (I think the answer to this is probably we don’t know yet)? Of course, smaller scale, less ambitious versions of these questions are also incredibly interesting and important.
Finding out what we can learn from different kinds of data and how what we learn differs as a function of this: e.g. learning from raw video data vs. learning from multi-modal data received by an embodied agent interacting with the world; learning from pure text vs. learning from text + images or text + video.
Coming up with new model architectures or training methods that can leverage data and compute more efficiently, e.g., more efficient transformers, residual networks, batch normalization, self-supervised learning algorithms that can scale to large (ideally unlimited) data (e.g. likelihood-based generative pre-training, contrastive learning).
I think Orhan's statement is easier to engage with than Sutton's because he gives concrete examples. His argument is tight.
Orhan's examples suggest that when data are plentiful, we should use compute-heavy infrastructure to extract as much information from them as possible. And we're getting very good at it: algorithmic progress has yielded more gains than classical hardware efficiency (Grace 2013). The AI and efficiency analysis (Hernandez and Brown 2020) finds that since 2012, the compute needed to train a model to AlexNet-level ImageNet performance has halved every 16 months, a 44× improvement by 2019, where Moore's Law would give only about 11× over the same period.
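Back-of-envelope, to reconcile those figures (2012 to 2019 is 84 months):

```latex
% 16-month halving vs. Moore's-Law 24-month doubling, over 84 months.
\[
  2^{84/16} = 2^{5.25} \approx 38\times
  \quad\text{(Hernandez and Brown measure } 44\times\text{)},
  \qquad
  2^{84/24} = 2^{3.5} \approx 11\times .
\]
```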
This doesn’t directly apply everywhere. For example, in my last hydrology project, a single data point cost AUD 700,000 because drilling a thousand-metre well cost that much. Telling me to collect a billion more data points instead of being clever isn’t useful; collapsing the global economy to collect them wouldn’t be clever either. There might be ways to get more data, such as pretraining a geospatial foundation model on hydrology-adjacent tasks, but that doesn’t look trivial.
What we generally want isn’t a homily saying being clever is a waste of time, but a trade-off curve quantifying how clever to bother being. That’s probably the best lesson.
6 Bitter lessons in career strategy
I get the most from the limited compute in my skull by figuring out how to use the larger compute on my GPU.
Career version:
…a lot of the good ideas that did not require a massive compute budget have already been published by smart people who did not have GPUs, so we need to leverage our technological advantage relative to those ancestors if we want to get cited.
7 Incoming
Will the Need to Retrain AI Models from Scratch Block a Software Intelligence Explosion?
IsoFLOP curves of large language models are extremely flat – Severely Theoretical
Compute Goes Brrr: Revisiting Sutton’s Bitter Lesson for Artificial Intelligence
Gwern
Trading off compute against humans — partially in economics of LLMs
Epistemic bottlenecks — information transmission and compression
Thermodynamics of computation: material costs, statistical mechanics of statistics
8 References
Footnotes
Wang explicitly acknowledges the lineage to Simon's bounded rationality but argues AIKR is more concrete and more restrictive.

