Operationalising the bitter lessons in compute and cleverness
Amortizing the cost of being smart
2021-01-14 — 2025-10-23
Wherein the trade-off between massive training compute and inference-time compute is examined, along with other amortization techniques, and a research program to formalise the economics of cognition is proposed.
What to compute, and when, to make inferences about the world most efficiently.
Here are some observations.
- scaling curves of LLMs suggest that we can spend a predictably large amount of compute at training time and then do a lot of useful, cheap work at inference time (a sketch of such a curve follows this list)
- classic amortization methods in neural networks suggest that we can trade off between training and inference costs by carefully designing our model architecture and training procedure.
- Sutton’s folk-wisdom bitter lesson suggests that, in the long run, investing in compute tends to outperform investing in clever algorithms.
- the economics of AI/labour substitution lead us to ponder in what sense compute is a kind of labour and how to trade off different kinds of compute
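To pin down that first observation a little (a sketch of the standard functional form, not a claim about any particular model): modern scaling analyses fit loss curves of roughly the shape below, where $N$ is parameter count, $D$ is training tokens, and $E, A, B, \alpha, \beta$ are fitted constants.

$$
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad
C_{\text{train}} \approx 6ND \text{ FLOPs},
\qquad
c_{\text{inference}} \approx 2N \text{ FLOPs per token}.
$$

The point for this post is not the constants but the predictability: if $L(N, D)$ is forecastable then so is the bill, and the training spend can be budgeted like a capital expenditure and amortised over inference calls.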
I wonder whether there is a general economic model here that incorporates the value of information and the cost of compute. If so, I think it would also need to incorporate the notions of memorisation and extrapolation, and trade off data, computational complexity, and similar factors. To me, this looks like a fundamentally economic theory. How much can one kind of computation substitute for another? It might instead be something “deeper” like a fundamental theory of intelligence, but pure economics looks to me like an easier nut to crack.
In that light, foundation models represent an ingenious amortization strategy in the classical sense. Maybe they’re in fact an interesting financial instrument in the currency of “cognition”? Or maybe we should think about more classic production cost curves?
I’m not sure, but intuitively it seems there are some interesting models here.
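As a deliberately naive first sketch of such a model (the notation is mine and purely illustrative): write $C_{\text{train}}$ for the one-off training spend, $c$ for the marginal compute cost per query (chain-of-thought, search, ensembling and so on), $Q$ for the number of queries served, and $f(C_{\text{train}}, c)$ for the quality attained. Then the producer’s problem is a bog-standard fixed-plus-marginal-cost problem:

$$
\min_{C_{\text{train}},\, c} \; C_{\text{train}} + Q\,c
\quad \text{subject to} \quad
f(C_{\text{train}}, c) \ge q^{*},
$$

with average cost per query $C_{\text{train}}/Q + c$. The shape of the iso-quality curves $f(C_{\text{train}}, c) = q^{*}$ is exactly the substitution question: how many inference-time FLOPs buy the same quality as one training-time FLOP, and how does that exchange rate move as $Q$ grows?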
I’m not selling a comprehensive theory in this blog post, not yet. But if I can squeeze out time to research this stuff, maybe I can develop one. For now, just some unvarnished thoughts on the interplay between computation, information, and the economics of machine learning.
1 Word salad
Overparameterization and inductive biases. Operationalizing the scaling hypothesis. Compute overhangs. Tokenomics. Amortization of inference. Reduced order modelling. Training time versus inference time. The business model of FLOPS.
Concretely, traditional statistics was greatly upset when it turned out that laboriously proving things was not as effective as training a neural network to do the same thing. The field was upset again when foundation models and LLMs made it even weirder: it turns out to be worthwhile to spend a lot of money training a gigantic model on all the data you can get, and then to use that model for very cheap inference. Essentially, you pay once for a very expensive training run and thereafter everything is cheap.
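Here is that “pay once” logic as a toy back-of-the-envelope in code. All the numbers are invented for illustration (nothing here describes a real model or real prices); it just finds the break-even query volume at which the expensive pretrained model becomes the cheaper option per query.

```python
# Toy amortisation arithmetic: a big pretrained model (huge one-off training
# cost, cheap per-query inference) versus a bespoke model trained per task
# (cheap to build, pricier per query). All numbers are invented for illustration.

def cost_per_query(train_cost: float, per_query_cost: float, n_queries: int) -> float:
    """Average cost per query once the one-off training cost is amortised."""
    return train_cost / n_queries + per_query_cost

BIG = dict(train_cost=50_000_000.0, per_query_cost=0.002)  # hypothetical frontier model
SMALL = dict(train_cost=50_000.0, per_query_cost=0.02)     # hypothetical bespoke model

# Break-even query count: train_big/Q + c_big = train_small/Q + c_small
break_even = (BIG["train_cost"] - SMALL["train_cost"]) / (
    SMALL["per_query_cost"] - BIG["per_query_cost"]
)
print(f"break-even at ~{break_even:,.0f} queries")

for q in (10**6, 10**9, 10**12):
    big = cost_per_query(BIG["train_cost"], BIG["per_query_cost"], q)
    small = cost_per_query(SMALL["train_cost"], SMALL["per_query_cost"], q)
    print(f"{q:>15,} queries: big={big:.4f} vs small={small:.4f} per query")
```

The only economics in the sketch is that the fixed cost vanishes per-query as volume grows, which is the whole amortisation story in one line.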
A lot of ML problems are in hindsight implicitly about this question of when to spend our compute budget. I think it’s useful to group the normative question of doing this well under the heading of when to compute.
2 Historical background
Sutton’s famous bitter lesson set the terminology:
The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin.
Lots of people make this point, e.g. On the futility of trying to be clever (the bitter lesson redux).
Alternative phrasing: Even the best human mind isn’t very good, so the quickest path to intelligence is the one that prioritizes replacing the human bottleneck.
Things are better when they can scale up. The idea is that we should deploy the compute we have effectively.
I do think that there are many interesting research directions consistent with the philosophy of the bitter lesson that can have more meaningful, longer-term impact than small algorithmic or architectural tweaks. I just want to wrap up this post by giving a couple of examples below:
Probing the limits of model capabilities as a function of training data size: can we get to something close to human-level machine translation by leveraging everything multi-lingual on the web (I think we’ve learned that the answer to this is basically yes)? Can we get to something similar to human-level language understanding by scaling up the GPT-3 approach a couple of orders of magnitude (I think the answer to this is probably we don’t know yet)? Of course, smaller scale, less ambitious versions of these questions are also incredibly interesting and important.
Finding out what we can learn from different kinds of data and how what we learn differs as a function of this: e.g. learning from raw video data vs. learning from multi-modal data received by an embodied agent interacting with the world; learning from pure text vs. learning from text + images or text + video.
Coming up with new model architectures or training methods that can leverage data and compute more efficiently, e.g., more efficient transformers, residual networks, batch normalization, self-supervised learning algorithms that can scale to large (ideally unlimited) data (e.g. likelihood-based generative pre-training, contrastive learning).
I think Orhan’s statement is easier to engage with than Sutton’s because Orhan provides concrete examples. Plus his invective is tight.
Orhan’s examples suggest that when data is plentiful, we should figure out how to use compute-heavy infrastructure to extract all the information.
This doesn’t directly apply everywhere. For example, in hydrology a single data point in my last project cost AUD700,000 because drilling a thousand-metre well costs that much. Telling me to collect a billion more data points instead of being clever isn’t useful; collapsing the global economy to collect them wouldn’t be clever. There might be ways to get more data, such as pretraining a geospatial foundation model on hydrology-adjacent tasks, but that doesn’t look trivial.
What we generally want isn’t a homily saying being clever is a waste of time, but rather a trade-off curve quantifying how clever to bother being. That’s probably the best lesson.
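One hedged way to write that trade-off curve down (notation mine, and the power-law form is just a placeholder): suppose achievable error falls off in both the number of samples $n$ and the “cleverness” budget $c$ (researcher time, bespoke modelling, hand-built inductive bias), with unit prices $p_n$ and $p_c$. Then “how clever to bother being” is the allocation that solves

$$
\min_{n,\, c} \; p_n n + p_c c
\quad \text{subject to} \quad
\varepsilon(n, c) \approx a\, n^{-\alpha} c^{-\gamma} \le \varepsilon^{*}.
$$

When $p_n$ is negligible, as with web text, the optimum pushes nearly everything into $n$ and the bitter lesson drops out as a corollary; when a sample costs AUD700,000, the same arithmetic tells you to buy cleverness instead.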
3 Thermodynamics of computation
Energy, water and minerals, etc., are important inputs to computation. This suggests we should think about the connection between thermodynamics and information and the statistical mechanics of statistics.
4 General compute efficiency
We are getting rapidly better at extracting performance from a given amount of hardware (Grace 2013). The piece AI and efficiency (Hernandez and Brown 2020) makes this clear:
We’re releasing an analysis showing that since 2012 the amount of compute needed to train a neural net to the same performance on ImageNet classification has been decreasing by a factor of 2 every 16 months. Compared to 2012, it now takes 44 times less compute to train a neural network to the level of AlexNet (by contrast, Moore’s Law would yield an 11x cost improvement over this period). Our results suggest that for AI tasks with high levels of recent investment, algorithmic progress has yielded more gains than classical hardware efficiency.
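A quick consistency check of those two figures against each other (my arithmetic, not theirs): a halving every 16 months, compounded over the roughly seven years the analysis covers, gives $2^{84/16} \approx 38$, the same order as the reported 44×.

```python
# Back-of-the-envelope check: 2x efficiency every 16 months, compounded over ~7 years.
months = 7 * 12
improvement = 2 ** (months / 16)
print(f"~{improvement:.0f}x less compute after {months} months")  # ~38x, vs the reported 44x
```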
5 Amortisation
Bayesians discuss this under the banner of amortised inference: pay once to learn an inferential shortcut, then answer each new query cheaply with it. TBC
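In the meantime, here is a minimal sketch of what such a shortcut looks like, assuming a toy conjugate-Gaussian problem (everything here is invented for illustration). The exact posterior is available in closed form, so we can check that a regression fitted once over simulated datasets, standing in for the amortised inference network, recovers it; after that, every new dataset costs one affine evaluation rather than a fresh derivation or optimisation.

```python
# Amortised inference, toy version: learn a map from data summaries to posterior
# parameters once, then reuse it. Model: x_i ~ N(theta, sigma^2), theta ~ N(0, tau^2).
import numpy as np

rng = np.random.default_rng(0)
sigma, tau, n = 1.0, 2.0, 10  # noise scale, prior scale, dataset size (all fixed)

def exact_posterior_mean(x: np.ndarray) -> float:
    """Closed-form posterior mean of theta given one dataset x (the 'slow' path)."""
    return (tau**2 * x.sum()) / (sigma**2 + n * tau**2)

# "Training time": simulate many datasets and fit a linear map from the sufficient
# statistic (the sample mean) to the posterior mean. This is the amortisation step.
thetas = rng.normal(0.0, tau, size=5000)
datasets = rng.normal(thetas[:, None], sigma, size=(5000, n))
features = datasets.mean(axis=1)
targets = np.array([exact_posterior_mean(x) for x in datasets])
w, b = np.polyfit(features, targets, deg=1)  # the learned inferential shortcut

# "Inference time": a new dataset's posterior mean is now a single affine evaluation.
x_new = rng.normal(1.5, sigma, size=n)
print("amortised:", w * x_new.mean() + b)
print("exact:    ", exact_posterior_mean(x_new))
```

The toy is trivially linear, which is why it fits in twenty lines; the real versions (VAE encoders, neural posterior estimation) swap the linear fit for a neural network, but the accounting is the same: one expensive fitting pass, then cheap per-query inference.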
6 Computing and remembering
TBD. For now, see NN memorization.
7 Trading off against humans
Partially discussed in economics of LLMs. We still need a general theory, though.
8 Bitter lessons in career strategy
The best way to spend the limited compute in my skull is to figure out how to use the larger compute on my GPU.
Career version:
…a lot of the good ideas that did not require a massive compute budget have already been published by smart people who did not have GPUs, so we need to leverage our technological advantage relative to those ancestors if we want to get cited.