Unreasonable effectiveness of empirical risk minimization
General methods that leverage computation, namely ones with scalar loss functions and a static data distribution, are ultimately the most profitable, and by a large margin
2026-06-24 — 2026-06-25
Wherein the Tension Between the Tractability of Empirical Risk Minimisation and the Open-Ended, Path-Dependent Nature of Human Life Is Examined, With Scalar Loss Functions Found to Impose a Static Distribution Upon an Irreducibly Contingent World.
Very WIP 🚧TODO🚧
Loss functions! The engine of modern machine learning.
Recall that modern machine learning is built around loss functions, and they are in practice scalar valuations of the badness of a model’s output. Formally, in standard machine learning we usually assume data points \((x, y)\) are drawn i.i.d. from some fixed but unknown distribution \(P\) over input-label pairs, where the covariates or predictors \(x\) are real vectors in \(\mathbb{R}^d\) for some fixed, known \(d\).
A loss function \(\ell\) is then a map from predictions and observed targets/labels to some real number that tells us “how bad” the prediction was:
\[\ell: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}\]
Given a sample \((x, y) \sim P\), where \(x \in \mathbb{R}^d\) is the input vector, \(y \in \mathcal{Y}\) is the true label, and \(\hat{y} = h_\theta(x) \in \mathcal{Y}\) is the model’s prediction under parameters \(\theta\), the loss at a single example is:
\[\ell(y, h_\theta(x)) \in \mathbb{R}\]
The quantity we actually care about is the expected risk under the data-generating distribution:
\[R(\theta) = \mathbb{E}_{(x,y) \sim P}[\ell(y, h_\theta(x))]\]
Since \(P\) is unknown, we replace it with a finite sample \(\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n\) and minimise the empirical risk:
\[L(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, h_\theta(x_i))\]
This is the finite-sample approximation to expected risk minimisation, usually called empirical risk minimisation (ERM). The choice of \(\ell\) encodes our assumptions about the noise model and the cost of errors.
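To make the recipe concrete, here is a minimal sketch of ERM in NumPy. Everything specific in it is an assumption chosen for illustration: the toy distribution \(P\) (a noisy linear model with a made-up `W_TRUE`), the linear hypothesis \(h_\theta(x) = x^\top \theta\), squared-error loss for \(\ell\), and plain gradient descent as the optimiser. It compares the empirical risk \(L(\theta)\) on the training sample with an approximation of the expected risk \(R(\theta)\) from a large fresh sample.

```python
import numpy as np

rng = np.random.default_rng(0)
W_TRUE = np.array([1.0, -2.0, 0.5])  # hypothetical "true" parameters of the toy P

def sample(n):
    """Draw n i.i.d. pairs (x, y) from a toy P: y = x @ W_TRUE + noise."""
    x = rng.normal(size=(n, W_TRUE.size))
    y = x @ W_TRUE + 0.1 * rng.normal(size=n)
    return x, y

def empirical_risk(theta, x, y):
    """L(theta): mean squared-error loss over the sample."""
    return np.mean((y - x @ theta) ** 2)

def fit_erm(x, y, lr=0.1, steps=500):
    """Minimise L(theta) by plain gradient descent."""
    theta = np.zeros(x.shape[1])
    for _ in range(steps):
        grad = -2.0 * x.T @ (y - x @ theta) / len(y)
        theta -= lr * grad
    return theta

x_train, y_train = sample(200)
theta_hat = fit_erm(x_train, y_train)

# Approximate the expected risk R(theta) with a large fresh sample from P.
x_big, y_big = sample(100_000)
train_risk = empirical_risk(theta_hat, x_train, y_train)
approx_true_risk = empirical_risk(theta_hat, x_big, y_big)
print(train_risk, approx_true_risk)
```

In this benign setting the two numbers should land close together, which is the whole promise of ERM: when \(P\) really is fixed and we really do sample from it, the sample average stands in for the expectation.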
There are a few things to note about these bad boys:
- We are, as a civilisation, really good at choosing \(\theta\) to minimise \(L(\theta)\), even in high dimensions, even for horrible \(P\), even for very weird data, thanks to gradient-based optimisation and stochastic approximations.
- We generally choose \(\ell\) for computational convenience, not because it is a perfect reflection of the true cost of errors in the real world, which is part of why we are so good at the first point.
- Notice we assumed that the data distribution \(P\) is fixed and static. This is a strong assumption. Sometimes we can make it nearly true by choosing static pieces of the world to model, or conditioning in some clever way, or changing the world to make it true, but in general it is not very true and we jus’ frontin’.
- If you can approximate your real problem with an optimisation like this, you generally win, in the sense of shipping fancy products fast.
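The static-\(P\) assumption in the list above is easy to break on purpose. The following is a toy sketch, not a claim about any real system: the "world" (\(y = \sin x\) plus noise), the input ranges, and the straight-line model are all assumptions made up for illustration. We fit by least squares on inputs from one range, then measure the risk on fresh draws from the same range and from a shifted one.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n, low, high):
    """Toy world: y = sin(x) + noise, with x uniform on [low, high]."""
    x = rng.uniform(low, high, size=n)
    y = np.sin(x) + 0.05 * rng.normal(size=n)
    return x, y

def risk(coeffs, x, y):
    """Mean squared-error loss of the fitted line on a sample."""
    return np.mean((y - np.polyval(coeffs, x)) ** 2)

# Fit a straight line by least squares (ERM with squared loss) on P.
x_train, y_train = sample(500, 0.0, 1.0)
coeffs = np.polyfit(x_train, y_train, deg=1)

# Same P: fresh draws from the training range.
risk_same = risk(coeffs, *sample(5000, 0.0, 1.0))
# Shifted P': inputs come from a range the model never saw.
risk_shifted = risk(coeffs, *sample(5000, 2.0, 3.0))
print(risk_same, risk_shifted)
```

The fitted line is a perfectly good minimiser of the empirical risk; nothing in the optimisation went wrong. The risk under the shifted inputs is still far larger, because the guarantee only ever concerned the distribution we sampled from.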
This notebook is about that pressure to remake the world into parts that can be well-described by static distributions with scalar loss functions, and to wonder what that does to us.
1 Let us release some woo pressure
Point of order: I am not here to pitch you on the more mystical arguments against ever conflating a human being with a number. Sometimes quantification and empirical risk minimisation work. Quantitative methods are real methods. Measuring the population is a great way to work out if we have enough food for the population. Predicting the weather is a great way to work out if we should take an umbrella.
That ERM is unreasonably effective and brings great benefit is, in a sense, the “problem” I want to address here. So many things can be made to go via ERM, and so many things now depend upon it, that I wonder what case remains for the things for which it does not work.
The most gob-smackingly astonishing place that ERM works is, famously, Large Language Models, where minimising the error on next-word prediction might yet learn to supplant all human intellectual labour. I am still astonished that we live in the transformers-work world.
2 Relaxations
Various of the ERM assumptions have been relaxed in diverse ways across modern AI infrastructure. This is the local variant of Sutton’s famous bitter lesson about “methods that leverage computation” being the most effective; the way we leverage that computation is to make things look like ERM.
- We don’t always take the distribution \(P\) as fixed. For example, online methods learn something like a “recently good” model, and are happy to forget older things.
- We let the model range over more interesting spaces than a fixed-size input-output distribution. For example, LLMs do this, predicting whole conversations. They manage it by decomposing the sequence prediction problem into a chain of next-token predictions, each conditioned on what came before, which lets a fixed-form predictor handle variable-length, structured data.
- We let the models be more than passive predictors. For example, they can learn to choose actions, in, e.g. reinforcement learning or Bayesian optimisation.
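The first relaxation above, learning a “recently good” model, can be sketched in a few lines. This is a toy under loudly stated assumptions: a one-parameter linear model, a hypothetical data stream whose true slope flips sign halfway through, and constant-step-size SGD standing in for “online methods” generally. It is compared against a least-squares fit over the whole history, which forgets nothing.

```python
import numpy as np

rng = np.random.default_rng(2)

def stream(n_steps):
    """Yield (x, y) where the true slope flips halfway: P is not static."""
    for t in range(n_steps):
        w_true = 1.0 if t < n_steps // 2 else -1.0
        x = rng.normal()
        y = w_true * x + 0.1 * rng.normal()
        yield x, y

def online_sgd(data, lr):
    """One squared-loss SGD step per example, with constant step size lr."""
    w = 0.0
    for x, y in data:
        w -= lr * 2.0 * (w * x - y) * x
    return w

def full_history_fit(data):
    """Least squares over the whole stream (forgets nothing)."""
    sxy = sxx = 0.0
    for x, y in data:
        sxy += x * y
        sxx += x * x
    return sxy / sxx

w_online = online_sgd(stream(4000), lr=0.02)
w_batch = full_history_fit(stream(4000))
print(w_online, w_batch)
```

The constant step size makes old examples decay geometrically in influence, so the online estimate should end up near the current slope of \(-1\), while the full-history fit averages the two regimes and lands near zero, a model that is right about neither. Note that each step is still a gradient step on a scalar loss; the formalism has been stretched, not replaced.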
These can get ERM to do surprising things; still within, I argue, basically the same formalism, with all the unearned tractability that gives us.
RL is super interesting. Sutton and co-authors famously argued very hard that maximising a scalar reward (equivalently, minimising a scalar loss) was enough to produce general agents (Silver et al. 2021). This was controversial; there have been various responses, for example making the reward at least vector-valued (Vamplew et al. 2022).
3 Against utility
As I am prone to rant to strangers at the bus-stop, a cognitive danger of machine learning is reification of utility functions. Utility functions are an interesting method of analysis, but clearly maladapted for some purposes.
For one thing, if we look at their usage in economics, utilities are induced valuations over allocations of goods.
It is one thing to note that humans don’t think this way. It is another to note that the world does not operate this way either.
My entire life does not cash out in my acquiring a Pareto-optimal allocation of goods subject to the initial endowment. Rather, my life is some complicated exploration of ways of being, learning, acquiring tastes and losing them, changing and burning out, falling in love, fighting my nemesis, building friend groups and families and institutions and ultimately dying — and what the fuck did that just optimise? From what distribution \(P\) was my life drawn? Could it exist even notionally? Was it the same \(P\) as the generation before? The one that comes next?
Human beings strive not for bundles of goods but for complicated, contingent, path-dependent, interacting situations that they co-create with the world itself. If trading apples for oranges helps along the way, all the better, but this seems only ever an incidental goal.
Elsewhere this rejection of an underlying utility function has been described as a thick model of value, although I won’t do so here, because it is not just the thickness of this concept that I want to consider, but the open-endedness, the contingency.
4 Static distributions and closed systems
LLMs explore an interesting space: they seem to be able to describe an effectively unlimited world of imagination. Their training technology still looks a hell of a lot like ERM.
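One way to make “looks a hell of a lot like ERM” concrete, as a sketch of the standard autoregressive setup rather than of any particular model: the symbols \(w_t\), \(T_i\) and \(p_\theta\) are notation introduced here for illustration. A token sequence \(w_{1:T}\) is scored via the chain rule of probability,
\[p_\theta(w_{1:T}) = \prod_{t=1}^{T} p_\theta(w_t \mid w_{<t})\]
and pretraining minimises the average negative log-likelihood over a corpus \(\mathcal{D}\) of \(n\) sequences (up to the choice of normalisation, which is often per token rather than per sequence):
\[L(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{t=1}^{T_i} \log p_\theta\left(w^{(i)}_t \mid w^{(i)}_{<t}\right)\]
This has the same shape as the empirical risk defined earlier, with the context \(w_{<t}\) playing the role of the input \(x\), the next token \(w_t\) the label \(y\), the prediction being a distribution over next tokens, and \(\ell\) the log loss. The static-distribution assumption has not gone away either: the corpus is still treated as a sample from some fixed \(P\) over texts.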
How do they do this wizardry?
5 Empowerment must concern outcomes outside the data distribution
The minimizers have only interpreted \(\mathcal{D}\)-world in various ways; the point, however, is to change \(P\). Some of this is addressed in open-ended intelligence.
6 Incoming
\[ c=\mathbb{E}_{(x)\sim P}[h_\theta(x)] \]
This should connect to
