Unreasonable effectiveness of empirical risk minimization

General methods that leverage computation, here meaning ones with scalar loss functions and a static data distribution, are ultimately the most profitable, and by a large margin

2026-06-24 — 2026-06-25

Wherein the Tension Between the Tractability of Empirical Risk Minimisation and the Open-Ended, Path-Dependent Nature of Human Life Is Examined, With Scalar Loss Functions Found to Impose a Static Distribution Upon an Irreducibly Contingent World.

bounded compute

classification

collective knowledge

culture

ethics

incentive mechanisms

machine learning

optimization

sociology

statmech

utility

when to compute

wonk

Very WIP 🚧TODO🚧

Loss functions! The engine of modern machine learning.

Recall that modern machine learning is built around loss functions, and they are in practice scalar valuations of the badness of a model’s output. Formally, in standard machine learning we usually assume data points \((x, y)\) are drawn i.i.d. from some fixed but unknown distribution \(P\) over input-label pairs, where the covariates or predictors \(x\) are real vectors in \(\mathbb{R}^d\) for some fixed, known \(d\).

A loss function \(\ell\) is then a map from predictions and observed targets/labels to some real number that tells us “how bad” the prediction was:

\[\ell: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}\]

Given a sample \((x, y) \sim P\), where \(x \in \mathbb{R}^d\) is the input vector, \(y \in \mathcal{Y}\) is the true label, and \(\hat{y} = h_\theta(x) \in \mathcal{Y}\) is the model’s prediction under parameters \(\theta\), the loss at a single example is:

\[\ell(y, h_\theta(x)) \in \mathbb{R}\]

The quantity we actually care about is the expected risk under the data-generating distribution:

\[R(\theta) = \mathbb{E}_{(x,y) \sim P}[\ell(y, h_\theta(x))]\]

Since \(P\) is unknown, we replace it with a finite sample \(\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n\) and minimise the empirical risk:

\[L(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, h_\theta(x_i))\]

This is the finite-sample approximation to expected risk minimisation, usually called empirical risk minimisation (ERM). The choice of \(\ell\) encodes our assumptions about the noise model and the cost of errors.
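To make the definitions concrete, here is a minimal sketch of ERM in NumPy. Everything specific in it is my assumption for illustration: the toy distribution `sample_from_P`, the linear model `h`, the squared-error `loss`, and the parameter vector `theta_true`. It fits \(\theta\) on a finite sample and then compares the empirical risk \(L(\theta)\) with a Monte Carlo stand-in for the expected risk \(R(\theta)\).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
theta_true = np.array([1.0, -2.0, 0.5])  # hypothetical parameters defining the toy P


def sample_from_P(n):
    """Draw n i.i.d. pairs (x, y) from a toy distribution P (an assumption)."""
    x = rng.normal(size=(n, d))
    y = x @ theta_true + rng.normal(scale=0.5, size=n)
    return x, y


def h(theta, x):
    """Linear model h_theta(x)."""
    return x @ theta


def loss(y, y_hat):
    """Squared-error loss ell(y, y_hat)."""
    return (y - y_hat) ** 2


def empirical_risk(theta, x, y):
    """L(theta): the mean loss over a finite sample."""
    return float(np.mean(loss(y, h(theta, x))))


# A finite training sample D stands in for the unknown P.
x_train, y_train = sample_from_P(200)

# For squared loss and a linear model, the empirical risk minimiser is least squares.
theta_hat, *_ = np.linalg.lstsq(x_train, y_train, rcond=None)

# Approximate the expected risk R(theta_hat) with a large fresh sample from P.
x_big, y_big = sample_from_P(200_000)
train_risk = empirical_risk(theta_hat, x_train, y_train)
approx_expected_risk = empirical_risk(theta_hat, x_big, y_big)

print(f"empirical risk L(theta_hat): {train_risk:.3f}")
print(f"approximate expected risk R(theta_hat): {approx_expected_risk:.3f}")
```

In this toy setting we can sample from \(P\) at will, which is exactly what we cannot do in practice; the fresh-sample estimate is only there to show the gap that ERM papers over.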

There are a few things to note about these bad boys:

1. We are, as a civilisation, really good at choosing \(\theta\) to minimise \(L(\theta)\), even in high dimensions, even for horrible \(P\), even for very weird data, thanks to gradient-based optimisation and stochastic approximations.
2. We generally choose \(\ell\) for computational convenience, not because it perfectly reflects the true cost of errors in the real world, which is part of why we are so good at #1.
3. Notice that we assumed the data distribution \(P\) is fixed and static. This is a strong assumption. Sometimes we can make it nearly true by choosing static pieces of the world to model, or conditioning in some clever way, or changing the world to make it true, but in general it is not very true and we jus’ frontin’.
4. If you can approximate your real problem with an optimisation like this, you generally win, in the sense of shipping fancy products fast.
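The first two points above can be illustrated with a small sketch. The setup is entirely my assumption: a toy binary classification problem, the 0-1 loss standing in for "the cost we actually care about", and the logistic loss as the convenient surrogate. The 0-1 risk is piecewise constant in the parameters, so its gradient is zero almost everywhere; the logistic risk is smooth, so minibatch stochastic gradient descent works on it.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy binary classification data (an assumption): labels in {-1, +1}.
n, d = 2000, 2
w_true = np.array([2.0, -1.0])  # hypothetical direction that generates the labels
x = rng.normal(size=(n, d))
y = np.where(x @ w_true + rng.normal(scale=0.5, size=n) > 0, 1.0, -1.0)


def zero_one_risk(w):
    """Empirical 0-1 risk: piecewise constant in w, so useless for gradients."""
    pred = np.where(x @ w >= 0, 1.0, -1.0)
    return float(np.mean(pred != y))


def logistic_risk(w, xb, yb):
    """Empirical logistic risk: a smooth convex surrogate for the 0-1 loss."""
    return float(np.mean(np.logaddexp(0.0, -yb * (xb @ w))))


def logistic_grad(w, xb, yb):
    """Gradient of the logistic risk on a batch."""
    margins = yb * (xb @ w)
    # d/dm log(1 + exp(-m)) = -1 / (1 + exp(m))
    coeff = -yb / (1.0 + np.exp(margins))
    return (coeff[:, None] * xb).mean(axis=0)


w = np.zeros(d)
step, batch = 0.5, 32  # hypothetical step size and minibatch size
for t in range(500):
    idx = rng.integers(0, n, size=batch)  # stochastic approximation: a random minibatch
    w -= step * logistic_grad(w, x[idx], y[idx])

print("0-1 risk at w = 0:", zero_one_risk(np.zeros(d)))
print("0-1 risk after SGD on the surrogate:", zero_one_risk(w))
print("logistic risk after SGD:", logistic_risk(w, x, y))
```

We never optimise the 0-1 risk directly here; it comes down only as a side effect of minimising the loss that was chosen because it is easy to minimise.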

This notebook is about that pressure to remake the world into parts that can be well-described by static distributions with scalar loss functions, and to wonder what that does to us.

1 Let us release some woo pressure

Point of order: I am not here to pitch you on the more mystical arguments against ever conflating a human being with a number. Sometimes quantification and empirical risk minimisation work. Quantitative methods are real methods. Measuring the population is a great way to work out if we have enough food for the population. Predicting the weather is a great way to work out if we should take an umbrella.

That this is unreasonably effective and brings great benefit is, in a sense, the “problem” I want to address here. So many things can be made to go via ERM, and so many things now depend upon it, that I wonder what the case is for the things for which it does not work.

The most gob-smackingly astonishing place that ERM works is, famously, Large Language Models, where minimising the error on next-word prediction might yet learn to supplant all human intellectual labour. I am still astonished that we live in the transformers-work world.

2 Relaxations

Various of the ERM assumptions have been relaxed in diverse ways across modern AI infrastructure. This is the local variant of Sutton’s famous bitter lesson about “methods that leverage computation” being the most effective; the way we leverage that computation is to make things look like ERM.

We don’t always take the distribution \(P\) as fixed. For example, online methods learn something like a “recently good” model, and are happy to forget older things.
We let the model range over more interesting spaces than a fixed-size input-output distribution. For example, LLMs do this, predicting whole conversations. They can because they ingeniously decompose the sentence prediction problem into smaller, manageable components, which lets them handle complex, structured data.
We let the models be more than passive predictors. For example, they can learn to choose actions, in, e.g. reinforcement learning or Bayesian optimisation.
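
The decomposition in the LLM example can be written out. This is a sketch in standard autoregressive notation, which is my assumption rather than anything fixed above: a sequence of tokens \(w_1, \dots, w_T\), a model \(p_\theta\), and \(w_{<t}\) for the tokens before position \(t\). The chain rule of probability factorises a whole sequence into next-token predictions:

\[p_\theta(w_1, \dots, w_T) = \prod_{t=1}^{T} p_\theta(w_t \mid w_{<t})\]

Taking the log loss on each next token and averaging over a corpus of \(n\) sequences, with \(w^{(i)}_t\) the \(t\)-th token of sequence \(i\) and \(T_i\) its length, gives an objective with the same shape as \(L(\theta)\):

\[L(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{t=1}^{T_i} \log p_\theta\left(w^{(i)}_t \mid w^{(i)}_{<t}\right)\]

Each sequence plays the role of a data point \((x_i, y_i)\), and the summed log loss plays the role of \(\ell\), so the variable-length problem is still a scalar loss averaged over a sample.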

These relaxations get ERM to do surprising things while staying, I argue, within basically the same formalism, with all the unearned tractability that gives us.
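As a small illustration of the "recently good" idea, here is a sketch under my own assumptions: a one-dimensional target whose mean drifts linearly, squared-error loss, and a hypothetical forgetting rate `gamma`. The running mean minimises the plain average loss over all past data, as if \(P\) were fixed. The constant-step update is still gradient descent on a scalar loss, but it down-weights old points geometrically.

```python
import numpy as np

rng = np.random.default_rng(2)

# A drifting target (an assumption): the mean of y moves slowly over time,
# so there is no single fixed P to estimate.
T = 2000
true_mean = np.linspace(0.0, 5.0, T)
y = true_mean + rng.normal(scale=1.0, size=T)

gamma = 0.05  # hypothetical forgetting rate, used as a constant step size

static_est = 0.0  # minimiser of the plain average squared loss so far
online_est = 0.0  # constant-step SGD on the squared loss

static_err, online_err = [], []
for t in range(T):
    # Score each estimate before it sees y[t].
    static_err.append((y[t] - static_est) ** 2)
    online_err.append((y[t] - online_est) ** 2)

    # Running mean: every past point is weighted equally.
    static_est += (y[t] - static_est) / (t + 1)
    # SGD step on 0.5 * (y - m)^2: the weight of old points decays like (1 - gamma)^age.
    online_est += gamma * (y[t] - online_est)

print("mean squared error, running mean:", np.mean(static_err))
print("mean squared error, with forgetting:", np.mean(online_err))
```

The forgetting estimator tracks the drift better, yet nothing about the formalism has changed: it is the minimiser of a reweighted empirical risk.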

RL is super interesting. Sutton famously argued very hard that maximising a scalar reward (equivalently, minimising a scalar loss) was fine for producing general agents (Silver et al. 2021). This was controversial; responses include, for example, making the reward at least vector-valued (Vamplew et al. 2022).¹

3 Against utility

As I am prone to rant to strangers at the bus-stop, a cognitive danger of machine learning is reification of utility functions. Utility functions are an interesting method of analysis, but clearly maladapted for some purposes.

For one thing, if we look at their usage in economics, utilities are induced valuations over allocations of goods.

It is one thing to note that humans don’t think this way. It is another to note that the world does not even operate this way.

My entire life does not cash out in my acquiring a Pareto-optimal allocation of goods subject to the initial endowment. Rather, my life is some complicated exploration of ways of being, learning, acquiring tastes and losing them, changing and burning out, falling in love, fighting my nemesis, building friend groups and families and institutions and ultimately dying — and what the fuck did that just optimise? From what distribution \(P\) was my life drawn? Could it exist even notionally? Was it the same \(P\) as the generation before? The one that comes next?

Human beings strive for, not bundles of goods, but complicated, contingent, path-dependent, interacting situations that they co-create with the world itself. If trading apples for oranges helps along the way, all the better, but this seems only ever an incidental goal.

Elsewhere this rejection of an underlying utility function has been described as a thick model of value, although I won’t do so here, because it is not just the thickness of this concept that I want to consider, but the open-endedness, the contingency.

4 Static distributions and closed systems

LLMs explore an interesting space: they seem to be able to describe an effectively unlimited world of imagination. Their training technology still looks a hell of a lot like ERM.

How do they do this wizardry?

5 Empowerment must concern outcomes outside the data distribution

The minimizers have only interpreted \(\mathcal{D}\)-world in various ways; the point, however, is to change \(P\). Some of this is addressed in open-ended intelligence.

6 Edge cases

Mis-specified Bayes: there are totally theorems for open-world Bayesian inference, at least in restricted domains. What can we learn from there?
(Generative) Variational Search (Steinberg et al. 2024)
Kevin Kelly’s Latent Space as a New Medium points out, in his characteristically smooth prose, that the latent space is extremely large, maybe larger than we can conceive.
Renormalization group seems to be a live agenda for coarse-graining in ML, and suggests some useful approaches to the problem of unobservable detail.

7 Incoming

\[ c=\mathbb{E}_{(x)\sim P}[h_\theta(x)] \]

This should connect to

8 References

Bergadano. 1991. “The Problem of Induction and Machine Learning.” In Proceedings of the 12th International Joint Conference on Artificial Intelligence-Volume 2.

boyd. 2023. “The Structuring Work of Algorithms.” Daedalus.

Egan. 2026. “Towards a Political Economy of Algorithmic Capitalism.” Capital & Class.

Farrell, and Fourcade. 2023. “The Moral Economy of High-Tech Modernism.” Daedalus.

Habaraduwa. 2024. “Inductive Models for Artificial Intelligence Systems Are Insufficient Without Good Explanations.”

Hardt. 2026. The Emerging Science of Machine Learning Benchmarks.

Hesselberth, Houwen, Peeren, et al. 2018. Legibility in the Age of Signs and Machines.

Kasy. 2026. “The Political Economy of AI: Toward Democratic Control of the Means of Prediction.” In The Oxford Handbook of Algorithmic Governance and the Law.

Laan. 2026. “A Researcher’s Guide to Empirical Risk Minimization.”

Mason. 2026. “From Scalars to Tensors: Declared Losses Recover Epistemic Distinctions That Neutrosophic Scalars Cannot Express.”

Norton. 2021. “The Material Theory of Induction.”

Prado. 2023. “Automated Patterns of Culture: Philosophy and Machine Learning.”

Raghavan. 2021. “The Societal Impacts of Algorithmic Decision-Making.”

Silver, Singh, Precup, et al. 2021. “Reward Is Enough.” Artificial Intelligence.

Steinberg, Oliveira, Ong, et al. 2024. “Variational Search Distributions.”

Steinberg, Wijesinghe, Oliveira, et al. 2025. “Amortized Active Generation of Pareto Sets.” In.

Sudhakar. 2026. “Biological Neural Networks as Substrate for Philosophical Constructs: A Unified Framework.”

Taylor, and Dorin. 2020. Rise of the Self-Replicators: Early Visions of Machines, AI and Robots That Can Reproduce and Evolve.

Valizada. 2026. “Artificial Intelligence and the Design of Politics in the Modern World.” Acta Globalis Humanitatis Et Linguarum.

Vamplew, Smith, Källström, et al. 2022. “Scalar Reward Is Not Enough: A Response to Silver, Singh, Precup and Sutton (2021).” Autonomous Agents and Multi-Agent Systems.

Venkatasubramanian, Scheidegger, Friedler, et al. 2021. “Fairness in Networks: Social Capital, Information Access, and Interventions.” In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. KDD ’21.

Zhi-Xuan, Carroll, Franklin, et al. 2025. “Beyond Preferences in AI Alignment.” Philosophical Studies.

Footnotes

¹ FWIW, I cannot even work out what the thesis of Silver et al. (2021) is. It seems too under-specified to even agree with.