Agent foundations
Between humans and machines
2025-08-05 — 2025-09-27
Wherein a formal goal is examined, and a counterfactual question-answer scheme called QACI is described, whereby information blobs are located, rewritten, and scored across possible computational worlds.
Agent foundations is the branch of AI alignment that tries to answer: if we were to build a superintelligent system from scratch, what clean, mathematical objective could we give it so that it robustly does what we want, even if we cannot understand the system ourselves? Unlike interpretability (which inspects black-box models) or preference learning (which tries to extract human values), agent foundations is about first principles: designing an agent that’s “aligned by construction.”
So what might this look like formalised? The central idea is to create a “formal goal” that is entirely made of math—not of human words with underspecified meaning.
There seem to be a few speculative proposals in this space, including QACI, indirect normativity, and utility indifference; these are representative of the kind of formalism agent foundations explores.
1 QACI: Question-Answer Counterfactual Interval
One speculative proposal in the agent-foundations space is QACI, the “question-answer counterfactual interval”. It is not the whole of agent foundations — just one of the weirder and more formalised sketches of what a “clean mathematical goal” might look like.
Very roughly: QACI tries to give an agent a way of scoring its actions by asking counterfactual questions. The picture is: if I acted one way rather than another, how would the answer to a particular well-posed question about the world come out differently? The hope is that if we can ground goals in question-answer pairs, and measure how much the agent’s action shifts those answers across counterfactual worlds, we get something precise enough to avoid ontology trouble and maybe even Goodharting.
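To make the shape of that idea concrete, here is a toy sketch in Python. It is not QACI itself: the world model (`run_world`), the question (`question_outcome`), and the scoring rule in `score` are all invented stand-ins, chosen only to show “score an action by how much it shifts the answer to a fixed question across counterfactual worlds”.

```python
# Toy sketch: score an action by how the answer to a fixed, well-posed question
# would differ under counterfactual actions. All of this is illustrative only;
# it is not the QACI formalism.

from typing import Callable

World = bytes  # in this toy, a "world" is just a blob of data

def run_world(action: str) -> World:
    # Hypothetical deterministic world model: the world is a function of the action.
    return f"weather:sunny;action:{action};outcome:{len(action) % 3}".encode()

def question_outcome(world: World) -> str:
    # A fixed question about the world: "what is the outcome field?"
    fields = dict(kv.split(":", 1) for kv in world.decode().split(";"))
    return fields["outcome"]

def answer(question: Callable[[World], str], world: World) -> str:
    return question(world)

def score(action: str, counterfactual_actions: list[str]) -> float:
    """Fraction of counterfactual actions under which the answer would have
    come out differently. (A made-up scoring rule, for illustration.)"""
    actual = answer(question_outcome, run_world(action))
    changed = sum(
        answer(question_outcome, run_world(a)) != actual
        for a in counterfactual_actions
    )
    return changed / len(counterfactual_actions)

if __name__ == "__main__":
    actions = ["wait", "press_button", "do_nothing_at_all"]
    for act in actions:
        others = [a for a in actions if a != act]
        print(f"{act}: {score(act, others):.2f}")
```

The point of the sketch is only the shape of the computation: the agent never consults a human-language description of the goal, just the counterfactual dependence of an answer on its action.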
Here are the core ingredients, as I understand them:
- Counterfactuals — The system doesn’t just evaluate the actual world, but counterfactual ones: “what if this string of data were replaced with that one?”
- Blob location — The zany part. This is about locating a bitstring (the “blob”) across possible computational universes. We define functions that pick out the blob from different states, and functions that rewrite the blob to see what happens under perturbation (a toy sketch of this follows the list).
- Ontological robustness — The value proposition. Instead of pointing at “chairs” or “trees” (which depend on messy, shifting ontologies), the system points at raw information. That should, in theory, let the goal keep its meaning even when the system’s ontology of the world gets rebuilt at higher levels of intelligence.
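For the blob-location ingredient specifically, here is a minimal toy, again with invented helpers (`locate_blob`, `rewrite_blob`, `downstream`): the “universe state” is just a byte string, the locator is a naive substring search, and the real proposal’s quantification over programs and universes is not attempted.

```python
# Toy sketch of blob location and counterfactual rewriting.
# Everything here is a simplification for illustration, not the actual proposal.

from typing import Optional

def locate_blob(state: bytes, blob: bytes) -> Optional[int]:
    """Return the offset of the blob inside the state, or None if absent."""
    idx = state.find(blob)
    return idx if idx != -1 else None

def rewrite_blob(state: bytes, blob: bytes, replacement: bytes) -> bytes:
    """Produce the counterfactual state in which the blob is replaced."""
    idx = locate_blob(state, blob)
    if idx is None:
        raise ValueError("blob not found in this state")
    return state[:idx] + replacement + state[idx + len(blob):]

def downstream(state: bytes) -> int:
    # Stand-in for "whatever computation the rest of the universe runs on this state".
    return sum(state) % 256

if __name__ == "__main__":
    universe = b"...header...QUESTION=what should I do?...footer..."
    blob = b"QUESTION=what should I do?"
    counterfactual_blob = b"QUESTION=what would the user endorse?"
    rewritten = rewrite_blob(universe, blob, counterfactual_blob)
    print(downstream(universe), downstream(rewritten))
```

Comparing `downstream(universe)` with `downstream(rewritten)` is the toy analogue of asking how the rest of the computation depends on the blob’s contents; the system’s goal refers to the blob and its counterfactual rewrites, not to any particular ontology of objects.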
The promise is attractive, I guess: give the AI a mathematically exact target that refers only to information and counterfactual dependence, and maybe we get behaviour that’s safe and robust.
AFAICT QACI remains speculative and hard to parse. There are a lot of axioms to buy into to make it look even remotely feasible. It’s not obvious how to connect “blobs” to real-world referents, or why this formulation really sidesteps Goodhart’s Law. Even insiders hedge on whether it’s the right line to pursue. Still, it illustrates the flavour of what agent-foundations research is trying to do: carve a trustworthy goal into the inviolable substrate of mathematics so that even a superintelligence could be trusted to pursue it.