Debate and interactive proof

2025-01-17 — 2026-02-13

Wherein neural nets are represented as prover‑verifier games, and neural interactive proofs are shown to capture PSPACE and NEXP decision procedures, with zero‑knowledge variants obtained for certain protocols.

compsci
machine learning
making things
neural nets
Figure 1: Bob hides his proof from Alice.

Placeholder. Various notes on interactive proofs, debate, and related protocols.

1 Computational cost of explaining complex reasoning

Intuitively, explaining the reasoning of a large neural network should be computationally difficult. Surely simpler things cannot explain more complex things?

As an extreme example, we might run into incompleteness results. (Prove the consistency of the following system of axioms…) I can believe such extreme esoterica would not be relevant in practice, but could nearby problems still arise? TBC.

2 Neural interactive proofs

Neural Interactive Proofs (Hammond and Adam-Day 2025):

The main theoretical challenges are to: i) represent a given protocol in the form of a prover-verifier game (Kirchner et al. 2024); and ii) train the models to approximate the right equilibria of this game. While the first challenge is reasonably straightforward, the power of different protocols can vary greatly depending on several subtle details such as the number of messages the agents can send to each other, their ability to randomise, and whether messages can be sent privately to different agents. By taking these subtleties into account, we can show an equivalence between the equilibria of the game and valid proof systems for a range of different protocols.
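
The flavour of a prover-verifier game can be sketched with a toy debate over an arithmetic claim, in the spirit of the debate protocol cited above (the setup and names here are illustrative, not taken from any of the papers). The claim is the sum of a long array; the verifier cannot afford to sum it, but can check a single element. The prover asserts a subtotal for the left half at each round, the opponent recurses into the half it disputes, and after logarithmically many rounds the verifier performs one cheap leaf check:

```python
def debate_sum(xs, claimed, prover, opponent):
    """Resolve whether sum(xs) == claimed using a single leaf check.

    prover(xs, lo, mid) returns a claimed subtotal for xs[lo:mid];
    opponent(xs, lo, mid, c) returns True to dispute that claim
    (recursing left) and False to recurse right instead.
    """
    lo, hi = 0, len(xs)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        left_claim = prover(xs, lo, mid)    # prover commits to left subtotal
        right_claim = claimed - left_claim  # right subtotal is then implied
        if opponent(xs, lo, mid, left_claim):
            hi, claimed = mid, left_claim   # dispute the left half
        else:
            lo, claimed = mid, right_claim  # dispute the right half
    return xs[lo] == claimed                # verifier's single cheap check

# Honest strategies: report true subtotals; dispute exactly the false ones.
honest_prover = lambda xs, lo, mid: sum(xs[lo:mid])
honest_opponent = lambda xs, lo, mid, c: sum(xs[lo:mid]) != c
```

With honest play, a true claim survives the recursion and a false claim is driven down to a leaf where the lie is exposed, which is the intuition behind debate reaching far beyond what the verifier could check directly.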

Table 1: (from Hammond and Adam-Day (2025)) A comparison of the various proof protocols we discuss in our work. The “Complexity” column denotes the corresponding complexity class of decision problems that can be solved when represented as a (generalised) prover-verifier game played between unbounded provers and probabilistic polynomial time verifiers.
Model     Rounds   Complexity   Zero-knowledge
adp       2        NP           —
debate    $T$      PSPACE       —
mac       2        MA           —
nip       $T$      PSPACE       —
mnip      $T$      NEXP         —
zk-nip    $T$      PSPACE       ✓
zk-mnip   $T$      NEXP         ✓

The second theoretical challenge arises because the equilibria that form this equivalence are (approximate) Stackelberg equilibria over the worst-case loss, which are difficult to optimise for using conventional machine learning algorithms. We discuss several approaches to overcoming this challenge, including the use of Stackelberg policy gradient and opponent-shaping algorithms to approximate Stackelberg equilibria, and the efficacy of average-case optimisation and adversarial training when it comes to minimising worst-case losses.
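
The difficulty is that the leader must account for how the follower responds to its choice, not just for the joint loss at the current point. A minimal sketch of that bilevel structure, differentiating through the follower's best response with finite differences (a crude stand-in for the Stackelberg policy gradient methods mentioned above; the quadratic losses here are illustrative, not from the paper):

```python
def follower_best_response(x, steps=200, lr=0.1):
    """Follower minimises g(x, y) = (y - x)^2 by gradient descent, so y* ~= x."""
    y = 0.0
    for _ in range(steps):
        y -= lr * 2 * (y - x)
    return y

def leader_loss(x):
    """Leader loss f(x, y*) = (x - 1)^2 + x * y*, evaluated at the
    follower's (approximate) best response y* rather than a fixed y."""
    y = follower_best_response(x)
    return (x - 1) ** 2 + x * y

def stackelberg_descent(x=0.0, steps=400, lr=0.05, eps=1e-4):
    """Leader descends the TOTAL derivative of its loss through the
    follower's response, estimated by central finite differences."""
    for _ in range(steps):
        grad = (leader_loss(x + eps) - leader_loss(x - eps)) / (2 * eps)
        x -= lr * grad
    return x
```

Since y* = x here, the leader's effective objective is (x - 1)^2 + x^2, minimised at x = 0.5; a naive simultaneous gradient on f at fixed y would head somewhere else entirely, which is the core of why these equilibria are awkward for conventional training loops.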

3 Prover-Estimator games

TBD

4 References

Abbaszadeh, Pappas, Katz, et al. 2024. “Zero-Knowledge Proofs of Training for Deep Neural Networks.”
Allen-Zhu, and Xu. 2025. “DOGE: Reforming AI Conferences and Towards a Future Civilization of Fairness and Justice.” SSRN Scholarly Paper.
Barnes. 2020. “Debate Update: Obfuscated Arguments Problem.”
Barnes, and Christiano. 2020. “Writeup: Progress on AI Safety via Debate.”
Brown-Cohen, Irving, and Piliouras. 2025. “Avoiding Obfuscation with Prover-Estimator Debate.”
Hammond, and Adam-Day. 2025. “Neural Interactive Proofs.”
Irving, Christiano, and Amodei. 2018. “AI Safety via Debate.”
Kirchner, Chen, Edwards, et al. 2024. “Prover-Verifier Games Improve Legibility of LLM Outputs.”
Lang, Huang, and Li. 2025. “Debate Helps Weak-to-Strong Generalization.”
Michael, Mahdi, Rein, et al. 2023. “Debate Helps Supervise Unreliable Experts.”
Navajas, Niella, Garbulsky, et al. 2018. “Aggregated Knowledge from a Small Number of Debates Outperforms the Wisdom of Large Crowds.” Nature Human Behaviour.
Zakershahrak, and Ghodratnama. 2024. “Explanation, Debate, Align: A Weak-to-Strong Framework for Language Model Generalization.”