AI evals

Psychometrics for robots

2025-11-09 — 2025-12-09

Wherein the distinction between public, static benchmarks and private, context‑specific evals is laid out, and the EvalEval movement’s use of evaluation cards and RCT‑style studies is noted.

economics
game theory
how do science
incentive mechanisms
institutions
machine learning
neural nets
statistics

A cousin of ML benchmarks.


At the Evaluating evals workshop at NeurIPS 2024, I decided I needed to understand the distinction between “benchmarks” and “evals” in ML/AI.

Benchmarks are standardized tests used to compare AI models on common tasks, while evals are broader, often custom, evaluations used to judge whether a model is good enough for a particular use case or deployment context.

That is to say, benchmarks in ML/AI are fixed datasets and protocols meant to give “apples-to-apples” scores across models or model variants (e.g., MMLU, GSM8K, TruthfulQA for LLMs). They emphasize comparability and repeatability: same inputs, same metrics, so we can rank or track progress over time and publish leaderboards or marketing claims like “X% on benchmark Y.” Conceptually, benchmarks mostly answer “What can this model do in general, relative to others?” rather than “Is it good enough for my specific workflow?”
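To make the “same inputs, same metrics” idea concrete, here is a toy benchmark harness in Python. Everything in it (the items, the `exact_match` metric, `run_benchmark`) is a made-up illustration, not any real benchmark’s code; the point is only that the dataset and metric are frozen, so every model gets a comparable number.

```python
# Toy sketch of a benchmark harness: a fixed dataset plus a fixed metric,
# so any model scored here gets a comparable number. All names are illustrative.
from typing import Callable

# Fixed, versioned items: every model sees exactly these prompts.
BENCHMARK_ITEMS = [
    {"prompt": "What is 17 * 3?", "reference": "51"},
    {"prompt": "What is the capital of Australia?", "reference": "Canberra"},
]

def exact_match(prediction: str, reference: str) -> bool:
    """The benchmark's fixed metric: normalised exact match."""
    return prediction.strip().lower() == reference.strip().lower()

def run_benchmark(model: Callable[[str], str]) -> float:
    """Score any model on the same items with the same metric."""
    hits = sum(
        exact_match(model(item["prompt"]), item["reference"])
        for item in BENCHMARK_ITEMS
    )
    return hits / len(BENCHMARK_ITEMS)

# Leaderboard-style, comparable numbers:
# score_a = run_benchmark(model_a)   # e.g. 0.87
# score_b = run_benchmark(model_b)   # e.g. 0.91
```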

“Evals” (evaluations) are a broader category that focuses on how a system performs for a concrete task, product, or risk profile. In contemporary LLM practice, evals are often application-specific suites (e.g., for a customer-support bot, legal RAG system, or internal agent) that may mix automated metrics, human judgements, and logs from real users. They tend to be iterative and operational, used for error analysis, regression testing, safety checks, and release gating, rather than just for external comparison.
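By contrast, here is a toy sketch of what an application-specific eval looks like when it is wired in as a release gate for a hypothetical customer-support bot. The scenarios, checks, and threshold are invented for illustration; a real suite would also log transcripts, track error categories, and fold in human judgements.

```python
# Illustrative sketch of an application-specific eval used as a release gate.
# Scenarios, checks, and helper names are hypothetical, not a real suite.

def refund_policy_correct(answer: str) -> bool:
    """Automated check specific to this product's domain knowledge."""
    return "30 days" in answer and "receipt" in answer.lower()

def run_release_eval(bot) -> dict:
    """Run the support bot over curated scenarios drawn from real tickets."""
    scenarios = [
        {"ticket": "Can I return a jacket I bought last month?",
         "check": refund_policy_correct},
        {"ticket": "My order #1234 never arrived, what now?",
         "check": lambda a: "replacement" in a.lower() or "refund" in a.lower()},
    ]
    results = [s["check"](bot(s["ticket"])) for s in scenarios]
    return {"pass_rate": sum(results) / len(results), "n": len(results)}

def gate_release(bot, threshold: float = 0.95) -> bool:
    """Block deployment if the eval regresses below the agreed bar."""
    return run_release_eval(bot)["pass_rate"] >= threshold
```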

Benchmarks are typically public, static, and model-centric, while evals are often private, dynamic, and product- or context-centric (e.g., built from our own user data and goals). In practice, there is overlap—internal eval suites can double as “benchmarks” for marketing, and published benchmarks can be part of an eval pipeline—but the intent differs: comparison vs decision-making for a specific deployment.

More recently, an EvalEval school of evals has arisen. EvalEval is a research coalition focused on meta-evaluation: understanding how benchmarks behave over time, how to design better evaluations, and how to document and standardize them (e.g., “evaluation cards,” harness tutorials). Their evals aim not to be ad hoc but to be systematically designed, documented, and scientifically grounded studies of AI systems’ capabilities, risks, and broader societal impacts. They claim to be closer to classical science than the benchmark world: their approach treats AI models and their evaluations as observable phenomena that can be studied in their own right, analysed via RCTs, standard experimental designs, and statistical techniques, as I learned at NeurIPS 2025.
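As a small taste of what “standard experimental methods and statistical techniques” means in this setting, here is a paired bootstrap for the accuracy gap between two models scored on the same eval items. It is deliberately the simplest possible example (the literature goes considerably further, e.g. hierarchical Bayesian models of eval results), and the function below is my own sketch rather than anyone’s published harness.

```python
# Paired bootstrap for the accuracy difference between two models scored on
# the *same* eval items: a minimal example of treating eval results as data
# to be analysed statistically rather than as a single leaderboard number.
import random

def bootstrap_accuracy_diff(scores_a, scores_b, n_boot=10_000, seed=0):
    """scores_a, scores_b: per-item 0/1 correctness for the same items, in order.
    Returns the observed accuracy difference and a 95% bootstrap interval."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    observed = sum(scores_a) / n - sum(scores_b) / n
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample items, keeping pairs
        diffs.append(sum(scores_a[i] for i in idx) / n
                     - sum(scores_b[i] for i in idx) / n)
    diffs.sort()
    return observed, (diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)])

# a, b = per-item correctness lists from two models on one eval set:
# diff, (lo, hi) = bootstrap_accuracy_diff(a, b)
# If the interval excludes 0, the gap is unlikely to be resampling noise.
```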

EvalEval also explicitly targets “broader impact evaluations” and decision-makers, meaning assessments of downstream risks, opportunities, and externalities, not just accuracy or utility for a single product. That adds a policy and governance dimension that goes beyond the earlier framing of evals as primarily for model and feature decisions within a single organization.

I am happy with this framing.

1 Inspect

A neat software framework from UK AISI for evaluating LLMs and other AI systems.

Inspect can be used for a broad range of evaluations that measure coding, agentic tasks, reasoning, knowledge, behavior, and multi-modal understanding. Its core building blocks are datasets, solvers, and scorers, each of which can be swapped out or extended with custom components.
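Here is roughly what a minimal Inspect task looks like, following the quick-start pattern in its documentation. I have kept it to a toy arithmetic sample, and the exact module layout and argument names (e.g. `solver=` vs the older `plan=`) may shift between versions, so treat it as a sketch rather than gospel.

```python
# A minimal Inspect task: a dataset of Samples, a solver, and a scorer.
# Sketch only; check the current inspect_ai docs for exact signatures.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import exact
from inspect_ai.solver import generate

@task
def arithmetic():
    return Task(
        dataset=[Sample(input="What is 1 + 1?", target="2")],
        solver=[generate()],   # just call the model; richer solvers chain tools, critique, etc.
        scorer=exact(),        # compare the completion to the target exactly
    )

# Run from the shell against whichever provider you have credentials for, e.g.:
#   inspect eval arithmetic.py --model openai/gpt-4o
```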

2 Incoming

Exploring LLM evaluations and benchmarking | genai-research – Weights & Biases
