AI evals

Psychometrics for robots

2025-11-09 — 2025-12-09

Wherein the distinction between public, static benchmarks and private, context‑specific evals is laid out, and the EvalEval movement’s use of evaluation cards and RCT‑style studies is noted.

economics
game theory
how do science
incentive mechanisms
institutions
machine learning
neural nets
statistics

A cousin of ML benchmarks.


At the Evaluating evals workshop at NeurIPS 2024, I decided I needed to understand the distinction between “benchmarks” and “evals” in ML/AI.

Benchmarks are standardized tests used to compare AI models on common tasks, while evals are broader, often custom, evaluations used to judge whether a model is good enough for a particular use case or deployment context.

That is to say, benchmarks in ML/AI are fixed datasets and protocols meant to give “apples-to-apples” scores across models or model variants (e.g., MMLU, GSM8K, TruthfulQA for LLMs). They emphasize comparability and repeatability: same inputs, same metrics, so we can rank or track progress over time and publish leaderboards or marketing claims like “X% on benchmark Y.” Conceptually, benchmarks mostly answer “What can this model do in general, relative to others?” rather than “Is it good enough for my specific workflow?”
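To make the “same inputs, same metrics” idea concrete, here is a toy benchmark harness in Python. Everything in it (the items, the `exact_match` metric, `run_benchmark`) is a made-up illustration, not any real benchmark’s code; the point is only that the dataset and metric are frozen, so every model gets a comparable number.

```python
# Toy sketch of a benchmark harness: a fixed dataset plus a fixed metric,
# so any model scored here gets a comparable number. All names are illustrative.
from typing import Callable

# Fixed, versioned items: every model sees exactly these prompts.
BENCHMARK_ITEMS = [
    {"prompt": "What is 17 * 3?", "reference": "51"},
    {"prompt": "What is the capital of Australia?", "reference": "Canberra"},
]

def exact_match(prediction: str, reference: str) -> bool:
    """The benchmark's fixed metric: normalised exact match."""
    return prediction.strip().lower() == reference.strip().lower()

def run_benchmark(model: Callable[[str], str]) -> float:
    """Score any model on the same items with the same metric."""
    hits = sum(
        exact_match(model(item["prompt"]), item["reference"])
        for item in BENCHMARK_ITEMS
    )
    return hits / len(BENCHMARK_ITEMS)

# Leaderboard-style, comparable numbers:
# score_a = run_benchmark(model_a)   # e.g. 0.87
# score_b = run_benchmark(model_b)   # e.g. 0.91
```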

“Evals” (evaluations) are a broader category that focuses on how a system performs for a concrete task, product, or risk profile. In contemporary LLM practice, evals are often application-specific suites (e.g., for a customer-support bot, legal RAG system, or internal agent) that may mix automated metrics, human judgements, and logs from real users. They tend to be iterative and operational, used for error analysis, regression testing, safety checks, and release gating, rather than just for external comparison.
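By contrast, here is a toy sketch of what an application-specific eval looks like when it is wired in as a release gate for a hypothetical customer-support bot. The scenarios, checks, and threshold are invented for illustration; a real suite would also log transcripts, track error categories, and fold in human judgements.

```python
# Illustrative sketch of an application-specific eval used as a release gate.
# Scenarios, checks, and helper names are hypothetical, not a real suite.

def refund_policy_correct(answer: str) -> bool:
    """Automated check specific to this product's domain knowledge."""
    return "30 days" in answer and "receipt" in answer.lower()

def run_release_eval(bot) -> dict:
    """Run the support bot over curated scenarios drawn from real tickets."""
    scenarios = [
        {"ticket": "Can I return a jacket I bought last month?",
         "check": refund_policy_correct},
        {"ticket": "My order #1234 never arrived, what now?",
         "check": lambda a: "replacement" in a.lower() or "refund" in a.lower()},
    ]
    results = [s["check"](bot(s["ticket"])) for s in scenarios]
    return {"pass_rate": sum(results) / len(results), "n": len(results)}

def gate_release(bot, threshold: float = 0.95) -> bool:
    """Block deployment if the eval regresses below the agreed bar."""
    return run_release_eval(bot)["pass_rate"] >= threshold
```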

Benchmarks are typically public, static, and model-centric, while evals are often private, dynamic, and product- or context-centric (e.g., built from our own user data and goals). In practice, there is overlap—internal eval suites can double as “benchmarks” for marketing, and published benchmarks can be part of an eval pipeline—but the intent differs: comparison vs decision-making for a specific deployment.

More recently, an EvalEval school of evals has arisen. EvalEval is a research coalition focused on meta-evaluation: understanding how benchmarks behave over time, how to design better evaluations, and how to document and standardize them (e.g., “evaluation cards,” harness tutorials). Their evals aim not to be ad hoc but to be systematically designed, documented, and scientifically grounded studies of AI systems’ capabilities, risks, and broader societal impacts. They claim to be closer to classical science than the benchmark world: their approach treats AI models and their evaluations as observable phenomena that can be studied in their own right, analysed via RCTs, standard experimental designs, and statistical techniques, as I learned at NeurIPS 2025.
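As a small taste of what “standard experimental methods and statistical techniques” means in this setting, here is a paired bootstrap for the accuracy gap between two models scored on the same eval items. It is deliberately the simplest possible example (the literature goes considerably further, e.g. hierarchical Bayesian models of eval results), and the function below is my own sketch rather than anyone’s published harness.

```python
# Paired bootstrap for the accuracy difference between two models scored on
# the *same* eval items: a minimal example of treating eval results as data
# to be analysed statistically rather than as a single leaderboard number.
import random

def bootstrap_accuracy_diff(scores_a, scores_b, n_boot=10_000, seed=0):
    """scores_a, scores_b: per-item 0/1 correctness for the same items, in order.
    Returns the observed accuracy difference and a 95% bootstrap interval."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    observed = sum(scores_a) / n - sum(scores_b) / n
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample items, keeping pairs
        diffs.append(sum(scores_a[i] for i in idx) / n
                     - sum(scores_b[i] for i in idx) / n)
    diffs.sort()
    return observed, (diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)])

# a, b = per-item correctness lists from two models on one eval set:
# diff, (lo, hi) = bootstrap_accuracy_diff(a, b)
# If the interval excludes 0, the gap is unlikely to be resampling noise.
```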

EvalEval also explicitly targets “broader impact evaluations” and decision-makers, meaning assessments of downstream risks, opportunities, and externalities, not just accuracy or utility for a single product. That adds a policy and governance dimension that goes beyond the earlier framing of evals as primarily for model and feature decisions within a single organization.

I am happy with this framing.

1 Inspect

A neat software framework from UK AISI for evaluating LLMs and other AI systems.

Inspect can be used for a broad range of evaluations that measure coding, agentic tasks, reasoning, knowledge, behavior, and multi-modal understanding. Its core building blocks are datasets, solvers, and scorers, each of which can be swapped out or extended with custom components.
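Here is roughly what a minimal Inspect task looks like, following the quick-start pattern in its documentation. I have kept it to a toy arithmetic sample, and the exact module layout and argument names (e.g. `solver=` vs the older `plan=`) may shift between versions, so treat it as a sketch rather than gospel.

```python
# A minimal Inspect task: a dataset of Samples, a solver, and a scorer.
# Sketch only; check the current inspect_ai docs for exact signatures.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import exact
from inspect_ai.solver import generate

@task
def arithmetic():
    return Task(
        dataset=[Sample(input="What is 1 + 1?", target="2")],
        solver=[generate()],   # just call the model; richer solvers chain tools, critique, etc.
        scorer=exact(),        # compare the completion to the target exactly
    )

# Run from the shell against whichever provider you have credentials for, e.g.:
#   inspect eval arithmetic.py --model openai/gpt-4o
```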

2 Incoming

Exploring LLM evaluations and benchmarking | genai-research – Weights & Biases
