AI evals
Psychometrics for robots
2025-12-09 — 2026-01-18
Wherein the distinction between benchmarks and evals is laid out, the rise of meta‑evaluations such as EvalEval and of tools like Inspect is noted, and human judgements and user logs are pressed into service.
A cousin of ML benchmarks.
At the Evaluating evals workshop at NeurIPS 2024, I decided I needed to understand the distinction between “benchmarks” and “evals” in ML/AI.
Benchmarks are standardized tests used to compare AI models on common tasks, while evals, as far as I can tell, are broader, often custom, evaluations used to judge whether a model is good enough for a particular use case or deployment context.
That is to say, benchmarks in ML/AI are fixed datasets and protocols meant to give “apples-to-apples” scores across models or model variants (e.g., MMLU, GSM8K, TruthfulQA for LLMs). They emphasize comparability and repeatability: same inputs, same metrics, so we can rank or track progress over time and publish leaderboards or marketing claims like “X% on benchmark Y.” Conceptually, benchmarks mostly answer “What can this model do in general, relative to others?” rather than “Is it good enough for my specific workflow?”
“Evals” (evaluations) are a broader category that focuses on how a system performs for a concrete task, product, or risk profile. In contemporary LLM practice, evals are often application-specific suites (e.g., for a customer-support bot, legal RAG system, or internal agent) that may mix automated metrics, human judgements, and logs from real users. They tend to be iterative and operational, used for error analysis, regression testing, safety checks, and release gating, rather than just for external comparison.
Benchmarks are typically public, static, and model-centric, while evals are often private, dynamic, and product- or context-centric (e.g., built from our own user data and goals). In practice, there is overlap—internal eval suites can double as “benchmarks” for marketing, and published benchmarks can be part of an eval pipeline—but the intent differs: comparison vs decision-making for a specific deployment.
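To make the "evals as release gating" idea concrete, here is a minimal sketch of my own devising (not any particular framework's API): grade a model's answers on application-specific cases and block the release if the pass rate slips below a threshold. The cases, grader, and threshold are all illustrative placeholders.

```python
# A minimal sketch (my own illustration, not a specific framework) of an eval
# used as a release gate: grade answers on application-specific cases and fail
# the release if the pass rate drops below a threshold.
PASS_THRESHOLD = 0.90  # hypothetical release criterion

# Hypothetical cases; a real suite would be drawn from user logs and labelled failures.
CASES = [
    {"prompt": "How do I reset my password?", "must_include": ["reset link"]},
    {"prompt": "Cancel my subscription.", "must_include": ["account settings"]},
]

def grade(case: dict, answer: str) -> bool:
    """Toy grader: pass iff every required phrase appears in the answer."""
    return all(phrase.lower() in answer.lower() for phrase in case["must_include"])

def release_gate(generate) -> None:
    results = [grade(case, generate(case["prompt"])) for case in CASES]
    pass_rate = sum(results) / len(results)
    print(f"pass rate: {pass_rate:.0%} over {len(results)} cases")
    if pass_rate < PASS_THRESHOLD:
        raise SystemExit("release gate failed")

if __name__ == "__main__":
    # Swap in a real model client; the stub below just returns a canned answer.
    release_gate(generate=lambda prompt: "Use the reset link in account settings.")
```

The same per-case results double as a regression-test log: diff them between model versions to see which cases newly fail.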
More recently, a school of evals called EvalEval has arisen. EvalEval is a research coalition focused on meta-evaluation: understanding how benchmarks behave over time, how to design better evaluations, and how to document and standardize them (e.g., “evaluation cards,” harness tutorials). Their “evals” are not ad hoc; they’re systematically designed, documented, and scientifically grounded studies of AI systems’ capabilities, risks, and broader societal impacts. They claim to be closer to classical science than the benchmark world: their approach treats AI models and their evaluations as observable phenomena to be studied in their own right, analyzed via RCTs, standard experimental methods, and statistical techniques, as I learned at NeurIPS 2025.
EvalEval also explicitly targets “broader impact evaluations” and decision-makers, meaning assessments of downstream risks, opportunities, and externalities, not just accuracy or utility for a single product. That adds a policy and governance dimension that goes beyond the earlier framing of evals as primarily for model and feature decisions within a single organization.
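For flavour, here is a hypothetical sketch of the kind of metadata an “evaluation card” might record; the field names are my own invention, not EvalEval’s actual schema.

```python
# A hypothetical evaluation card as a plain data structure; field names are
# illustrative, not EvalEval's schema.
from dataclasses import dataclass, field

@dataclass
class EvaluationCard:
    name: str
    version: str
    construct: str                  # what the eval is intended to measure
    population: str                 # which systems / deployment contexts it applies to
    sampling: str                   # how the items were selected
    metrics: list[str]
    uncertainty: str                # how variance / confidence intervals are reported
    known_limitations: list[str] = field(default_factory=list)

card = EvaluationCard(
    name="support-bot-helpfulness",
    version="0.3",
    construct="helpfulness of responses to tier-1 support queries",
    population="customer-support chatbots over our ticket corpus",
    sampling="stratified sample of 500 tickets from Q3 logs",
    metrics=["pass rate (rubric-graded)", "escalation-needed rate"],
    uncertainty="95% bootstrap CIs over tickets",
    known_limitations=["English-only", "graders could see model identity"],
)
```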
I am happy with this framing.
1 Statistical methods for evaluating AI systems
Historically, we’ve done this remarkably poorly. Many published works notoriously ignore construct validity, randomized controlled trials, statistical power, multiple comparisons, and other basic tenets of experimental design and analysis. Many have written on this theme; see e.g. (Biderman et al. 2024; Eriksson et al. 2025; Gevers et al. 2025; Salaudeen et al. 2025).
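As a small illustration of the kind of analysis this literature advocates (my own sketch, not taken from those papers): a paired bootstrap confidence interval for the accuracy difference between two models scored on the same eval items, with a Bonferroni adjustment when several comparisons are made.

```python
# A minimal sketch: paired bootstrap CI for the accuracy difference between two
# models scored on the same items, with a Bonferroni adjustment across comparisons.
import numpy as np

rng = np.random.default_rng(0)

def paired_accuracy_diff_ci(scores_a, scores_b, n_boot=10_000, alpha=0.05):
    """Bootstrap CI for mean(scores_a - scores_b) over shared eval items."""
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    n = len(diffs)
    idx = rng.integers(0, n, size=(n_boot, n))   # resample items, keeping the pairing
    boot_means = diffs[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return diffs.mean(), (lo, hi)

# Hypothetical per-item 0/1 scores for two models on the same 200 prompts.
model_a = rng.integers(0, 2, size=200)
model_b = rng.integers(0, 2, size=200)

# When comparing k model pairs, divide alpha by k (Bonferroni) before calling.
k = 3
mean_diff, (lo, hi) = paired_accuracy_diff_ci(model_a, model_b, alpha=0.05 / k)
print(f"accuracy difference {mean_diff:+.3f}, adjusted 95% CI [{lo:+.3f}, {hi:+.3f}]")
```

Pairing matters: because both models are scored on the same items, the item-level difficulty variance cancels and the comparison is far better powered than comparing two independent accuracy estimates.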
2 Inspect
A neat software framework from the UK AISI for evaluating LLMs and other AI systems.
Inspect can be used for a broad range of evaluations that measure coding, agentic tasks, reasoning, knowledge, behavior, and multi-modal understanding. It ships with built-in components for prompt engineering, tool usage, multi-turn dialog, and model-graded scoring.
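A minimal task definition, following the quick-start pattern in the Inspect docs; the task name, samples, filename, and model string below are placeholders, and argument names may differ between Inspect versions.

```python
# A minimal Inspect task, modelled on the docs' quick-start pattern.
# Exact argument names (e.g. solver vs the older plan) vary across versions.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import includes
from inspect_ai.solver import generate, system_message

@task
def capital_cities():
    return Task(
        # Inline samples here for brevity; real suites usually load a dataset file.
        dataset=[
            Sample(input="What is the capital of France?", target="Paris"),
            Sample(input="What is the capital of Japan?", target="Tokyo"),
        ],
        solver=[system_message("Answer concisely."), generate()],
        scorer=includes(),  # pass if the target string appears in the model output
    )

# Run from the shell against a model of your choice, e.g.:
#   inspect eval capitals.py --model openai/gpt-4o
```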
3 Incoming
Exploring LLM evaluations and benchmarking | genai-research – Weights & Biases
