Position: Human Baselines in Model Evaluations Need Rigor and Transparency (With Recommendations & Reporting Checklist)

Wei, Kevin; Paskov, Patricia; Dev, Sunishchal; Byun, Michael J.; Reuel, Anka; Roberts-Gaal, Xavier; Calcott, Rachel; Coxon, Evie; Deshpande, Chinmay

Research
Posted on rand.org Feb 6, 2026
Published in: Proceedings of the 42nd International Conference on Machine Learning, Proceedings of Machine Learning Research (PMLR), Volume 267 (July 2025)

In this position paper, we argue that human baselines in foundation model evaluations must be more rigorous and more transparent to enable meaningful comparisons of human vs. AI performance, and we provide recommendations and a reporting checklist towards this end. Human performance baselines are vital for the machine learning community, downstream users, and policymakers to interpret AI evaluations. Models are often claimed to achieve "super-human" performance, but existing baselining methods are neither sufficiently rigorous nor sufficiently well-documented to robustly measure and assess performance differences. Based on a meta-review of the measurement theory and AI evaluation literatures, we derive a framework with recommendations for designing, executing, and reporting human baselines. We synthesize our recommendations into a checklist that we use to systematically review 115 human baselines (studies) in foundation model evaluations and thus identify shortcomings in existing baselining methods; our checklist can also assist researchers in conducting human baselines and reporting results. We hope our work can advance more rigorous AI evaluation practices that can better serve both the research community and policymakers.

Document Details

Copyright: Kevin Wei, Patricia Paskov, Sunishchal Dev, Michael J. Byun, Anka Reuel, Xavier Roberts-Gaal, Rachel Calcott, Evie Coxon, Chinmay Deshpande
Publisher: MLResearchPress
Availability: Non-RAND
Year: 2025
Pages: 61
Document Number: EP-71235

Research conducted by

RAND Global and Emerging Risks

This publication is part of the RAND external publication series. Many RAND studies are published in peer-reviewed scholarly journals, as chapters in commercial books, or as documents published by other organizations.

RAND is a nonprofit institution that helps improve policy and decisionmaking through research and analysis. RAND's publications do not necessarily reflect the opinions of its research clients and sponsors.
