Simpler Is Better for Autograders
Toward Cost-Effective LLM Evaluations for Open-Ended Tasks
Research | Published Apr 22, 2026
Across four expert-graded benchmarks and five LLMs, a simple single-rubric autograder consistently beat more-complex grading pipelines. It reduced error by 9 to 25 percentage points, often matched or exceeded nonexpert human graders, and cut down grading time and cost substantially. This report shows that scalable evaluation for open-ended tasks does not require elaborate prompting or optimization.
The increasing capabilities of large language models (LLMs) have driven a need for rigorous, scalable evaluation frameworks. One of the primary bottlenecks in meeting this demand is the cost of human grading of model outputs: Expert human graders are the gold standard for quality assessment, but their effort is expensive and time-consuming. Automated methods—ranging from traditional natural language processing metrics to simpler string-matching or regular-expression techniques—offer lower-cost alternatives but often fail to capture semantic nuance and can be brittle in the face of variations in formatting or phrasing.
The common pairwise setting—in which an LLM chooses the better of two responses—has been well studied in research on using LLMs as judges. Pairwise grading, however, has limited utility in open-ended domains in which a pair of responses is not available or in which a more nuanced scoring scale is required to understand differences in response quality.
In this report, the authors focus on pointwise scoring for more-flexible, reference-free evaluation tasks, referring to these pointwise LLM graders as autograders. The report presents an empirical comparison of five approaches to such tasks: the single-rubric method, metaprompting, the list-of-items method, criteria decomposition, and declarative self-improving Python (DSPy) prompt optimization. These methods are tested across four expert-graded benchmarks and five LLMs.
This research was independently initiated and conducted by the Center on AI, Security, and Technology within RAND Global and Emerging Risks using income from operations and gifts and grants from philanthropic supporters.
This publication is part of the RAND research report series. Research reports present research findings and objective analysis that address the challenges facing the public and private sectors. All RAND research reports undergo rigorous peer review to ensure high standards for research quality and objectivity.
This document and trademark(s) contained herein are protected by law. This representation of RAND intellectual property is provided for noncommercial use only. Unauthorized posting of this publication online is prohibited; linking directly to this product page is encouraged. Permission is required from RAND to reproduce, or reuse in another form, any of its research documents for commercial purposes. For information on reprint and reuse permissions, please visit www.rand.org/pubs/permissions.
RAND is a nonprofit institution that helps improve policy and decisionmaking through research and analysis. RAND's publications do not necessarily reflect the opinions of its research clients and sponsors.