Simpler Is Better for Autograders

Toward Cost-Effective LLM Evaluations for Open-Ended Tasks

Sunishchal Dev, Patricia Paskov, Andrew Sloan, Kevin Wei, Pedro Nascimento de Lima, Swaptik Chowdhury, Jason Johnson, William Marcellino

Research report, published Apr 22, 2026

The increasing capabilities of large language models (LLMs) have driven a need for rigorous, scalable evaluation frameworks. One of the primary bottlenecks in meeting this demand is the cost of human grading of model outputs: expert human graders are the gold standard for quality assessment, but their work is expensive and time-consuming. Automated methods, ranging from traditional natural language processing metrics to simpler string-matching or regular-expression techniques, offer lower-cost alternatives but often fail to capture semantic nuance and can be brittle in the face of variations in formatting or phrasing.
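To make the brittleness concrete, the toy example below (not from the report) shows how an exact-pattern grader accepts a verbatim answer while rejecting a correct paraphrase; the question and pattern are purely illustrative.

```python
# Illustrative example of why string-matching / regex grading is brittle:
# a correct paraphrase of the reference answer is scored as wrong.
import re

reference_pattern = re.compile(r"\bthe capital of france is paris\b", re.IGNORECASE)

responses = [
    "The capital of France is Paris.",       # matches -> graded correct
    "Paris is the capital city of France.",  # same meaning -> graded wrong
]
for response in responses:
    print(response, "->", bool(reference_pattern.search(response)))
```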

The common pairwise setting, in which an LLM chooses the better of two responses, has been well studied in work on using LLMs as judges. Pairwise grading, however, has limited utility in certain open-ended domains in which a pair of responses is not available or in which a more nuanced scoring scale is needed to capture differences in response quality.
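The difference between the two settings is visible in the prompts themselves. The templates below are illustrative sketches, not wording taken from the report.

```python
# Illustrative prompt templates contrasting pairwise judging with pointwise,
# rubric-based scoring of a single response.

PAIRWISE_PROMPT = """You are a judge. Given a question and two candidate responses,
reply with "A" or "B" to indicate which response is better.

Question: {question}
Response A: {response_a}
Response B: {response_b}
"""

POINTWISE_PROMPT = """You are a grader. Score the single response below on a 1-5 scale
against the rubric, replying with only the integer score.

Rubric: {rubric}
Question: {question}
Response: {response}
"""
```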

In this report, the authors focus on pointwise scoring for more-flexible, reference-free evaluation tasks, referring to these pointwise LLM graders as autograders. The report presents an empirical comparison of five approaches to such tasks: the single-rubric method, metaprompting, the list-of-items method, criteria decomposition, and declarative self-improving Python (DSPy) prompt optimization. These methods are tested across four expert-graded benchmarks and five LLMs.
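As a concrete illustration of what a single-rubric pointwise autograder can look like, the sketch below grades one response at a time against a fixed rubric with a single LLM call. It is a minimal sketch, not the report's implementation: the OpenAI-compatible client, model name, rubric wording, and 1-to-5 scale are all illustrative assumptions.

```python
# Minimal single-rubric pointwise autograder sketch (illustrative, not the
# report's configuration). Assumes the OpenAI Python SDK and an API key in
# the OPENAI_API_KEY environment variable.
import re
from openai import OpenAI

client = OpenAI()

RUBRIC = """Score the response from 1 (poor) to 5 (excellent) for factual accuracy,
completeness, and clarity. Reply with only the integer score."""

def grade(question: str, response: str, model: str = "gpt-4o-mini") -> int:
    """Return a 1-5 rubric score for a single free-form response."""
    completion = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Question: {question}\n\nResponse: {response}"},
        ],
    )
    text = completion.choices[0].message.content
    match = re.search(r"[1-5]", text)  # parse the first in-range digit
    if match is None:
        raise ValueError(f"Could not parse a score from: {text!r}")
    return int(match.group())
```

Under this framing, criteria decomposition would presumably replace the single call with one call per rubric criterion and then aggregate the per-criterion scores; that extra structure is what the findings below suggest does not reliably improve accuracy.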

Key Findings

  • The single-rubric method consistently provided the largest and most statistically significant reduction in normalized mean absolute error (9 to 25 percentage points; see the error-metric sketch after this list), outperforming more-complex autograding pipelines.
  • In several cases, the single-rubric method matched or exceeded the accuracy of nonexpert human graders at less than one-thousandth of the cost in time and money.
  • Metaprompting and DSPy optimization underperformed, often introducing noise or overfitting to synthetic validation data.
  • Criteria decomposition showed consistent underperformance relative to simpler approaches, indicating that additional structure does not necessarily improve LLM grading accuracy.
  • The list-of-items method may be beneficial for small models or tasks with long itemized criteria but offers no consistent advantage overall.
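The report page does not define the error metric used in the first finding. The sketch below assumes that normalized mean absolute error is the mean absolute difference between autograder and expert scores divided by the width of the scoring scale, so that it can be read as a fraction of the full scale; the 1-to-5 scale and the numbers are illustrative.

```python
# Normalized mean absolute error between autograder and expert scores,
# assuming "normalized" means MAE divided by the score-scale width.
import numpy as np

def normalized_mae(pred, expert, scale_min=1, scale_max=5):
    pred, expert = np.asarray(pred, float), np.asarray(expert, float)
    mae = np.abs(pred - expert).mean()
    return mae / (scale_max - scale_min)

# Example: a mean absolute error of 0.75 points on a 1-5 scale is 18.75% of the scale.
print(normalized_mae([3, 4, 5, 2], [4, 4, 4, 3]))  # 0.1875
```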

Recommendations

  • Use the single-rubric method as the default autograding method for rubric-scored open-ended tasks; it was the most accurate, cheapest, and most reliable method across domains.
  • Avoid overly complex prompting or optimization methods (e.g., metaprompting, DSPy) unless domain-specific evidence suggests they improve performance.
  • If using smaller LLMs or grading tasks involving long lists of criteria, consider the list-of-items method but validate its performance relative to the single-rubric method.
  • Given their superior cost-efficiency and comparable accuracy, use autograders to replace or augment nonexpert human graders in large-scale evaluation pipelines.

Citation

Chicago Manual of Style

Dev, Sunishchal, Patricia Paskov, Andrew Sloan, Kevin Wei, Pedro Nascimento de Lima, Swaptik Chowdhury, Jason Johnson, and William Marcellino, Simpler Is Better for Autograders: Toward Cost-Effective LLM Evaluations for Open-Ended Tasks. Santa Monica, CA: RAND Corporation, 2026. https://www.rand.org/pubs/research_reports/RRA4618-1.html.

This publication is part of the RAND research report series. Research reports present research findings and objective analysis that address the challenges facing the public and private sectors. All RAND research reports undergo rigorous peer review to ensure high standards for research quality and objectivity.


RAND is a nonprofit institution that helps improve policy and decisionmaking through research and analysis. RAND's publications do not necessarily reflect the opinions of its research clients and sponsors.