Evaluating Large Language Models' Abilities to Process and Understand Technical Policy Reports

Zara Fatima Abdurahaman, Prateek Puri, Mohammad Ahmadi, Edward Geist

Research Report | Published Apr 28, 2026

Large language models (LLMs), with their ability to process, organize, and summarize large volumes of information, are increasingly being evaluated as tools to support policy research and analysis. For instance, frameworks such as retrieval-augmented generation (RAG) and GraphRAG enable LLMs to connect to non-public corpora or policy-specific document repositories to support such tasks as retrieving contextually relevant information, providing factually grounded answers, synthesizing findings across reports, identifying evidence gaps, and facilitating systematic reviews. As these potential use cases and tools are explored, it is important to assess how well LLMs perform and how reliably they operate in policy-relevant settings. Existing evaluations, such as the Massive Multitask Language Understanding benchmark and the Beyond the Imitation Game benchmark, provide useful information about LLMs' capacities for factual recall and general reasoning. However, they do not fully capture performance on domain-specific, real-world policy tasks.
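As an illustration of how such a retrieval step can ground a model's answer in a document repository, the sketch below assembles a prompt from the passages most similar to a claim. It is a minimal sketch under simplified assumptions: the in-memory corpus, the lexical-overlap scoring, and the prompt format are illustrative placeholders, not the configuration evaluated in this report.

```python
# Minimal sketch of retrieval-augmented generation (RAG) over a policy corpus.
# Illustrative only: the corpus, scoring method, and prompt format are
# simplified assumptions, not the pipeline studied in the report.

from collections import Counter

# Hypothetical in-memory corpus of report passages.
CORPUS = {
    "rpt1-p3": "The program reduced processing times by 12 percent in 2022.",
    "rpt1-p9": "Stakeholders disagreed on whether cost savings were sustainable.",
    "rpt2-p1": "The evaluation found no measurable effect on staffing levels.",
}

def score(query: str, passage: str) -> float:
    """Crude lexical-overlap score standing in for an embedding-based retriever."""
    q = Counter(query.lower().split())
    p = Counter(passage.lower().split())
    return sum((q & p).values()) / (len(q) or 1)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the identifiers of the k passages most similar to the query."""
    ranked = sorted(CORPUS, key=lambda pid: score(query, CORPUS[pid]), reverse=True)
    return ranked[:k]

def build_prompt(claim: str) -> str:
    """Assemble a grounded prompt; an LLM call would consume this string."""
    passages = "\n".join(f"[{pid}] {CORPUS[pid]}" for pid in retrieve(claim))
    return (
        "Using only the passages below, assess the claim.\n"
        f"Passages:\n{passages}\n\nClaim: {claim}\nAssessment:"
    )

if __name__ == "__main__":
    print(build_prompt("The program cut processing times by 12 percent."))
```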

In this report, the authors detail their development of a specialized benchmark for evaluating LLMs' abilities to process and understand technical policy reports, addressing a gap in existing domain-specific LLM evaluation. The authors designed the benchmark specifically for policy-relevant applications by creating a dataset of claims that can be evaluated for their faithfulness to source research reports. To produce the benchmark dataset at scale while maintaining quality, they combined human expertise with artificial intelligence (AI) assistance. The authors document the development process and their preliminary benchmark testing results, share the lessons learned along the way, and provide recommendations for future work.

Key Findings

Domain-specific benchmarks can provide useful insights for assessing specialized applications.

  • Specialized benchmarks help gauge how well LLMs perform in practical, domain-specific contexts. By incorporating materials from published documents within the policy domain, the authors' benchmark reflected tasks and challenges of real-world policy research settings.

Evaluating domain-specific reasoning is imperative, and more-nuanced truthfulness categories will better capture real-world policy analysis.

  • The authors used their benchmark to test LLM systems' abilities to evaluate the truthfulness of a series of policy-relevant claims. Instead of categorizing truthfulness in binary terms (e.g., true or false), the authors created six different categories of truthfulness. This approach reveals differences in how generative AI systems handle unsupported assertions, partial inaccuracies, inferred reasoning, and conflicting opinions, as sketched in the illustrative example below.
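As a rough illustration of what a multicategory claim-evaluation record could look like, the sketch below defines six placeholder labels and an exact-match scoring function. The label names, field names, and example claim are assumptions drawn from the summary language above, not the report's actual taxonomy or data.

```python
# Illustrative claim record and truthfulness labels for a claim-evaluation task.
# The six label names below are placeholders inferred from the summary;
# the report's actual category definitions may differ.

from dataclasses import dataclass
from enum import Enum

class Truthfulness(Enum):
    SUPPORTED = "fully supported by the source report"
    UNSUPPORTED = "asserted but not addressed by the source"
    PARTIALLY_INACCURATE = "mixes accurate and inaccurate details"
    INFERRED = "not stated directly but reasonably inferable"
    CONFLICTING = "sources or experts disagree"
    CONTRADICTED = "directly contradicted by the source"

@dataclass
class ClaimExample:
    claim: str                 # candidate statement about a policy report
    source_doc: str            # identifier of the report the claim is judged against
    gold_label: Truthfulness   # human-verified truthfulness category

def score_prediction(example: ClaimExample, predicted: Truthfulness) -> bool:
    """Exact-match scoring: the prediction must hit the gold category."""
    return predicted is example.gold_label

example = ClaimExample(
    claim="The report concludes the pilot reduced costs by 12 percent.",
    source_doc="RRA0000-1",
    gold_label=Truthfulness.PARTIALLY_INACCURATE,
)
print(score_prediction(example, Truthfulness.SUPPORTED))  # False
```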

Preliminary testing suggested that baseline LLM systems might need further modification and stress-testing before they can reliably support high-stakes policy work.

  • In preliminary evaluations, the tested systems' overall accuracy ranged from 48 percent to 54 percent (see the scoring sketch below), suggesting that, in their current forms, these baseline configurations might require further modification and testing to ensure that they can accurately interpret complex information.
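For reference, overall accuracy here is simply the share of claims whose predicted truthfulness category matches the human-assigned label, as in the toy computation below; the gold labels and predictions shown are made up for illustration.

```python
# Toy accuracy computation over multicategory truthfulness labels.
# The gold labels and predictions are fabricated for illustration only.

def accuracy(gold: list[str], predicted: list[str]) -> float:
    """Fraction of claims whose predicted category matches the gold label."""
    assert len(gold) == len(predicted)
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

gold = ["supported", "unsupported", "inferred", "conflicting", "supported", "contradicted"]
pred = ["supported", "supported", "inferred", "unsupported", "unsupported", "contradicted"]
print(f"accuracy = {accuracy(gold, pred):.0%}")  # 50% on this toy set
```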

As part of building the benchmark, the authors experimented with automated claim‑generation methods but found that the systems struggled to create complex, high-quality claims.

  • The authors applied a human-AI hybrid approach to create the benchmark but had to intervene substantially to refine or correct the AI output.

Recommendations

  • Researchers might wish to explore a broader variety of claim‑generation techniques, including existing methods that the authors did not examine, to better capture the interpretative complexity relevant to policy analysis.
  • Researchers could also look into hybrid human-AI workflows that can leverage systems' capabilities while maintaining human oversight.
  • Researchers could investigate newer reasoning models, model architectures, retrieval approaches, or context-integration frameworks that might improve support for more context-dependent reasoning tasks.
  • Researchers could expand the authors' benchmark to encompass additional policy domains and document types to increase the benchmark's applicability.
  • Researchers could extend the benchmark to include cross-document claims, which are helpful for such policy analysis workflows as synthesizing evidence across reports, identifying consensus and conflicts between sources, and tracking policy recommendations over time.

Document Details

Citation

Chicago Manual of Style

Abdurahaman, Zara Fatima, Prateek Puri, Mohammad Ahmadi, and Edward Geist, Evaluating Large Language Models' Abilities to Process and Understand Technical Policy Reports. Santa Monica, CA: RAND Corporation, 2026. https://www.rand.org/pubs/research_reports/RRA4269-1.html.

This publication is part of the RAND research report series. Research reports present research findings and objective analysis that address the challenges facing the public and private sectors. All RAND research reports undergo rigorous peer review to ensure high standards for research quality and objectivity.

RAND is a nonprofit institution that helps improve policy and decisionmaking through research and analysis. RAND's publications do not necessarily reflect the opinions of its research clients and sponsors.