Toward Comprehensive Benchmarking of the Biological Knowledge of Frontier Large Language Models

Sunishchal Dev, Charles Teague, Grant Ellison, Kyle Brady, Jeffrey Lee, Sarah L. Gebauer, Henry Alexander Bradley, Dawid Maciorowski, Bria Persaud, Jordan Despanie, et al.

Published Nov 25, 2025

Artificial intelligence (AI) systems demonstrate deep knowledge across a broad variety of scientific domains, including biology and chemistry, and bad actors could misuse some of these systems to develop biological or chemical weapons.

The rapid development of ever-more-capable models requires evaluation mechanisms that allow governments to respond to emerging security risks in a timely manner. Policymakers, industry experts, and third-party evaluators lack a cohesive standard for testing the safety of AI systems. These challenges complicate efforts to determine the degree to which frontier AI systems pose biological or chemical risks.

The authors evaluate how useful frontier AI systems would be to actors attempting such misuse. They focus on custom-tuned versions of open-weight AI models, which can be modified to remove safety guardrails and/or potentially increase biological capabilities. For this report, the authors evaluated 39 of the most-capable models (as of May 2025) against six public biological and chemical knowledge benchmarks and two refusal benchmarks relevant to biological and chemical threats.
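To make the evaluation setup concrete, below is a minimal sketch of how one model could be scored against a multiple-choice knowledge benchmark of the kind used here. It is not the authors' actual harness: the benchmark file format, prompt template, and answer-extraction rule are illustrative assumptions, and `ask_model` stands in for whatever API client is used.

```python
import json

LETTERS = "ABCDEFGH"

def load_benchmark(path: str) -> list[dict]:
    """Load benchmark items. Assumed format: one JSON object per line
    with 'question', 'choices' (list of strings), and 'answer' (index)."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def format_prompt(item: dict) -> str:
    # Pose the question with lettered answer choices (A, B, C, ...).
    choices = "\n".join(
        f"{LETTERS[i]}. {c}" for i, c in enumerate(item["choices"])
    )
    return f"{item['question']}\n{choices}\nAnswer with a single letter."

def score_model(ask_model, items: list[dict]) -> float:
    """Return accuracy: the fraction of items answered correctly.
    `ask_model` is any callable mapping a prompt string to response text."""
    correct = 0
    for item in items:
        reply = ask_model(format_prompt(item)).strip().upper()
        # Treat the first character of the reply as the chosen option.
        if reply[:1] == LETTERS[item["answer"]]:
            correct += 1
    return correct / len(items)
```

Even in a sketch this simple, choices such as the prompt template, sampling temperature, and how an answer is extracted from free text all move the resulting score, which is why the report's recommendations stress documenting them.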

Key Findings

  • Frontier large language models (LLMs), led by reasoning models, are exceeding expert human performance on biology laboratory protocol and graduate-level question-answering benchmarks. All but three models that the authors tested surpassed nonexperts on a graduate-level biology benchmark.
  • Many publicly available biology and chemistry benchmarks are at or approaching saturation by the latest generation of models. Existing frontier models achieve near-maximum performance, meaning the benchmarks will be less useful for measuring capability gains in future models.
  • How fine-tuning LLMs to remove safety training affects dual-use biological capabilities and real-world risk remains unclear. The authors' "unsafety-tuning" was effective in reducing refusals of harmful requests, but it also caused performance drops on knowledge benchmarks. More work is needed to determine whether different training methods and data would improve unsafety-tuned model performance. (A sketch of how a refusal rate can be scored follows this list.)
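To illustrate what the refusal benchmarks measure, here is a minimal sketch of scoring a model's refusal rate over a set of harmful prompts. The keyword heuristic is a stand-in assumption; real evaluations typically use a trained classifier or a grader model to judge refusals, and the report's detection method is not specified here.

```python
# Placeholder phrases that signal a refusal; a real evaluation would use
# a more robust judge than substring matching.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry")

def is_refusal(response: str) -> bool:
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def refusal_rate(ask_model, harmful_prompts: list[str]) -> float:
    """Fraction of harmful prompts the model declines to answer.
    `ask_model` maps a prompt string to the model's text response."""
    refusals = sum(is_refusal(ask_model(p)) for p in harmful_prompts)
    return refusals / len(harmful_prompts)
```

A safety-trained model should score near 1.0 on such a metric; the finding above is that unsafety-tuning pushes this rate down while also lowering scores on the knowledge benchmarks.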

Recommendations

  • Benchmark creators should include human baselines to contextualize model performance, and baseline developers should document their methods for recruiting and testing human baseliners. For the most-realistic expert baselines, each human should be tested only on questions within specific domains of expertise rather than full question sets covering a broader variety of subtopics within a given field.
  • Benchmark creators should focus on making more-challenging and -specialized evaluations with thorough quality assurance measures to mitigate the saturation problem. Some benchmark datasets should remain private or semiprivate to avoid training data contamination.
  • Benchmark implementors should increase standardization by reporting key benchmark implementation details, which will improve reproducibility and comparability across benchmark results (an illustrative run report follows this list).
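As an illustration of the kind of implementation details that last recommendation refers to, a benchmark run report might capture fields like the following. The field names and values are hypothetical, not a standard proposed by the report.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class BenchmarkRunReport:
    """Illustrative record of the details that make a benchmark score
    reproducible and comparable across evaluations."""
    benchmark_name: str         # which public benchmark was run
    benchmark_version: str      # dataset revision or commit hash
    model_name: str             # exact model identifier evaluated
    model_version: str          # checkpoint or API snapshot date
    prompt_template: str        # full template used to pose questions
    sampling_temperature: float
    max_output_tokens: int
    n_questions: int
    scoring_method: str         # e.g., exact letter match, graded rubric
    accuracy: float
    run_date: str               # ISO 8601 date of the run

# Hypothetical example; every value below is a placeholder.
report = BenchmarkRunReport(
    benchmark_name="example-bio-qa",
    benchmark_version="v1.0",
    model_name="example-model",
    model_version="2025-05-01",
    prompt_template="{question}\n{choices}\nAnswer with a single letter.",
    sampling_temperature=0.0,
    max_output_tokens=16,
    n_questions=500,
    scoring_method="exact letter match",
    accuracy=0.82,
    run_date="2025-05-15",
)
print(json.dumps(asdict(report), indent=2))
```

Publishing such a record alongside each reported score would let independent evaluators reproduce runs and compare results across models and laboratories.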


Document Details

Citation (Chicago Manual of Style)

Dev, Sunishchal, Charles Teague, Grant Ellison, Kyle Brady, Jeffrey Lee, Sarah L. Gebauer, Henry Alexander Bradley, Dawid Maciorowski, Bria Persaud, Jordan Despanie, Barbara Del Castello, Alyssa Worland, Michael Miller, Adrian Salas, Dave Nguyen, James Liu, Jason Johnson, Andrew Sloan, Will Stonehouse, Travis Merrill, Thomas Goode, Greg McKelvey, Jr., and Ella Guest, Toward Comprehensive Benchmarking of the Biological Knowledge of Frontier Large Language Models. Santa Monica, CA: RAND Corporation, 2025. https://www.rand.org/pubs/research_reports/RRA3797-1.html.


This publication is part of the RAND research report series. Research reports present research findings and objective analysis that address the challenges facing the public and private sectors. All RAND research reports undergo rigorous peer review to ensure high standards for research quality and objectivity.

This document and trademark(s) contained herein are protected by law. This representation of RAND intellectual property is provided for noncommercial use only. Unauthorized posting of this publication online is prohibited; linking directly to this product page is encouraged. Permission is required from RAND to reproduce, or reuse in another form, any of its research documents for commercial purposes. For information on reprint and reuse permissions, please visit www.rand.org/pubs/permissions.

RAND is a nonprofit institution that helps improve policy and decisionmaking through research and analysis. RAND's publications do not necessarily reflect the opinions of its research clients and sponsors.

Version Note

This publication supersedes a previous version published in 2025 (WR-A3797-1).