Toward Comprehensive Benchmarking of the Biological Knowledge of Frontier Large Language Models

Sunishchal Dev, Charles Teague, Grant Ellison, Kyle Brady, Jeffrey Lee, Sarah L. Gebauer, Henry Alexander Bradley, Dawid Maciorowski, Bria Persaud, Jordan Despanie, et al.

Published Nov 25, 2025

Artificial intelligence (AI) systems demonstrate deep knowledge across a broad variety of scientific domains, including biology and chemistry, and bad actors could misuse some of these systems to develop biological or chemical weapons.

The rapid development of ever-more-capable models requires evaluation mechanisms that allow governments to respond to emerging security risks in a timely manner. Policymakers, industry experts, and third-party evaluators lack a cohesive standard for testing the safety of AI systems. These challenges complicate efforts to determine the degree to which frontier AI systems pose biological or chemical risks.

The authors evaluate how useful frontier AI systems would be to actors attempting such misuse. They focus on custom-tuned versions of open-weight AI models, which can be modified to remove safety guardrails and/or potentially increase biological capabilities. For this report, the authors evaluated 39 of the most-capable models (as of May 2025) against six public biological and chemical knowledge benchmarks and two refusal benchmarks relevant to biological and chemical threats.
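To make the evaluation setup concrete, below is a minimal sketch of how one model could be scored against a multiple-choice knowledge benchmark of the kind used here. It is not the authors' actual harness: the benchmark file format, prompt template, and answer-extraction rule are illustrative assumptions, and `ask_model` stands in for whatever API client is used.

```python
import json

LETTERS = "ABCDEFGH"

def load_benchmark(path: str) -> list[dict]:
    """Load benchmark items. Assumed format: one JSON object per line
    with 'question', 'choices' (list of strings), and 'answer' (index)."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def format_prompt(item: dict) -> str:
    # Pose the question with lettered answer choices (A, B, C, ...).
    choices = "\n".join(
        f"{LETTERS[i]}. {c}" for i, c in enumerate(item["choices"])
    )
    return f"{item['question']}\n{choices}\nAnswer with a single letter."

def score_model(ask_model, items: list[dict]) -> float:
    """Return accuracy: the fraction of items answered correctly.
    `ask_model` is any callable mapping a prompt string to response text."""
    correct = 0
    for item in items:
        reply = ask_model(format_prompt(item)).strip().upper()
        # Treat the first character of the reply as the chosen option.
        if reply[:1] == LETTERS[item["answer"]]:
            correct += 1
    return correct / len(items)
```

Even in a sketch this simple, choices such as the prompt template, sampling temperature, and how an answer is extracted from free text all move the resulting score, which is why the report's recommendations stress documenting them.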

Key Findings

  • Frontier large language models (LLMs), led by reasoning models, are exceeding expert human performance on biology laboratory protocol and graduate-level question-answering benchmarks. All but three models that the authors tested surpassed nonexperts on a graduate-level biology benchmark.
  • Many publicly available biology and chemistry benchmarks are at or approaching saturation by the latest generation of models. Existing frontier models achieve near-maximum performance, meaning the benchmarks will be less useful for measuring capability gains in future models.
  • How fine-tuning LLMs to remove safety training affects dual-use biological capabilities and real-world risk remains unclear. The authors' "unsafety-tuning" was effective in reducing refusals of harmful requests, but it also caused performance drops on knowledge benchmarks. More work is needed to determine whether different training methods and data would improve unsafety-tuned model performance. (A sketch of how a refusal rate can be scored follows this list.)
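To illustrate what the refusal benchmarks measure, here is a minimal sketch of scoring a model's refusal rate over a set of harmful prompts. The keyword heuristic is a stand-in assumption; real evaluations typically use a trained classifier or a grader model to judge refusals, and the report's detection method is not specified here.

```python
# Placeholder phrases that signal a refusal; a real evaluation would use
# a more robust judge than substring matching.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry")

def is_refusal(response: str) -> bool:
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def refusal_rate(ask_model, harmful_prompts: list[str]) -> float:
    """Fraction of harmful prompts the model declines to answer.
    `ask_model` maps a prompt string to the model's text response."""
    refusals = sum(is_refusal(ask_model(p)) for p in harmful_prompts)
    return refusals / len(harmful_prompts)
```

A safety-trained model should score near 1.0 on such a metric; the finding above is that unsafety-tuning pushes this rate down while also lowering scores on the knowledge benchmarks.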

Recommendations

  • Benchmark creators should include human baselines to contextualize model performance, and baseline developers should document their methods for recruiting and testing human baseliners. For the most-realistic expert baselines, each human should be tested only on questions within specific domains of expertise rather than full question sets covering a broader variety of subtopics within a given field.
  • Benchmark creators should focus on making more-challenging and -specialized evaluations with thorough quality assurance measures to mitigate the saturation problem. Some benchmark datasets should remain private or semiprivate to avoid training data contamination.
  • Benchmark implementors should increase standardization by reporting key benchmark implementation details, which will improve reproducibility and comparability across benchmark results (an illustrative run report follows this list).
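As an illustration of the kind of implementation details that last recommendation refers to, a benchmark run report might capture fields like the following. The field names and values are hypothetical, not a standard proposed by the report.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class BenchmarkRunReport:
    """Illustrative record of the details that make a benchmark score
    reproducible and comparable across evaluations."""
    benchmark_name: str         # which public benchmark was run
    benchmark_version: str      # dataset revision or commit hash
    model_name: str             # exact model identifier evaluated
    model_version: str          # checkpoint or API snapshot date
    prompt_template: str        # full template used to pose questions
    sampling_temperature: float
    max_output_tokens: int
    n_questions: int
    scoring_method: str         # e.g., exact letter match, graded rubric
    accuracy: float
    run_date: str               # ISO 8601 date of the run

# Hypothetical example; every value below is a placeholder.
report = BenchmarkRunReport(
    benchmark_name="example-bio-qa",
    benchmark_version="v1.0",
    model_name="example-model",
    model_version="2025-05-01",
    prompt_template="{question}\n{choices}\nAnswer with a single letter.",
    sampling_temperature=0.0,
    max_output_tokens=16,
    n_questions=500,
    scoring_method="exact letter match",
    accuracy=0.82,
    run_date="2025-05-15",
)
print(json.dumps(asdict(report), indent=2))
```

Publishing such a record alongside each reported score would let independent evaluators reproduce runs and compare results across models and laboratories.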


Document Details

Citation (Chicago Manual of Style)

Dev, Sunishchal, Charles Teague, Grant Ellison, Kyle Brady, Jeffrey Lee, Sarah L. Gebauer, Henry Alexander Bradley, Dawid Maciorowski, Bria Persaud, Jordan Despanie, Barbara Del Castello, Alyssa Worland, Michael Miller, Adrian Salas, Dave Nguyen, James Liu, Jason Johnson, Andrew Sloan, Will Stonehouse, Travis Merrill, Thomas Goode, Greg McKelvey, Jr., and Ella Guest, Toward Comprehensive Benchmarking of the Biological Knowledge of Frontier Large Language Models. Santa Monica, CA: RAND Corporation, 2025. https://www.rand.org/pubs/research_reports/RRA3797-1.html.


This publication is part of the RAND research report series. Research reports present research findings and objective analysis that address the challenges facing the public and private sectors. All RAND research reports undergo rigorous peer review to ensure high standards for research quality and objectivity.

This document and trademark(s) contained herein are protected by law. This representation of RAND intellectual property is provided for noncommercial use only. Unauthorized posting of this publication online is prohibited; linking directly to this product page is encouraged. Permission is required from RAND to reproduce, or reuse in another form, any of its research documents for commercial purposes. For information on reprint and reuse permissions, please visit www.rand.org/pubs/permissions.

RAND is a nonprofit institution that helps improve policy and decisionmaking through research and analysis. RAND's publications do not necessarily reflect the opinions of its research clients and sponsors.

Version Note

This publication supersedes a previous version published in 2025 (WR-A3797-1).