Preliminary suggestions for rigorous GPAI model evaluations
This document presents suggestions that promote the internal validity, external validity, and reproducibility of general-purpose AI (GPAI) evaluations, including benchmark evaluations and human uplift studies, across four stages of the evaluation life cycle.