Judge Reliability Harness
RAND researchers developed the Judge Reliability Harness, an open-source library that orchestrates standardized, reproducible evaluations of large language model–based judges through systematic perturbation testing and human-in-the-loop validation.