No-Knowledge Alarms for Misaligned LLMs-as-Judges
The paper addresses the reliability of LLM-based evaluation systems, a concern for AI safety. Its contribution is incremental, since it builds on existing logical consistency methods.
The paper tackles the problem of monitoring LLM judges that evaluate other LLMs when the ground truth is unknown. It exploits the logical consistency of the judges' disagreements to compute the evaluations of their grading ability that remain possible. The result is a set of no-knowledge alarms, based on a Linear Programming formalization, that detect misaligned judges with no false positives.
If we use LLMs as judges to evaluate the complex decisions of other LLMs, who or what monitors the judges? Infinite monitoring chains are inevitable whenever we do not know the ground truth of the decisions by experts and we do not want to trust them. One way to ameliorate our evaluation uncertainty is to exploit logical consistency between disagreeing experts. By observing how LLM judges agree and disagree while grading other LLMs, we can compute the only possible evaluations of their grading ability. For example, if two LLM judges disagree on which tasks a third one completed correctly, they cannot both be 100% correct in their judgments. This logic can be formalized as a Linear Programming problem in the space of integer response counts for any finite test. We use it here to develop no-knowledge alarms for misaligned LLM judges. The alarms can detect, with no false positives, that at least one member of an ensemble of judges is violating a user-specified grading ability requirement.
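The disagreement argument can be illustrated with a minimal sketch. This is not the paper's Linear Programming formulation. It is a simplified pairwise check that assumes binary grades (1 = "task completed correctly", 0 = "not") and a requirement stated as a maximum number of wrong grades per judge. Under those assumptions, every item on which two judges disagree is graded wrongly by exactly one of them, so their error counts sum to at least the number of disagreements. If that number exceeds twice the allowed errors, the two judges cannot both meet the requirement. The function names, judge names, and grade lists below are hypothetical.

```python
from itertools import combinations


def count_disagreements(grades_a, grades_b):
    """Number of items on which two judges gave different binary grades."""
    if len(grades_a) != len(grades_b):
        raise ValueError("both judges must grade the same items")
    return sum(a != b for a, b in zip(grades_a, grades_b))


def pairwise_alarms(grades, max_wrong):
    """Return the judge pairs that cannot both satisfy the requirement.

    grades:    dict mapping a judge name to its list of 0/1 grades
    max_wrong: most wrong grades a judge may have and still be acceptable

    With binary grades, each disagreement is an error for exactly one of
    the two judges, so wrong_a + wrong_b >= disagreements. If the count
    exceeds 2 * max_wrong, at least one judge of the pair is violating
    the requirement, whatever the unknown ground truth is.
    """
    alarms = []
    for (name_a, a), (name_b, b) in combinations(grades.items(), 2):
        d = count_disagreements(a, b)
        if d > 2 * max_wrong:
            alarms.append((name_a, name_b, d))
    return alarms


# Hypothetical grades from three judges on the same ten items.
grades = {
    "judge_a": [1, 1, 1, 1, 0, 0, 0, 0, 1, 1],
    "judge_b": [1, 1, 1, 1, 0, 0, 0, 0, 1, 0],
    "judge_c": [0, 0, 0, 1, 0, 0, 1, 1, 1, 1],
}

# Requirement: at most 1 wrong grade out of 10 (at least 90% correct).
alarms = pairwise_alarms(grades, max_wrong=1)
print(alarms)  # [('judge_a', 'judge_c', 5), ('judge_b', 'judge_c', 6)]
```

An alarm from this check says only that at least one judge in the flagged pair fails the requirement. It does not say which one, and the absence of an alarm does not certify any judge, since judges that agree can be wrong together.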