Resurrecting saturated LLM benchmarks with adversarial encoding
This work addresses benchmark saturation in LLM evaluation and offers a method for resurrecting old benchmarks, though it is incremental because it builds on known adversarial techniques.
The study tackles saturated LLM benchmarks by introducing adversarial encodings, namely pairing questions and adding answer options. On WMDP-bio, GPQA, and MMLU variants, these encodings predictably reduced the performance of more capable models, effectively raising each benchmark's performance ceiling and unsaturating it.
Recent work showed that small changes to benchmark questions can reduce LLMs' reasoning and recall. We explore two such changes, pairing questions and adding more answer options, on three benchmarks: WMDP-bio, GPQA, and MMLU variants. We find that for more capable models, these changes predictably reduce performance, essentially raising the performance ceiling of a benchmark and unsaturating it again. We suggest this approach can resurrect old benchmarks.
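
The two changes described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `MCQ` class, the function names, the prompt wording, and the rule that a paired item counts as correct only when both answers are right are all assumptions made for illustration. The sketch also shows why both changes lower the random-guessing baseline, which is one way they create headroom on a multiple-choice benchmark.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple
import string


@dataclass(frozen=True)
class MCQ:
    """A hypothetical multiple-choice item (not the paper's data format)."""
    question: str
    options: Tuple[str, ...]
    answer_index: int  # index of the correct option


def add_options(item: MCQ, distractors: Sequence[str]) -> MCQ:
    """Append extra wrong options; the correct answer keeps its index."""
    extra = tuple(d for d in distractors if d not in item.options)
    return MCQ(item.question, item.options + extra, item.answer_index)


def pair_questions(first: MCQ, second: MCQ) -> Tuple[str, str]:
    """Render two items as one prompt and return (prompt, expected answer).

    Assumption: the paired item is scored as correct only if both
    letters match, so the expected answer is a string such as "B,A".
    """
    parts = []
    for number, item in enumerate((first, second), start=1):
        lines = [f"Question {number}: {item.question}"]
        for letter, option in zip(string.ascii_uppercase, item.options):
            lines.append(f"{letter}. {option}")
        parts.append("\n".join(lines))
    prompt = "\n\n".join(parts) + "\n\nAnswer both questions with two letters, e.g. 'A,B'."
    expected = ",".join(
        string.ascii_uppercase[item.answer_index] for item in (first, second)
    )
    return prompt, expected


def chance_accuracy(option_counts: Sequence[int]) -> float:
    """Probability of guessing every question in a group correctly."""
    probability = 1.0
    for count in option_counts:
        probability /= count
    return probability


q1 = MCQ("2 + 2 = ?", ("3", "4", "5", "6"), 1)
q2 = MCQ("What is the capital of France?", ("Paris", "Rome", "Madrid", "Berlin"), 0)

prompt, expected = pair_questions(q1, q2)
widened = add_options(q1, ["7", "8"])
```

Under these assumptions, a single four-option question has a guessing baseline of `chance_accuracy([4]) == 0.25`, a pair of four-option questions has `chance_accuracy([4, 4]) == 0.0625`, and widening a question to six options gives a baseline of 1/6. The performance drops reported for capable models are an empirical finding of the work and are not reproduced by this sketch.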