Triangulating LLM Progress through Benchmarks, Games, and Cognitive Tests
This work addresses the challenge of assessing LLM progress effectively, for researchers and developers alike, and proposes new evaluation paradigms. The contribution is incremental: it builds on existing methods and introduces neither a novel model nor a breakthrough.
The study evaluates large language models (LLMs) by comparing standard benchmarks, interactive games, and cognitive tests. It finds that interactive games discriminate model quality better than standard benchmarks, and that cognitive abilities correlate to differing degrees with the two evaluation methods.
We examine three evaluation paradigms: standard benchmarks (e.g., MMLU and BBH), interactive games (e.g., Signalling Games or Taboo), and cognitive tests (e.g., for working memory or theory of mind). First, we investigate which of the former two (benchmarks or games) is most effective at discriminating LLMs of varying quality. Then, inspired by human cognitive assessments, we compile a suite of targeted tests that measure cognitive abilities deemed essential for effective language use, and we investigate their correlation with model performance in benchmarks and games. Our analyses reveal that interactive games are superior to standard benchmarks in discriminating models. Causal and logical reasoning correlate with both static and interactive tests, while differences emerge regarding core executive functions and social/emotional skills, which correlate more with games. We advocate for the development of new interactive benchmarks and targeted cognitive tasks inspired by assessing human abilities but designed specifically for LLMs.
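The correlation analysis described above can be illustrated with a minimal sketch. It assumes each model has one score per cognitive test and one aggregate score per evaluation paradigm, and it uses Pearson's coefficient; the abstract does not state which correlation measure the study uses, so that choice is an assumption. The function names, model names, and all numbers below are hypothetical placeholders, not results from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def correlate_ability(ability_scores, paradigm_scores):
    """Correlate one cognitive ability with each evaluation paradigm.

    ability_scores:  {model_name: score on a cognitive test}
    paradigm_scores: {paradigm_name: {model_name: aggregate score}}
    Returns {paradigm_name: Pearson r across models}.
    """
    models = sorted(ability_scores)
    ability = [ability_scores[m] for m in models]
    return {
        paradigm: pearson(ability, [scores[m] for m in models])
        for paradigm, scores in paradigm_scores.items()
    }


if __name__ == "__main__":
    # Hypothetical placeholder scores for four hypothetical models.
    working_memory = {"model_a": 0.41, "model_b": 0.55, "model_c": 0.63, "model_d": 0.80}
    paradigms = {
        "benchmarks": {"model_a": 0.62, "model_b": 0.60, "model_c": 0.71, "model_d": 0.74},
        "games": {"model_a": 0.20, "model_b": 0.35, "model_c": 0.48, "model_d": 0.77},
    }
    for name, r in correlate_ability(working_memory, paradigms).items():
        print(f"working memory vs {name}: r = {r:.2f}")
```

With only a handful of models, a coefficient of this kind is a rough indicator at best; a rank-based measure such as Spearman's could be swapped in if the scores are not on comparable scales.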