Holmes: A Benchmark to Assess the Linguistic Competence of Language Models
The work addresses the need to disentangle linguistic competence from other cognitive abilities in language model evaluation, though its contribution is incremental because it builds on existing probing studies.
The authors introduce Holmes, a benchmark that assesses language models' linguistic competence through classifier-based probing on more than 200 datasets covering syntax, morphology, semantics, reasoning, and discourse. An analysis of over 50 models shows that linguistic competence correlates with model size but, surprisingly, model architecture and instruction tuning also significantly affect performance, particularly in morphology and syntax.
We introduce Holmes, a new benchmark designed to assess language models' (LMs') linguistic competence, that is, their unconscious understanding of linguistic phenomena. Specifically, we use classifier-based probing to examine LMs' internal representations regarding distinct linguistic phenomena (e.g., part-of-speech tagging). As a result, we meet recent calls to disentangle LMs' linguistic competence from other cognitive abilities, such as following instructions in prompting-based evaluations. To compose Holmes, we review over 270 probing studies and include more than 200 datasets to assess syntax, morphology, semantics, reasoning, and discourse phenomena. Analyzing over 50 LMs reveals that, in line with known trends, their linguistic competence correlates with model size. However, surprisingly, model architecture and instruction tuning also significantly influence performance, particularly in morphology and syntax. Finally, we propose FlashHolmes, a streamlined version that reduces the computational load while maintaining high ranking precision.
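To make the idea of classifier-based probing concrete, the following is a minimal sketch and not the Holmes implementation. It assumes synthetic vectors stand in for frozen LM representations and that a binary label stands in for a linguistic property; the function names (`make_synthetic_representations`, `train_linear_probe`, `probe_accuracy`) and all hyperparameters are hypothetical. The representations are never updated: only a small linear classifier is trained on top of them, and its held-out accuracy serves as a proxy for how much of the property the representations encode.

```python
import math
import random


def make_synthetic_representations(n, dim, seed):
    """Stand-in for frozen LM embeddings (assumption: not real model outputs).

    The binary label shifts the first two dimensions, so the property
    is linearly recoverable from the vectors.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        shift = 1.5 if label == 1 else -1.5
        vec[0] += shift
        vec[1] += shift
        data.append((vec, label))
    return data


def train_linear_probe(data, dim, epochs=50, lr=0.1):
    """Fit a logistic-regression probe with plain SGD on fixed vectors."""
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        for vec, label in data:
            z = sum(w * x for w, x in zip(weights, vec)) + bias
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() stable
            pred = 1.0 / (1.0 + math.exp(-z))
            err = pred - label  # gradient of the log loss w.r.t. z
            weights = [w - lr * err * x for w, x in zip(weights, vec)]
            bias -= lr * err
    return weights, bias


def probe_accuracy(weights, bias, data):
    """Share of examples whose label the probe predicts correctly."""
    correct = 0
    for vec, label in data:
        z = sum(w * x for w, x in zip(weights, vec)) + bias
        correct += int((z > 0) == (label == 1))
    return correct / len(data)


# Train on one sample of "representations" and evaluate on a held-out one.
train = make_synthetic_representations(400, 8, seed=0)
test = make_synthetic_representations(200, 8, seed=1)
weights, bias = train_linear_probe(train, dim=8)
accuracy = probe_accuracy(weights, bias, test)
print(f"held-out probe accuracy: {accuracy:.2f}")
```

In an actual probing setup, the synthetic vectors would be replaced by representations extracted from the LM under study, and the labels by annotations of the linguistic phenomenon of interest (e.g., part-of-speech tags).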