CL | Oct 23, 2023

Establishing Vocabulary Tests as a Benchmark for Evaluating Large Language Models

arXiv:2310.14703v2 | 5 citations | h-index: 9
Originality: Synthesis-oriented
AI Analysis

The paper addresses the need for better evaluation of LLMs' fundamental linguistic skills, though its contribution is incremental because it revives an existing method.

The paper tackles the problem of evaluating Large Language Models (LLMs) by advocating for vocabulary tests as a benchmark, revealing gaps in lexical knowledge across seven models and two languages.

Vocabulary tests, once a cornerstone of language modeling evaluation, have been largely overlooked in the current landscape of Large Language Models (LLMs) like Llama, Mistral, and GPT. While most LLM evaluation benchmarks focus on specific tasks or domain-specific knowledge, they often neglect the fundamental linguistic aspects of language understanding and production. In this paper, we advocate for the revival of vocabulary tests as a valuable tool for assessing LLM performance. We evaluate seven LLMs using two vocabulary test formats across two languages and uncover surprising gaps in their lexical knowledge. These findings shed light on the intricacies of LLM word representations, their learning mechanisms, and performance variations across models and languages. Moreover, the ability to automatically generate and perform vocabulary tests offers new opportunities to expand the approach and provide a more complete picture of LLMs' language skills.

Code Implementations: 1 repo
