AS CL May 28, 2025

Evaluation of LLMs in Speech is Often Flawed: Test Set Contamination in Large Language Models for Speech Recognition

Yuan Tseng, Titouan Parcollet, Rogier van Dalen, Shucong Zhang, Sourav Bhattacharya

Cambridge

arXiv:2505.22251v2, 8.06 citations, h-index: 13

Originality: Synthesis-oriented

AI Analysis

This work exposes flawed evaluation practices in speech recognition research that may have produced misleading claims about LLM improvements.

The paper reveals that parts of two common speech recognition evaluation sets (LibriSpeech and Common Voice) appear in LLM pretraining data, which undermines performance gains reported on them. Experiments show that contaminated LLMs assign higher probabilities to transcriptions they have seen, though error rates change only subtly.

Recent work suggests that large language models (LLMs) can improve performance on speech tasks compared to existing systems. To support these claims, results on LibriSpeech and Common Voice are often quoted. However, this work finds that a substantial portion of the LibriSpeech and Common Voice evaluation sets appears in public LLM pretraining corpora. This calls into question the reliability of findings drawn from these two datasets. To measure the impact of contamination, LLMs trained with and without contamination are compared. A contaminated LLM is more likely to generate test sentences it has seen during training. Then, speech recognisers based on LLMs are compared. They show only subtle error rate differences if the LLM is contaminated, but assign significantly higher probabilities to transcriptions seen during LLM training. Results show that LLM outputs can be biased by tiny amounts of data contamination, highlighting the importance of evaluating LLM-based speech systems with held-out data.
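As a rough illustration of what "evaluation sets appear in pretraining corpora" can mean in practice, the sketch below counts how many test transcriptions occur verbatim in a set of corpus documents after simple text normalisation. The abstract does not describe the paper's actual detection procedure, so this is not it: the function names `normalise` and `contaminated_fraction`, the exact-match criterion, and the `min_words` threshold are all assumptions made for illustration.

```python
import re


def normalise(text: str) -> str:
    """Lowercase, replace anything that is not a letter, digit,
    apostrophe or space with a space, then collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    return " ".join(text.split())


def contaminated_fraction(test_sentences, corpus_documents, min_words=5):
    """Fraction of test sentences found verbatim in any corpus document.

    Hypothetical helper: sentences shorter than `min_words` are skipped,
    because very short transcriptions match by chance.
    """
    # Pad with spaces so that matches are aligned to word boundaries.
    corpus = [f" {normalise(doc)} " for doc in corpus_documents]
    eligible = [
        s for s in map(normalise, test_sentences)
        if len(s.split()) >= min_words
    ]
    if not eligible:
        return 0.0
    hits = sum(any(f" {s} " in doc for doc in corpus) for s in eligible)
    return hits / len(eligible)
```

Exact matching after normalisation is a deliberately conservative choice: it misses paraphrased or partially overlapping text, so a figure computed this way is best read as a lower bound on overlap. Scanning a real pretraining corpus would also need an index rather than a linear pass over documents.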
