CL, AS · Feb 22, 2025

Understanding Zero-shot Rare Word Recognition Improvements Through LLM Integration

arXiv:2502.16142v12.71 citations · h-index: 1

Originality: Incremental advance

AI Analysis

This work addresses rare word recognition in speech recognition systems. The advance is incremental, as it builds on existing methods for integrating LLMs with ASR.

The study aims to improve rare word recognition in automatic speech recognition (ASR) by integrating a large language model (LLM) with an ASR system. The resulting LLM-ASR architecture outperforms traditional models in zero-shot rare word recognition, with the gains showing up specifically in rare word error rate (R-WER).

In this study, we investigate the integration of a large language model (LLM) with an automatic speech recognition (ASR) system, focusing on rare word recognition performance. Using a 190,000-hour dataset primarily sourced from YouTube and pre-processed with Whisper V3 pseudo-labeling, we demonstrate that, once trained at this scale, the LLM-ASR architecture outperforms traditional Zipformer-Transducer models on the zero-shot rare word recognition task. Our analysis reveals that the LLM contributes significantly to improvements in rare word error rate (R-WER), while the speech encoder primarily determines overall transcription performance, measured by Orthographic Word Error Rate (O-WER) and Normalized Word Error Rate (N-WER). Through extensive ablation studies, we highlight the importance of adapter integration in aligning speech encoder outputs with the LLM's linguistic capabilities. Furthermore, we emphasize the critical role of high-quality labeled data in achieving optimal performance. These findings provide valuable insights into the synergy within LLM-based ASR architectures, paving the way for future advancements in large-scale LLM-based speech recognition systems.
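The three metrics named in the abstract can be illustrated with a minimal sketch. It assumes that O-WER is the standard word error rate on the raw, cased and punctuated text, and that N-WER is the same rate after text normalization (here simply lowercasing and stripping punctuation). The abstract does not define R-WER, so `rare_word_miss_rate` below is only a hypothetical stand-in: the share of rare-word occurrences in the reference that are absent from the hypothesis. All function names and the example sentences are illustrative and not taken from the paper.

```python
import re


def word_errors(ref_words, hyp_words):
    """Word-level Levenshtein distance: substitutions + deletions + insertions."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        curr = [i]
        for j, h in enumerate(hyp_words, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference word
                            curr[j - 1] + 1,      # insertion of a hypothesis word
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def normalize(text):
    """Assumed normalization: lowercase, drop punctuation except apostrophes."""
    return re.sub(r"[^\w\s']", "", text.lower())


def wer(ref, hyp, normalized=False):
    """Word error rate; normalized=False mimics O-WER, True mimics N-WER."""
    if normalized:
        ref, hyp = normalize(ref), normalize(hyp)
    ref_words, hyp_words = ref.split(), hyp.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return word_errors(ref_words, hyp_words) / len(ref_words)


def rare_word_miss_rate(ref, hyp, rare_words):
    """Hypothetical R-WER proxy: fraction of rare reference words missing in hyp."""
    rare = {w.lower() for w in rare_words}
    hyp_set = set(normalize(hyp).split())
    targets = [w for w in normalize(ref).split() if w in rare]
    if not targets:
        return 0.0
    missed = sum(1 for w in targets if w not in hyp_set)
    return missed / len(targets)


ref = "Dr. Okonkwo prescribed Zolpidem."
hyp = "dr okonkwo prescribed zolpidem"
print(wer(ref, hyp))                   # orthographic: casing and punctuation count
print(wer(ref, hyp, normalized=True))  # normalized: they do not
print(rare_word_miss_rate(ref, "dr okonkwo prescribed soul pidem", ["Zolpidem"]))
```

Separating the two views shows why a system can score well on N-WER while still losing points on O-WER for casing and punctuation, and why a rare-word metric needs its own word list rather than being read off the overall error rate.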

