CL SD AS, Sep 18, 2024

ASR Benchmarking: Need for a More Representative Conversational Dataset

arXiv:2409.12042v1, 2 citations, h-index: 3
Originality: Synthesis-oriented
AI Analysis

This paper addresses the need for more realistic speech recognition benchmarks, particularly for conversational AI applications, but its contribution is incremental: it primarily introduces a new dataset.

The study tackles the problem that existing ASR benchmarks such as LibriSpeech fail to capture real-world conversational complexity. It finds that state-of-the-art ASR models show a significant performance drop on a new multilingual conversational dataset derived from TalkBank, and that Word Error Rate correlates with speech disfluencies.

Automatic Speech Recognition (ASR) systems have achieved remarkable performance on widely used benchmarks such as LibriSpeech and Fleurs. However, these benchmarks do not adequately reflect the complexities of real-world conversational environments, where speech is often unstructured, contains disfluencies such as pauses and interruptions, and comes in diverse accents. In this study, we introduce a multilingual conversational dataset, derived from TalkBank, consisting of unstructured phone conversations between adults. Our results show a significant performance drop across various state-of-the-art ASR models when tested in conversational settings. Furthermore, we observe a correlation between Word Error Rate and the presence of speech disfluencies, highlighting the critical need for more realistic, conversational ASR benchmarks.
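
As a rough illustration of the analysis the abstract describes, the sketch below computes Word Error Rate with a word-level edit distance and then measures how it correlates with a per-utterance disfluency count. It is not the authors' code: the filler-word list `FILLERS`, the helper names `wer`, `count_fillers`, and `pearson`, and the toy transcripts are all assumptions made for this example, and the paper may define and detect disfluencies differently.

```python
import math

# Hypothetical filler-word list, used here as a crude proxy for disfluencies.
FILLERS = {"uh", "um", "er", "hmm"}

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

def count_fillers(transcript: str) -> int:
    """Count tokens in the transcript that appear in the assumed filler list."""
    return sum(1 for word in transcript.lower().split() if word in FILLERS)

def pearson(xs, ys) -> float:
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

if __name__ == "__main__":
    # Toy (reference, hypothesis) pairs invented for illustration only.
    pairs = [
        ("i will call you tomorrow", "i will call you tomorrow"),
        ("um i think uh we should go", "i think we should go"),
        ("uh um er i do not know hmm", "i do not no"),
    ]
    wers = [wer(ref, hyp) for ref, hyp in pairs]
    fillers = [count_fillers(ref) for ref, _ in pairs]
    print("WER per utterance:", wers)
    print("Filler count per utterance:", fillers)
    print("Pearson r:", pearson(fillers, wers))
```

In practice the references would come from the dataset's transcripts and the hypotheses from each ASR model, with text normalization applied to both before scoring; a rank correlation could be substituted for Pearson if the relationship is not expected to be linear.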

Code Implementations: 1 repo
