CL, SD, AS
Sep 17, 2024

WER We Stand: Benchmarking Urdu ASR Models

arXiv:2409.11252v3, 21 citations, h-index: 16
AI Analysis

The paper addresses the problem of evaluating ASR for low-resource languages such as Urdu and offers insights for developers, but it is incremental in that it applies existing methods to new data.

This paper benchmarks Urdu automatic speech recognition (ASR) models and introduces the first conversational speech dataset for benchmarking Urdu ASR. It finds that seamless-large performs best on read speech and whisper-large performs best on conversational speech.

This paper presents a comprehensive evaluation of Urdu Automatic Speech Recognition (ASR) models. We analyze the performance of three ASR model families: Whisper, MMS, and Seamless-M4T using Word Error Rate (WER), along with a detailed examination of the most frequent wrong words and error types including insertions, deletions, and substitutions. Our analysis is conducted using two types of datasets, read speech and conversational speech. Notably, we present the first conversational speech dataset designed for benchmarking Urdu ASR models. We find that seamless-large outperforms other ASR models on the read speech dataset, while whisper-large performs best on the conversational speech dataset. Furthermore, this evaluation highlights the complexities of assessing ASR models for low-resource languages like Urdu using quantitative metrics alone and emphasizes the need for a robust Urdu text normalization system. Our findings contribute valuable insights for developing robust ASR systems for low-resource languages like Urdu.
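To make the metric concrete, the sketch below shows one common way to compute WER together with its breakdown into substitutions, deletions, and insertions, using a word-level edit-distance alignment. It is an illustration only, not the paper's evaluation code: the names `ErrorCounts` and `count_errors` are hypothetical, tokens are assumed to be whitespace-separated, and no text normalization is applied (which, as the abstract notes, matters a great deal for Urdu).

```python
from typing import NamedTuple


class ErrorCounts(NamedTuple):
    """Hypothetical container for the error breakdown of one utterance."""
    substitutions: int
    deletions: int
    insertions: int
    ref_len: int

    @property
    def wer(self) -> float:
        # WER = (S + D + I) / N, where N is the number of reference words.
        return (self.substitutions + self.deletions + self.insertions) / self.ref_len


def count_errors(reference: str, hypothesis: str) -> ErrorCounts:
    """Align hypothesis to reference at word level and count S, D, I.

    Assumes whitespace tokenization and no normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # dp[i][j] = (cost, S, D, I) for aligning ref[:i] with hyp[:j].
    dp = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)  # every reference word deleted
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)  # every hypothesis word inserted

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]  # match, no extra cost
                continue
            c, s, d, n = dp[i - 1][j - 1]
            substitution = (c + 1, s + 1, d, n)
            c, s, d, n = dp[i - 1][j]
            deletion = (c + 1, s, d + 1, n)
            c, s, d, n = dp[i][j - 1]
            insertion = (c + 1, s, d, n + 1)
            # Pick the cheapest option by total edit cost.
            dp[i][j] = min(substitution, deletion, insertion, key=lambda t: t[0])

    _, s, d, n = dp[len(ref)][len(hyp)]
    return ErrorCounts(s, d, n, len(ref))


if __name__ == "__main__":
    # Placeholder tokens stand in for words; real use would pass transcripts.
    result = count_errors("a b c d", "a x c d e")
    print(result, result.wer)
```

When several alignments share the same minimum cost, the split between substitutions, deletions, and insertions can depend on tie-breaking, so breakdowns from different tools may differ slightly even when the WER itself agrees.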
