AS · SD · Mar 30, 2021

MediaSpeech: Multilanguage ASR Benchmark and Dataset

arXiv:2103.16193v1 · 32 citations · Has Code
Originality: Synthesis-oriented
AI Analysis

The paper addresses the limited and often proprietary ASR evaluation resources available to researchers and vendors. Its contribution is incremental: it extends existing dataset-creation efforts to new languages and domains.

The authors tackle the lack of open-source, multilingual ASR evaluation datasets by creating MediaSpeech, a 10-hour dataset covering Spanish, French, Turkish, and Arabic whose manual transcriptions have an estimated WER under 5%. They also benchmark a range of ASR systems and provide baseline models.

The performance of automated speech recognition (ASR) systems is well known to differ across application domains. At the same time, vendors and research groups typically report ASR quality results either for limited-use, simplistic domains (audiobooks, TED talks) or for proprietary datasets. To fill this gap, we provide an open-source 10-hour ASR system evaluation dataset, NTR MediaSpeech, for 4 languages: Spanish, French, Turkish and Arabic. The dataset was collected from the official YouTube channels of media in the respective languages and manually transcribed. We estimate that the WER of the dataset is under 5%. We have benchmarked many ASR systems available both commercially and freely, and provide the benchmark results. We also open-source baseline QuartzNet models for each language.
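The abstract reports quality in terms of word error rate (WER). As a rough illustration of that metric, and not the authors' evaluation code, the sketch below computes WER as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function names `word_edit_distance` and `corpus_wer` are hypothetical, and the sketch assumes plain whitespace tokenization with no text normalization (casing, punctuation, diacritics), which a real benchmark would need to define.

```python
from typing import Iterable, List, Tuple


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance between two word sequences."""
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1]


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """WER over (reference, hypothesis) pairs: total errors / total reference words."""
    errors = 0
    ref_words = 0
    for reference, hypothesis in pairs:
        ref = reference.split()  # assumption: whitespace tokenization only
        hyp = hypothesis.split()
        errors += word_edit_distance(ref, hyp)
        ref_words += len(ref)
    if ref_words == 0:
        raise ValueError("references must contain at least one word")
    return errors / ref_words


if __name__ == "__main__":
    # Made-up example sentences, not taken from the dataset.
    pairs = [
        ("el gato come pescado", "el gato comes"),
        ("bonjour tout le monde", "bonjour tout le monde"),
    ]
    print(f"WER: {corpus_wer(pairs):.2%}")
```

Pooling errors over the whole corpus, rather than averaging per-utterance rates, keeps short utterances from dominating the score; whether the benchmark here pools or averages is not stated in the abstract.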

Code Implementations: 1 repo