AS AI CL LG SD | Jul 18, 2024

Framework for Curating Speech Datasets and Evaluating ASR Systems: A Case Study for Polish

arXiv:2408.00005v1 | 1 citation | h-index: 2 | Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses the discoverability and interoperability of speech datasets for ASR researchers and practitioners, particularly for Polish. Its contribution is incremental: it applies existing curation and evaluation methods to a specific language.

The researchers tackled the underutilization of speech datasets by developing a framework to curate them and evaluate ASR systems. Applied to Polish, the framework covers more than 24 datasets and 25 system-model combinations, which the authors describe as the most extensive comparison to date for Polish ASR.

Speech datasets available in the public domain are often underutilized because of challenges in discoverability and interoperability. A comprehensive framework has been designed to survey, catalog, and curate available speech datasets, which allows replicable evaluation of automatic speech recognition (ASR) systems. A case study focused on the Polish language was conducted; the framework was applied to curate more than 24 datasets and evaluate 25 combinations of ASR systems and models. This research constitutes the most extensive comparison to date of both commercial and free ASR systems for the Polish language. It draws insights from 600 system-model-test set evaluations, marking a significant advancement in both scale and comprehensiveness. The results of surveys and performance comparisons are available as interactive dashboards (https://huggingface.co/spaces/amu-cai/pl-asr-leaderboard) along with curated datasets (https://huggingface.co/datasets/amu-cai/pl-asr-bigos-v2, https://huggingface.co/datasets/pelcra/pl-asr-pelcra-for-bigos) and the open challenge call (https://poleval.pl/tasks/task3). Tools used for evaluation are open-sourced (https://github.com/goodmike31/pl-asr-bigos-tools), facilitating replication and adaptation for other languages, as well as continuous expansion with new datasets and systems.
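The scale of the comparison can be illustrated with a minimal sketch. It assumes word error rate (WER) as the scoring metric, which the abstract does not name, and it assumes the 600 evaluations correspond to a full grid of 25 system-model combinations over 24 test sets, which is consistent with the reported figures but not stated explicitly. All names below (`word_error_rate`, `evaluation_grid`, the placeholder system and test set labels) are hypothetical and are not taken from the authors' tools.

```python
from itertools import product


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: WER is used here only as a common ASR metric; the source
    does not say which metrics the framework reports.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the distance between the reference words processed
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def evaluation_grid(systems, test_sets):
    """Enumerate every (system-model, test set) pair to be scored."""
    return list(product(systems, test_sets))


# Hypothetical placeholder labels, not the real systems or datasets.
systems = [f"system_model_{i}" for i in range(25)]
test_sets = [f"test_set_{j}" for j in range(24)]
grid = evaluation_grid(systems, test_sets)
```

Under these assumptions, `len(grid)` is 600, matching the number of evaluations reported in the abstract. Each pair would then be scored by comparing the system's transcripts against the reference transcripts of that test set.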

