AI, LG
Aug 30, 2023

Speech Wikimedia: A 77 Language Multilingual Speech Dataset

arXiv:2308.15710v1, 4 citations, h-index: 18
Originality: Synthesis-oriented
AI Analysis

This dataset addresses the need for large-scale, diverse multilingual speech data among researchers and developers in speech and language processing. Its contribution is incremental, since it compiles existing resources.

The authors compiled the Speech Wikimedia Dataset, a publicly available multilingual speech dataset containing 1780 hours of transcribed audio in 77 languages. The audio was extracted from Wikimedia Commons under the CC-BY-SA license and is intended to support training models for speech recognition, speech translation, and machine translation.

The Speech Wikimedia Dataset is a publicly available compilation of audio with transcriptions extracted from Wikimedia Commons. It includes 1780 hours (195 GB) of CC-BY-SA licensed transcribed speech from a diverse set of scenarios and speakers, in 77 different languages. Each audio file has one or more transcriptions in different languages, making this dataset suitable for training speech recognition, speech translation, and machine translation models.
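As a rough illustration of why one audio file with several transcriptions can serve all three tasks, the sketch below derives speech recognition, speech translation, and machine translation training pairs from a single record. The record layout (AudioRecord, audio_path, spoken_lang, transcripts) and the file name are hypothetical and are not taken from the dataset's actual format; the sketch also assumes the spoken language of each file is known.

```python
from dataclasses import dataclass, field
from itertools import permutations
from typing import Dict, List, Tuple


@dataclass
class AudioRecord:
    """Hypothetical record layout; not the dataset's real schema."""
    audio_path: str
    spoken_lang: str  # assumption: the spoken language is known
    transcripts: Dict[str, str] = field(default_factory=dict)  # lang -> text


def derive_pairs(rec: AudioRecord) -> Tuple[List, List, List]:
    """Split one record into ASR, speech-translation, and MT examples."""
    asr, st, mt = [], [], []
    for lang, text in rec.transcripts.items():
        if lang == rec.spoken_lang:
            # Audio plus same-language text: speech recognition.
            asr.append((rec.audio_path, text))
        else:
            # Audio plus other-language text: speech translation.
            st.append((rec.audio_path, lang, text))
    # Every ordered pair of transcripts: text-to-text machine translation.
    for (src, src_text), (tgt, tgt_text) in permutations(rec.transcripts.items(), 2):
        mt.append((src, tgt, src_text, tgt_text))
    return asr, st, mt


example = AudioRecord(
    audio_path="clip_0001.ogg",  # hypothetical file name
    spoken_lang="en",
    transcripts={"en": "Hello world", "fr": "Bonjour le monde", "de": "Hallo Welt"},
)
asr_pairs, st_pairs, mt_pairs = derive_pairs(example)
print(len(asr_pairs), len(st_pairs), len(mt_pairs))
```

A record with n transcriptions yields n(n-1) ordered text pairs, so files transcribed in many languages contribute disproportionately to the machine translation portion.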
