YODAS: YouTube-Oriented Dataset for Audio and Speech
YODAS is a valuable resource for researchers and practitioners in speech and audio processing because it broadens access to multilingual data, though the contribution is incremental since it builds on existing dataset collection methods.
The authors address the lack of large-scale, publicly available multilingual speech datasets by introducing YODAS, a dataset of over 500k hours of speech in more than 100 languages sourced from YouTube. It includes labeled subsets for supervised training and unlabeled subsets for self-supervised learning.
In this study, we introduce YODAS (YouTube-Oriented Dataset for Audio and Speech), a large-scale, multilingual dataset currently comprising over 500k hours of speech data in more than 100 languages, sourced from both labeled and unlabeled YouTube speech datasets. The labeled subsets, which include manual or automatic subtitles, facilitate supervised model training. Conversely, the unlabeled subsets are suited to self-supervised learning applications. YODAS is distinctive as the first publicly available dataset of its scale, and it is distributed under a Creative Commons license. We describe the collection methodology used for YODAS, which contributes to large-scale speech dataset construction. Subsequently, we provide a comprehensive analysis of the speech and text contained within the dataset. Finally, we describe speech recognition baselines for the top 15 languages.