CL · Oct 14, 2020

Google Crowdsourced Speech Corpora and Related Open-Source Resources for Low-Resource Languages and Dialects: An Overview

Alena Butryna, Shan-Hui Cathy Chu, Isin Demirsahin, Alexander Gutkin, Linne Ha, Fei He, Martin Jansche, Cibu Johny, Anna Katanova, Oddur Kjartansson, Chenfang Li, Tatiana Merkulova

arXiv:2010.06778v11.318 citations · Has Code

Originality: Synthesis-oriented

AI Analysis

This work provides essential data for low-resource language communities, enabling speech technology development, but it is incremental as it builds on existing crowdsourcing methods.

The paper addresses the lack of freely available speech resources for under-represented languages by releasing 38 datasets for text-to-speech and automatic speech recognition applications across multiple regions, including South and Southeast Asia, Africa, Europe, and South America.

This paper presents an overview of a program designed to address the growing need for developing freely available speech resources for under-represented languages. At present we have released 38 datasets for building text-to-speech and automatic speech recognition applications for languages and dialects of South and Southeast Asia, Africa, Europe and South America. The paper describes the methodology used for developing such corpora and presents some of our findings that could benefit under-represented language communities.
