ÌròyìnSpeech: A multi-purpose Yorùbá Speech Corpus
This work addresses the scarcity of speech resources for Yorùbá language technology, though the contribution is incremental, since it builds on existing data collection methods.
The authors tackled the lack of high-quality Yorùbá speech data by creating ÌròyìnSpeech, a multi-purpose corpus with about 48 hours of recordings, enabling a TTS system with as little as 5 hours of speech and achieving a baseline ASR word error rate of 23.8.
We introduce ÌròyìnSpeech, a new corpus motivated by the desire to increase the amount of high-quality, contemporary Yorùbá speech data, which can be used for both Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) tasks. We curated about 23,000 text sentences from the news and creative writing domains under the open CC-BY-4.0 license. To encourage a participatory approach to data creation, we provided 5,000 curated sentences to the Mozilla Common Voice platform to crowd-source the recording and validation of Yorùbá speech data. In total, we created about 42 hours of speech data recorded in-house by 80 volunteers, and 6 hours of validated recordings on the Mozilla Common Voice platform. Our TTS evaluation suggests that a high-fidelity, general-domain, single-speaker Yorùbá voice is possible with as little as 5 hours of speech. Similarly, for ASR we obtained a baseline word error rate (WER) of 23.8.
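For readers unfamiliar with the metric, the following is a minimal sketch of how a word error rate is commonly computed: the word-level edit distance between a reference transcript and a system hypothesis, divided by the number of reference words. It is not the evaluation script used for the reported baseline. The function name `word_error_rate`, the whitespace tokenization, and the Unicode NFC normalization step (so that composed and decomposed forms of Yorùbá tone marks and under-dots compare as equal) are all assumptions made for illustration.

```python
import unicodedata


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the WER, in percent, of `hypothesis` against `reference`.

    Illustrative sketch only: tokenization is plain whitespace splitting,
    and both strings are NFC-normalized first (an assumption, chosen so
    that differently encoded diacritics are not counted as errors).
    """
    ref = unicodedata.normalize("NFC", reference).split()
    hyp = unicodedata.normalize("NFC", hypothesis).split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)


# One substitution and one deletion against four reference words.
print(word_error_rate("a b c d", "a x c"))  # 50.0
```

Because the denominator is the reference length, a hypothesis with many inserted words can yield a value above 100.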