SD AS | Dec 17, 2021

JTubeSpeech: corpus of Japanese speech collected from YouTube for speech recognition and speaker verification

Shinnosuke Takamichi, Ludwig Kürzinger, Takaaki Saeki, Sayaka Shiota, Shinji Watanabe

arXiv:2112.09323v111.730 citations | Has Code

Originality: Synthesis-oriented

AI Analysis

This paper provides a new dataset for Japanese speech recognition and speaker verification, addressing a gap for non-English languages. The contribution is incremental, however, because it applies existing methods to new data.

The authors tackled the lack of open-source large-scale speech corpora for non-English languages by constructing JTubeSpeech, a Japanese speech corpus from YouTube, resulting in over 1,300 hours for ASR and 900 hours for ASV benchmarks.

In this paper, we construct a new Japanese speech corpus called "JTubeSpeech." Although recent end-to-end learning requires large speech corpora, such open-source corpora have not yet been established for languages other than English. We describe the construction of a corpus from YouTube videos and subtitles for speech recognition and speaker verification. Our method can automatically filter the videos and subtitles with almost no language-dependent processes. We consistently employ Connectionist Temporal Classification (CTC)-based techniques for automatic speech recognition (ASR) and a speaker variation-based method for automatic speaker verification (ASV). We build 1) a large-scale Japanese ASR benchmark with more than 1,300 hours of data and 2) 900 hours of data for Japanese ASV.
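The abstract does not spell out how a CTC-based technique can filter subtitles, so the following is only a minimal sketch of one plausible approach, not the authors' implementation. It assumes an acoustic model has already produced per-frame log-posteriors, computes the CTC log-likelihood of a subtitle's label sequence with the standard forward algorithm, and keeps the utterance if the per-frame score clears a threshold. The names `ctc_log_likelihood` and `keep_utterance`, the per-frame normalization, and the threshold value are all hypothetical.

```python
import math

NEG_INF = float("-inf")


def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) over the given values."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_log_likelihood(log_probs, labels, blank=0):
    """Return log P(labels | audio) under CTC.

    log_probs: list of T frames, each a list of V log-probabilities.
    labels:    list of label indices (the subtitle text, already tokenized).
    """
    # Interleave blanks: [blank, l1, blank, l2, ..., blank]
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    size = len(ext)

    # Forward variables for the first frame.
    alpha = [NEG_INF] * size
    alpha[0] = log_probs[0][ext[0]]
    if size > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        new = [NEG_INF] * size
        for s in range(size):
            cands = [alpha[s]]
            if s >= 1:
                cands.append(alpha[s - 1])
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[s - 2])
            total = logsumexp(*cands)
            if total != NEG_INF:
                new[s] = total + log_probs[t][ext[s]]
        alpha = new

    # Valid paths end on the last label or the trailing blank.
    if size == 1:
        return alpha[0]
    return logsumexp(alpha[-1], alpha[-2])


def keep_utterance(log_probs, labels, threshold, blank=0):
    """Hypothetical filter: keep if the per-frame CTC score >= threshold."""
    score = ctc_log_likelihood(log_probs, labels, blank) / len(log_probs)
    return score >= threshold
```

A filter of this kind needs only a tokenizer and an acoustic model for the target language, which is consistent with the abstract's claim of almost no language-dependent processing. The threshold would have to be tuned on held-out data.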
