SD | AS | Dec 17, 2021

JTubeSpeech: corpus of Japanese speech collected from YouTube for speech recognition and speaker verification

arXiv:2112.09323v1 | 30 citations | Has Code
Originality: Synthesis-oriented
AI Analysis

This paper provides a new dataset for Japanese speech recognition and speaker verification, addressing a gap for non-English languages. The contribution is incremental, however, because it applies existing methods to new data.

The authors address the lack of large-scale open-source speech corpora for non-English languages by constructing JTubeSpeech, a Japanese speech corpus collected from YouTube. The result is more than 1,300 hours of data for an ASR benchmark and 900 hours of data for ASV.

In this paper, we construct a new Japanese speech corpus called "JTubeSpeech." Although recent end-to-end learning requires large speech corpora, open-source corpora of that scale have not yet been established for languages other than English. We describe the construction of a corpus from YouTube videos and subtitles for speech recognition and speaker verification. Our method can automatically filter the videos and subtitles with almost no language-dependent processes. We consistently employ Connectionist Temporal Classification (CTC)-based techniques for automatic speech recognition (ASR) and a speaker variation-based method for automatic speaker verification (ASV). We build 1) a large-scale Japanese ASR benchmark with more than 1,300 hours of data and 2) 900 hours of data for Japanese ASV.

