MM AI CL CV LG SD

Jan 16, 2023

OLKAVS: An Open Large-Scale Korean Audio-Visual Speech Dataset

Jeongkyun Park, Jung-Wook Hwang, Kwanghee Choi, Seung-Hyun Lee, Jun Hwan Ahn, Rae-Hong Park, Hyung-Min Park

arXiv:2301.06375v22.34 citations

h-index: 20

Has Code

Originality: Synthesis-oriented

AI Analysis

OLKAVS provides a foundational resource for Korean multi-modal speech research and addresses a gap in non-English datasets, though its contribution is incremental: it extends existing dataset concepts to a new language and scale.

The authors address the lack of large-scale, multi-view Korean audio-visual speech datasets by creating OLKAVS, which contains 1,150 hours of transcribed audio from 1,107 speakers, recorded from nine viewpoints under varying noise conditions. Their experiments show that multi-modal and multi-view training outperforms uni-modal and frontal-view-only training.

Inspired by humans comprehending speech in a multi-modal manner, various audio-visual datasets have been constructed. However, most existing datasets focus on English, induce dependencies with various prediction models during dataset preparation, and have only a small number of multi-view videos. To mitigate the limitations, we recently developed the Open Large-scale Korean Audio-Visual Speech (OLKAVS) dataset, which is the largest among publicly available audio-visual speech datasets. The dataset contains 1,150 hours of transcribed audio from 1,107 Korean speakers in a studio setup with nine different viewpoints and various noise situations. We also provide the pre-trained baseline models for two tasks, audio-visual speech recognition and lip reading. We conducted experiments based on the models to verify the effectiveness of multi-modal and multi-view training over uni-modal and frontal-view-only training. We expect the OLKAVS dataset to facilitate multi-modal research in broader areas such as Korean speech recognition, speaker recognition, pronunciation level classification, and mouth motion analysis.
