MM AI CL CV LG SD
Jan 16, 2023

OLKAVS: An Open Large-Scale Korean Audio-Visual Speech Dataset

arXiv:2301.06375v2
4 citations
h-index: 20
Originality: Synthesis-oriented
AI Analysis

OLKAVS provides a foundational resource for Korean multi-modal speech research and addresses a gap in non-English datasets. The contribution is incremental, however, because it extends existing dataset concepts to a new language and scale.

The authors tackled the lack of large-scale, multi-view Korean audio-visual speech datasets by creating OLKAVS, which contains 1,150 hours of transcribed audio from 1,107 speakers with nine viewpoints and noise variations, and showed that multi-modal and multi-view training outperforms uni-modal and frontal-view-only approaches.

Inspired by humans comprehending speech in a multi-modal manner, various audio-visual datasets have been constructed. However, most existing datasets focus on English, induce dependencies with various prediction models during dataset preparation, and have only a small number of multi-view videos. To mitigate the limitations, we recently developed the Open Large-scale Korean Audio-Visual Speech (OLKAVS) dataset, which is the largest among publicly available audio-visual speech datasets. The dataset contains 1,150 hours of transcribed audio from 1,107 Korean speakers in a studio setup with nine different viewpoints and various noise situations. We also provide the pre-trained baseline models for two tasks, audio-visual speech recognition and lip reading. We conducted experiments based on the models to verify the effectiveness of multi-modal and multi-view training over uni-modal and frontal-view-only training. We expect the OLKAVS dataset to facilitate multi-modal research in broader areas such as Korean speech recognition, speaker recognition, pronunciation level classification, and mouth motion analysis.

Code Implementations: 1 repo