Rae‐Hong Park

h-index37

4papers

137citations

Novelty43%

AI Score31

Ranked #132,979 of 194,257 authors (top 68%)#43,899 in CV (top 74%)

4 Papers

2.3MMJan 16, 2023Code

OLKAVS: An Open Large-Scale Korean Audio-Visual Speech Dataset

Jeongkyun Park, Jung-Wook Hwang, Kwanghee Choi et al.

Inspired by humans comprehending speech in a multi-modal manner, various audio-visual datasets have been constructed. However, most existing datasets focus on English, induce dependencies with various prediction models during dataset preparation, and have only a small number of multi-view videos. To mitigate the limitations, we recently developed the Open Large-scale Korean Audio-Visual Speech (OLKAVS) dataset, which is the largest among publicly available audio-visual speech datasets. The dataset contains 1,150 hours of transcribed audio from 1,107 Korean speakers in a studio setup with nine different viewpoints and various noise situations. We also provide the pre-trained baseline models for two tasks, audio-visual speech recognition and lip reading. We conducted experiments based on the models to verify the effectiveness of multi-modal and multi-view training over uni-modal and frontal-view-only training. We expect the OLKAVS dataset to facilitate multi-modal research in broader areas such as Korean speech recognition, speaker recognition, pronunciation level classification, and mouth motion analysis.

8.4CVMay 7, 2025

SwinLip: An Efficient Visual Speech Encoder for Lip Reading Using Swin Transformer

Young-Hu Park, Rae-Hong Park, Hyung-Min Park

This paper presents an efficient visual speech encoder for lip reading. While most recent lip reading studies have been based on the ResNet architecture and have achieved significant success, they are not sufficiently suitable for efficiently capturing lip reading features due to high computational complexity in modeling spatio-temporal information. Additionally, using a complex visual model not only increases the complexity of lip reading models but also induces delays in the overall network for multi-modal studies (e.g., audio-visual speech recognition, speech enhancement, and speech separation). To overcome the limitations of Convolutional Neural Network (CNN)-based models, we apply the hierarchical structure and window self-attention of the Swin Transformer to lip reading. We configure a new lightweight scale of the Swin Transformer suitable for processing lip reading data and present the SwinLip visual speech encoder, which efficiently reduces computational load by integrating modified Convolution-augmented Transformer (Conformer) temporal embeddings with conventional spatial embeddings in the hierarchical structure. Through extensive experiments, we have validated that our SwinLip successfully improves the performance and inference speed of the lip reading network when applied to various backbones for word and sentence recognition, reducing computational load. In particular, our SwinLip demonstrated robust performance in both English LRW and Mandarin LRW-1000 datasets and achieved state-of-the-art performance on the Mandarin LRW-1000 dataset with less computation compared to the existing state-of-the-art model.

3.6CVSep 14, 2015

Color-Phase Analysis for Sinusoidal Structured Light in Rapid Range Imaging

Changsoo Je, Sang Wook Lee, Rae-Hong Park

Active range sensing using structured-light is the most accurate and reliable method for obtaining 3D information. However, most of the work has been limited to range sensing of static objects, and range sensing of dynamic (moving or deforming) objects has been investigated recently only by a few researchers. Sinusoidal structured-light is one of the well-known optical methods for 3D measurement. In this paper, we present a novel method for rapid high-resolution range imaging using color sinusoidal pattern. We consider the real-world problem of nonlinearity and color-band crosstalk in the color light projector and color camera, and present methods for accurate recovery of color-phase. For high-resolution ranging, we use high-frequency patterns and describe new unwrapping algorithms for reliable range recovery. The experimental results demonstrate the effectiveness of our methods.

5.4CVAug 20, 2015

High-Contrast Color-Stripe Pattern for Rapid Structured-Light Range Imaging

Changsoo Je, Sang Wook Lee, Rae-Hong Park

For structured-light range imaging, color stripes can be used for increasing the number of distinguishable light patterns compared to binary BW stripes. Therefore, an appropriate use of color patterns can reduce the number of light projections and range imaging is achievable in single video frame or in "one shot". On the other hand, the reliability and range resolution attainable from color stripes is generally lower than those from multiply projected binary BW patterns since color contrast is affected by object color reflectance and ambient light. This paper presents new methods for selecting stripe colors and designing multiple-stripe patterns for "one-shot" and "two-shot" imaging. We show that maximizing color contrast between the stripes in one-shot imaging reduces the ambiguities resulting from colored object surfaces and limitations in sensor/projector resolution. Two-shot imaging adds an extra video frame and maximizes the color contrast between the first and second video frames to diminish the ambiguities even further. Experimental results demonstrate the effectiveness of the presented one-shot and two-shot color-stripe imaging schemes.