Hak Gu Kim

CV
h-index19
14papers
516citations
Novelty51%
AI Score42

14 Papers

CVMar 2, 2024
Causal Mode Multiplexer: A Novel Framework for Unbiased Multispectral Pedestrian Detection

Taeheon Kim, Sebin Shin, Youngjoon Yu et al.

RGBT multispectral pedestrian detection has emerged as a promising solution for safety-critical applications that require day/night operations. However, the modality bias problem remains unsolved as multispectral pedestrian detectors learn the statistical bias in datasets. Specifically, datasets in multispectral pedestrian detection mainly distribute between ROTO (day) and RXTO (night) data; the majority of the pedestrian labels statistically co-occur with their thermal features. As a result, multispectral pedestrian detectors show poor generalization ability on examples beyond this statistical correlation, such as ROTX data. To address this problem, we propose a novel Causal Mode Multiplexer (CMM) framework that effectively learns the causalities between multispectral inputs and predictions. Moreover, we construct a new dataset (ROTX-MP) to evaluate modality bias in multispectral pedestrian detection. ROTX-MP mainly includes ROTX examples not presented in previous datasets. Extensive experiments demonstrate that our proposed CMM framework generalizes well on existing datasets (KAIST, CVC-14, FLIR) and the new ROTX-MP. We will release our new dataset to the public for future research.

CVMar 22, 2024
MSCoTDet: Language-driven Multi-modal Fusion for Improved Multispectral Pedestrian Detection

Taeheon Kim, Sangyun Chung, Damin Yeom et al.

Multispectral pedestrian detection is attractive for around-the-clock applications due to the complementary information between RGB and thermal modalities. However, current models often fail to detect pedestrians in certain cases (e.g., thermal-obscured pedestrians), particularly due to the modality bias learned from statistically biased datasets. In this paper, we investigate how to mitigate modality bias in multispectral pedestrian detection using Large Language Models (LLMs). Accordingly, we design a Multispectral Chain-of-Thought (MSCoT) prompting strategy, which prompts the LLM to perform multispectral pedestrian detection. Moreover, we propose a novel Multispectral Chain-of-Thought Detection (MSCoTDet) framework that integrates MSCoT prompting into multispectral pedestrian detection. To this end, we design a Language-driven Multi-modal Fusion (LMF) strategy that enables fusing the outputs of MSCoT prompting with the detection results of vision-based multispectral pedestrian detection models. Extensive experiments validate that MSCoTDet effectively mitigates modality biases and improves multispectral pedestrian detection.

CVJul 28, 2025
MemoryTalker: Personalized Speech-Driven 3D Facial Animation via Audio-Guided Stylization

Hyung Kyu Kim, Sangmin Lee, Hak Gu Kim

Speech-driven 3D facial animation aims to synthesize realistic facial motion sequences from given audio, matching the speaker's speaking style. However, previous works often require priors such as class labels of a speaker or additional 3D facial meshes at inference, which makes them fail to reflect the speaking style and limits their practical use. To address these issues, we propose MemoryTalker which enables realistic and accurate 3D facial motion synthesis by reflecting speaking style only with audio input to maximize usability in applications. Our framework consists of two training stages: 1-stage is storing and retrieving general motion (i.e., Memorizing), and 2-stage is to perform the personalized facial motion synthesis (i.e., Animating) with the motion memory stylized by the audio-driven speaking style feature. In this second stage, our model learns about which facial motion types should be emphasized for a particular piece of audio. As a result, our MemoryTalker can generate a reliable personalized facial animation without additional prior information. With quantitative and qualitative evaluations, as well as user study, we show the effectiveness of our model and its performance enhancement for personalized facial animation over state-of-the-art methods.

CVJul 28, 2025
Learning Phonetic Context-Dependent Viseme for Enhancing Speech-Driven 3D Facial Animation

Hyung Kyu Kim, Hak Gu Kim

Speech-driven 3D facial animation aims to generate realistic facial movements synchronized with audio. Traditional methods primarily minimize reconstruction loss by aligning each frame with ground-truth. However, this frame-wise approach often fails to capture the continuity of facial motion, leading to jittery and unnatural outputs due to coarticulation. To address this, we propose a novel phonetic context-aware loss, which explicitly models the influence of phonetic context on viseme transitions. By incorporating a viseme coarticulation weight, we assign adaptive importance to facial movements based on their dynamic changes over time, ensuring smoother and perceptually consistent animations. Extensive experiments demonstrate that replacing the conventional reconstruction loss with ours improves both quantitative metrics and visual quality. It highlights the importance of explicitly modeling phonetic context-dependent visemes in synthesizing natural speech-driven 3D facial animation. Project page: https://cau-irislab.github.io/interspeech25/

CVApr 14, 2021
Visual Comfort Aware-Reinforcement Learning for Depth Adjustment of Stereoscopic 3D Images

Hak Gu Kim, Minho Park, Sangmin Lee et al.

Depth adjustment aims to enhance the visual experience of stereoscopic 3D (S3D) images, which accompanied with improving visual comfort and depth perception. For a human expert, the depth adjustment procedure is a sequence of iterative decision making. The human expert iteratively adjusts the depth until he is satisfied with the both levels of visual comfort and the perceived depth. In this work, we present a novel deep reinforcement learning (DRL)-based approach for depth adjustment named VCA-RL (Visual Comfort Aware Reinforcement Learning) to explicitly model human sequential decision making in depth editing operations. We formulate the depth adjustment process as a Markov decision process where actions are defined as camera movement operations to control the distance between the left and right cameras. Our agent is trained based on the guidance of an objective visual comfort assessment metric to learn the optimal sequence of camera movement actions in terms of perceptual aspects in stereoscopic viewing. With extensive experiments and user studies, we show the effectiveness of our VCA-RL model on three different S3D databases.

CVApr 14, 2021
Towards a Better Understanding of VR Sickness: Physical Symptom Prediction for VR Contents

Hak Gu Kim, Sangmin Lee, Seongyeop Kim et al.

We address the black-box issue of VR sickness assessment (VRSA) by evaluating the level of physical symptoms of VR sickness. For the VR contents inducing the similar VR sickness level, the physical symptoms can vary depending on the characteristics of the contents. Most of existing VRSA methods focused on assessing the overall VR sickness score. To make better understanding of VR sickness, it is required to predict and provide the level of major symptoms of VR sickness rather than overall degree of VR sickness. In this paper, we predict the degrees of main physical symptoms affecting the overall degree of VR sickness, which are disorientation, nausea, and oculomotor. In addition, we introduce a new large-scale dataset for VRSA including 360 videos with various frame rates, physiological signals, and subjective scores. On VRSA benchmark and our newly collected dataset, our approach shows a potential to not only achieve the highest correlation with subjective scores, but also to better understand which symptoms are the main causes of VR sickness.

CVApr 2, 2021
Video Prediction Recalling Long-term Motion Context via Memory Alignment Learning

Sangmin Lee, Hak Gu Kim, Dae Hwi Choi et al.

Our work addresses long-term motion context issues for predicting future frames. To predict the future precisely, it is required to capture which long-term motion context (e.g., walking or running) the input motion (e.g., leg movement) belongs to. The bottlenecks arising when dealing with the long-term motion context are: (i) how to predict the long-term motion context naturally matching input sequences with limited dynamics, (ii) how to predict the long-term motion context with high-dimensionality (e.g., complex motion). To address the issues, we propose novel motion context-aware video prediction. To solve the bottleneck (i), we introduce a long-term motion context memory (LMC-Memory) with memory alignment learning. The proposed memory alignment learning enables to store long-term motion contexts into the memory and to match them with sequences including limited dynamics. As a result, the long-term context can be recalled from the limited input sequence. In addition, to resolve the bottleneck (ii), we propose memory query decomposition to store local motion context (i.e., low-dimensional dynamics) and recall the suitable local context for each local part of the input individually. It enables to boost the alignment effects of the memory. Experimental results show that the proposed method outperforms other sophisticated RNN-based methods, especially in long-term condition. Further, we validate the effectiveness of the proposed network designs by conducting ablation studies and memory feature analysis. The source code of this work is available.

CVJul 2, 2019
Generative Guiding Block: Synthesizing Realistic Looking Variants Capable of Even Large Change Demands

Minho Park, Hak Gu Kim, Yong Man Ro

Realistic image synthesis is to generate an image that is perceptually indistinguishable from an actual image. Generating realistic looking images with large variations (e.g., large spatial deformations and large pose change), however, is very challenging. Handing large variations as well as preserving appearance needs to be taken into account in the realistic looking image generation. In this paper, we propose a novel realistic looking image synthesis method, especially in large change demands. To do that, we devise generative guiding blocks. The proposed generative guiding block includes realistic appearance preserving discriminator and naturalistic variation transforming discriminator. By taking the proposed generative guiding blocks into generative model, the latent features at the layer of generative model are enhanced to synthesize both realistic looking- and target variation- image. With qualitative and quantitative evaluation in experiments, we demonstrated the effectiveness of the proposed generative guiding blocks, compared to the state-of-the-arts.

CVMay 23, 2018
ICADx: Interpretable computer aided diagnosis of breast masses

Seong Tae Kim, Hakmin Lee, Hak Gu Kim et al.

In this study, a novel computer aided diagnosis (CADx) framework is devised to investigate interpretability for classifying breast masses. Recently, a deep learning technology has been successfully applied to medical image analysis including CADx. Existing deep learning based CADx approaches, however, have a limitation in explaining the diagnostic decision. In real clinical practice, clinical decisions could be made with reasonable explanation. So current deep learning approaches in CADx are limited in real world deployment. In this paper, we investigate interpretability in CADx with the proposed interpretable CADx (ICADx) framework. The proposed framework is devised with a generative adversarial network, which consists of interpretable diagnosis network and synthetic lesion generative network to learn the relationship between malignancy and a standardized description (BI-RADS). The lesion generative network and the interpretable diagnosis network compete in an adversarial learning so that the two networks are improved. The effectiveness of the proposed method was validated on public mammogram database. Experimental results showed that the proposed ICADx framework could provide the interpretability of mass as well as mass classification. It was mainly attributed to the fact that the proposed method was effectively trained to find the relationship between malignancy and interpretations via the adversarial learning. These results imply that the proposed ICADx framework could be a promising approach to develop the CADx system.

CVApr 23, 2018
STAN: Spatio-Temporal Adversarial Networks for Abnormal Event Detection

Sangmin Lee, Hak Gu Kim, Yong Man Ro

In this paper, we propose a novel abnormal event detection method with spatio-temporal adversarial networks (STAN). We devise a spatio-temporal generator which synthesizes an inter-frame by considering spatio-temporal characteristics with bidirectional ConvLSTM. A proposed spatio-temporal discriminator determines whether an input sequence is real-normal or not with 3D convolutional layers. These two networks are trained in an adversarial way to effectively encode spatio-temporal features of normal patterns. After the learning, the generator and the discriminator can be independently used as detectors, and deviations from the learned normal patterns are detected as abnormalities. Experimental results show that the proposed method achieved competitive performance compared to the state-of-the-art methods. Further, for the interpretation, we visualize the location of abnormal events detected by the proposed networks using a generator loss and discriminator gradients.

CVApr 11, 2018
VR IQA NET: Deep Virtual Reality Image Quality Assessment using Adversarial Learning

Heoun-taek Lim, Hak Gu Kim, Yong Man Ro

In this paper, we propose a novel virtual reality image quality assessment (VR IQA) with adversarial learning for omnidirectional images. To take into account the characteristics of the omnidirectional image, we devise deep networks including novel quality score predictor and human perception guider. The proposed quality score predictor automatically predicts the quality score of distorted image using the latent spatial and position feature. The proposed human perception guider criticizes the predicted quality score of the predictor with the human perceptual score using adversarial learning. For evaluation, we conducted extensive subjective experiments with omnidirectional image dataset. Experimental results show that the proposed VR IQA metric outperforms the 2-D IQA and the state-of-the-arts VR IQA.

CVApr 11, 2018
Measurement of exceptional motion in VR video contents for VR sickness assessment using deep convolutional autoencoder

Hak Gu Kim, Wissam J. Baddar, Heoun-taek Lim et al.

This paper proposes a new objective metric of exceptional motion in VR video contents for VR sickness assessment. In VR environment, VR sickness can be caused by several factors which are mismatched motion, field of view, motion parallax, viewing angle, etc. Similar to motion sickness, VR sickness can induce a lot of physical symptoms such as general discomfort, headache, stomach awareness, nausea, vomiting, fatigue, and disorientation. To address the viewing safety issues in virtual environment, it is of great importance to develop an objective VR sickness assessment method that predicts and analyses the degree of VR sickness induced by the VR content. The proposed method takes into account motion information that is one of the most important factors in determining the overall degree of VR sickness. In this paper, we detect the exceptional motion that is likely to induce VR sickness. Spatio-temporal features of the exceptional motion in the VR video content are encoded using a convolutional autoencoder. For objectively assessing the VR sickness, the level of exceptional motion in VR video content is measured by using the convolutional autoencoder as well. The effectiveness of the proposed method has been successfully evaluated by subjective assessment experiment using simulator sickness questionnaires (SSQ) in VR environment.

CVAug 11, 2017
Iterative Deep Convolutional Encoder-Decoder Network for Medical Image Segmentation

Jung Uk Kim, Hak Gu Kim, Yong Man Ro

In this paper, we propose a novel medical image segmentation using iterative deep learning framework. We have combined an iterative learning approach and an encoder-decoder network to improve segmentation results, which enables to precisely localize the regions of interest (ROIs) including complex shapes or detailed textures of medical images in an iterative manner. The proposed iterative deep convolutional encoder-decoder network consists of two main paths: convolutional encoder path and convolutional decoder path with iterative learning. Experimental results show that the proposed iterative deep learning framework is able to yield excellent medical image segmentation performances for various medical images. The effectiveness of the proposed method has been proved by comparing with other state-of-the-art medical image segmentation methods.

CVAug 10, 2017
Modality-bridge Transfer Learning for Medical Image Classification

Hak Gu Kim, Yeoreum Choi, Yong Man Ro

This paper presents a new approach of transfer learning-based medical image classification to mitigate insufficient labeled data problem in medical domain. Instead of direct transfer learning from source to small number of labeled target data, we propose a modality-bridge transfer learning which employs the bridge database in the same medical imaging acquisition modality as target database. By learning the projection function from source to bridge and from bridge to target, the domain difference between source (e.g., natural images) and target (e.g., X-ray images) can be mitigated. Experimental results show that the proposed method can achieve a high classification performance even for a small number of labeled target medical images, compared to various transfer learning approaches.