Youngjoon Yu

CV
h-index19
9papers
92citations
Novelty53%
AI Score32

9 Papers

CVAug 22, 2024Code
SPARK: Multi-Vision Sensor Perception and Reasoning Benchmark for Large-scale Vision-Language Models

Youngjoon Yu, Sangyun Chung, Byung-Kwan Lee et al.

Large-scale Vision-Language Models (LVLMs) have significantly advanced with text-aligned vision inputs. They have made remarkable progress in computer vision tasks by aligning text modality with vision inputs. There are also endeavors to incorporate multi-vision sensors beyond RGB, including thermal, depth, and medical X-ray images. However, we observe that current LVLMs view images taken from multi-vision sensors as if they were in the same RGB domain without considering the physical characteristics of multi-vision sensors. They fail to convey the fundamental multi-vision sensor information from the dataset and the corresponding contextual knowledge properly. Consequently, alignment between the information from the actual physical environment and the text is not achieved correctly, making it difficult to answer complex sensor-related questions that consider the physical environment. In this paper, we aim to establish a multi-vision Sensor Perception And Reasoning benchmarK called SPARK that can reduce the fundamental multi-vision sensor information gap between images and multi-vision sensors. We generated 6,248 vision-language test samples to investigate multi-vision sensory perception and multi-vision sensory reasoning on physical sensor knowledge proficiency across different formats, covering different types of sensor-related questions. We utilized these samples to assess ten leading LVLMs. The results showed that most models displayed deficiencies in multi-vision sensory reasoning to varying extents. Codes and data are available at https://github.com/top-yun/SPARK

CVApr 27, 2022
Defending Person Detection Against Adversarial Patch Attack by using Universal Defensive Frame

Youngjoon Yu, Hong Joo Lee, Hakmin Lee et al.

Person detection has attracted great attention in the computer vision area and is an imperative element in human-centric computer vision. Although the predictive performances of person detection networks have been improved dramatically, they are vulnerable to adversarial patch attacks. Changing the pixels in a restricted region can easily fool the person detection network in safety-critical applications such as autonomous driving and security systems. Despite the necessity of countering adversarial patch attacks, very few efforts have been dedicated to defending person detection against adversarial patch attack. In this paper, we propose a novel defense strategy that defends against an adversarial patch attack by optimizing a defensive frame for person detection. The defensive frame alleviates the effect of the adversarial patch while maintaining person detection performance with clean person. The proposed defensive frame in the person detection is generated with a competitive learning algorithm which makes an iterative competition between detection threatening module and detection shielding module in person detection. Comprehensive experimental results demonstrate that the proposed method effectively defends person detection against adversarial patch attacks.

CVJun 27, 2023
Advancing Adversarial Training by Injecting Booster Signal

Hong Joo Lee, Youngjoon Yu, Yong Man Ro

Recent works have demonstrated that deep neural networks (DNNs) are highly vulnerable to adversarial attacks. To defend against adversarial attacks, many defense strategies have been proposed, among which adversarial training has been demonstrated to be the most effective strategy. However, it has been known that adversarial training sometimes hurts natural accuracy. Then, many works focus on optimizing model parameters to handle the problem. Different from the previous approaches, in this paper, we propose a new approach to improve the adversarial robustness by using an external signal rather than model parameters. In the proposed method, a well-optimized universal external signal called a booster signal is injected into the outside of the image which does not overlap with the original content. Then, it boosts both adversarial robustness and natural accuracy. The booster signal is optimized in parallel to model parameters step by step collaboratively. Experimental results show that the booster signal can improve both the natural and robust accuracies over the recent state-of-the-art adversarial training methods. Also, optimizing the booster signal is general and flexible enough to be adopted on any existing adversarial training methods.

LGOct 21, 2022
Meta Input: How to Leverage Off-the-Shelf Deep Neural Networks

Minsu Kim, Youngjoon Yu, Sungjune Park et al.

These days, although deep neural networks (DNNs) have achieved a noticeable progress in a wide range of research area, it lacks the adaptability to be employed in the real-world applications because of the environment discrepancy problem. Such a problem originates from the difference between training and testing environments, and it is widely known that it causes serious performance degradation, when a pretrained DNN model is applied to a new testing environment. Therefore, in this paper, we introduce a novel approach that allows end-users to exploit pretrained DNN models in their own testing environment without modifying the models. To this end, we present a \textit{meta input} which is an additional input transforming the distribution of testing data to be aligned with that of training data. The proposed meta input can be optimized with a small number of testing data only by considering the relation between testing input data and its output prediction. Also, it does not require any knowledge of the network's internal architecture and modification of its weight parameters. Then, the obtained meta input is added to testing data in order to shift the distribution of testing data to that of originally used training data. As a result, end-users can exploit well-trained models in their own testing environment which can differ from the training environment. We validate the effectiveness and versatility of the proposed meta input by showing the robustness against the environment discrepancy through the comprehensive experiments with various tasks.

CVMar 2, 2024
Causal Mode Multiplexer: A Novel Framework for Unbiased Multispectral Pedestrian Detection

Taeheon Kim, Sebin Shin, Youngjoon Yu et al.

RGBT multispectral pedestrian detection has emerged as a promising solution for safety-critical applications that require day/night operations. However, the modality bias problem remains unsolved as multispectral pedestrian detectors learn the statistical bias in datasets. Specifically, datasets in multispectral pedestrian detection mainly distribute between ROTO (day) and RXTO (night) data; the majority of the pedestrian labels statistically co-occur with their thermal features. As a result, multispectral pedestrian detectors show poor generalization ability on examples beyond this statistical correlation, such as ROTX data. To address this problem, we propose a novel Causal Mode Multiplexer (CMM) framework that effectively learns the causalities between multispectral inputs and predictions. Moreover, we construct a new dataset (ROTX-MP) to evaluate modality bias in multispectral pedestrian detection. ROTX-MP mainly includes ROTX examples not presented in previous datasets. Extensive experiments demonstrate that our proposed CMM framework generalizes well on existing datasets (KAIST, CVC-14, FLIR) and the new ROTX-MP. We will release our new dataset to the public for future research.

CVMar 22, 2024
MSCoTDet: Language-driven Multi-modal Fusion for Improved Multispectral Pedestrian Detection

Taeheon Kim, Sangyun Chung, Damin Yeom et al.

Multispectral pedestrian detection is attractive for around-the-clock applications due to the complementary information between RGB and thermal modalities. However, current models often fail to detect pedestrians in certain cases (e.g., thermal-obscured pedestrians), particularly due to the modality bias learned from statistically biased datasets. In this paper, we investigate how to mitigate modality bias in multispectral pedestrian detection using Large Language Models (LLMs). Accordingly, we design a Multispectral Chain-of-Thought (MSCoT) prompting strategy, which prompts the LLM to perform multispectral pedestrian detection. Moreover, we propose a novel Multispectral Chain-of-Thought Detection (MSCoTDet) framework that integrates MSCoT prompting into multispectral pedestrian detection. To this end, we design a Language-driven Multi-modal Fusion (LMF) strategy that enables fusing the outputs of MSCoT prompting with the detection results of vision-based multispectral pedestrian detection models. Extensive experiments validate that MSCoTDet effectively mitigates modality biases and improves multispectral pedestrian detection.

CVNov 27, 2024
Revisiting Misalignment in Multispectral Pedestrian Detection: A Language-Driven Approach for Cross-modal Alignment Fusion

Taeheon Kim, Sangyun Chung, Youngjoon Yu et al.

Multispectral pedestrian detection is a crucial component in various critical applications. However, a significant challenge arises due to the misalignment between these modalities, particularly under real-world conditions where data often appear heavily misaligned. Conventional methods developed on well-aligned or minimally misaligned datasets fail to address these discrepancies adequately. This paper introduces a new framework for multispectral pedestrian detection designed specifically to handle heavily misaligned datasets without the need for costly and complex traditional pre-processing calibration. By leveraging Large-scale Vision-Language Models (LVLM) for cross-modal semantic alignment, our approach seeks to enhance detection accuracy by aligning semantic information across the RGB and thermal domains. This method not only simplifies the operational requirements but also extends the practical usability of multispectral detection technologies in practical applications.

CVDec 30, 2024
Enhanced Vision-Language Models for Diverse Sensor Understanding: Cost-Efficient Optimization and Benchmarking

Sangyun Chung, Youngjoon Yu, Se Yeon Kim et al.

Large-scale Vision-Language Models (VLMs) have achieved notable progress in aligning visual inputs with text. However, their ability to deeply understand the unique physical properties of non-RGB vision sensor images remains limited. In this paper, we revisit and analyze these limitations and introduce a novel, cost-efficient paradigm that significantly advances sensor image understanding-without requiring extensive training data or any modifications to the existing VLM architectures. Specifically, we propose Sensor-Aware Attributes Fine-Tuning (SAFT) with the Diverse Negative Attributes (DNA) optimization, which leverages minimal sensor-specific data to enable robust learning of non-RGB characteristics and overcome RGB-centric biases inherent in current VLMs. In addition, we present VS-TDX-the first comprehensive, public benchmark designed to rigorously evaluate VLMs' sensor-specific understanding across diverse and realistic scenarios. Through extensive experiments on VLMs and various sensor modalities, we validate that our method consistently delivers superior performance and generalization under resource-constrained and architecture-invariant settings. Our approach provides a practical advance towards scalable deployment of VLMs in increasingly sensor-diverse real-world environments.

CVMay 22, 2020
Investigating Vulnerability to Adversarial Examples on Multimodal Data Fusion in Deep Learning

Youngjoon Yu, Hong Joo Lee, Byeong Cheon Kim et al.

The success of multimodal data fusion in deep learning appears to be attributed to the use of complementary in-formation between multiple input data. Compared to their predictive performance, relatively less attention has been devoted to the robustness of multimodal fusion models. In this paper, we investigated whether the current multimodal fusion model utilizes the complementary intelligence to defend against adversarial attacks. We applied gradient based white-box attacks such as FGSM and PGD on MFNet, which is a major multispectral (RGB, Thermal) fusion deep learning model for semantic segmentation. We verified that the multimodal fusion model optimized for better prediction is still vulnerable to adversarial attack, even if only one of the sensors is attacked. Thus, it is hard to say that existing multimodal data fusion models are fully utilizing complementary relationships between multiple modalities in terms of adversarial robustness. We believe that our observations open a new horizon for adversarial attack research on multimodal data fusion.