Inyong Koo

CV
h-index6
6papers
68citations
Novelty56%
AI Score46

6 Papers

CVJul 24, 2023Code
PG-RCNN: Semantic Surface Point Generation for 3D Object Detection

Inyong Koo, Inyoung Lee, Se-Ho Kim et al.

One of the main challenges in LiDAR-based 3D object detection is that the sensors often fail to capture the complete spatial information about the objects due to long distance and occlusion. Two-stage detectors with point cloud completion approaches tackle this problem by adding more points to the regions of interest (RoIs) with a pre-trained network. However, these methods generate dense point clouds of objects for all region proposals, assuming that objects always exist in the RoIs. This leads to the indiscriminate point generation for incorrect proposals as well. Motivated by this, we propose Point Generation R-CNN (PG-RCNN), a novel end-to-end detector that generates semantic surface points of foreground objects for accurate detection. Our method uses a jointly trained RoI point generation module to process the contextual information of RoIs and estimate the complete shape and displacement of foreground objects. For every generated point, PG-RCNN assigns a semantic feature that indicates the estimated foreground probability. Extensive experiments show that the point clouds generated by our method provide geometrically and semantically rich information for refining false positive and misaligned proposals. PG-RCNN achieves competitive performance on the KITTI benchmark, with significantly fewer parameters than state-of-the-art models. The code is available at https://github.com/quotation2520/PG-RCNN.

23.0MMMar 11
Multimodal Self-Attention Network with Temporal Alignment for Audio-Visual Emotion Recognition

Inyong Koo, yeeun Seong, Minseok Son et al.

Audio-visual emotion recognition (AVER) methods typically fuse utterance-level features, and even frame-level attention models seldom address the frame-rate mismatch across modalities. In this paper, we propose a Transformer-based framework focusing on the temporal alignment of multimodal features. Our design employs a multimodal self-attention encoder that simultaneously captures intra- and inter-modal dependencies within a shared feature space. To address heterogeneous sampling rates, we incorporate Temporally-aligned Rotary Position Embeddings (TaRoPE), which implicitly synchronize audio and video tokens. Furthermore, we introduce a Cross-Temporal Matching (CTM) loss that enforces consistency among temporally proximate pairs, guiding the encoder toward better alignment. Experiments on CREMA-D and RAVDESS datasets demonstrate consistent improvements over recent baselines, suggesting that explicitly addressing frame-rate mismatch helps preserve temporal cues and enhances cross-modal fusion.

CVOct 25, 2023
DiffRef3D: A Diffusion-based Proposal Refinement Framework for 3D Object Detection

Se-Ho Kim, Inyong Koo, Inyoung Lee et al.

Denoising diffusion models show remarkable performances in generative tasks, and their potential applications in perception tasks are gaining interest. In this paper, we introduce a novel framework named DiffRef3D which adopts the diffusion process on 3D object detection with point clouds for the first time. Specifically, we formulate the proposal refinement stage of two-stage 3D object detectors as a conditional diffusion process. During training, DiffRef3D gradually adds noise to the residuals between proposals and target objects, then applies the noisy residuals to proposals to generate hypotheses. The refinement module utilizes these hypotheses to denoise the noisy residuals and generate accurate box predictions. In the inference phase, DiffRef3D generates initial hypotheses by sampling noise from a Gaussian distribution as residuals and refines the hypotheses through iterative steps. DiffRef3D is a versatile proposal refinement framework that consistently improves the performance of existing 3D object detection models. We demonstrate the significance of DiffRef3D through extensive experiments on the KITTI benchmark. Code will be available.

CVJan 25, 2022Code
Explore-And-Match: Bridging Proposal-Based and Proposal-Free With Transformer for Sentence Grounding in Videos

Sangmin Woo, Jinyoung Park, Inyong Koo et al.

Natural Language Video Grounding (NLVG) aims to localize time segments in an untrimmed video according to sentence queries. In this work, we present a new paradigm named Explore-And-Match for NLVG that seamlessly unifies the strengths of two streams of NLVG methods: proposal-free and proposal-based; the former explores the search space to find time segments directly, and the latter matches the predefined time segments with ground truths. To achieve this, we formulate NLVG as a set prediction problem and design an end-to-end trainable Language Video Transformer (LVTR) that can enjoy two favorable properties, which are rich contextualization power and parallel decoding. We train LVTR with two losses. First, temporal localization loss allows time segments of all queries to regress targets (explore). Second, set guidance loss couples every query with their respective target (match). To our surprise, we found that training schedule shows divide-and-conquer-like pattern: time segments are first diversified regardless of the target, then coupled with each target, and fine-tuned to the target again. Moreover, LVTR is highly efficient and effective: it infers faster than previous baselines (by 2X or more) and sets competitive results on two NLVG benchmarks (ActivityCaptions and Charades-STA). Codes are available at https://github.com/sangminwoo/Explore-And-Match.

CVDec 20, 2024
Difficulty-aware Balancing Margin Loss for Long-tailed Recognition

Minseok Son, Inyong Koo, Jinyoung Park et al.

When trained with severely imbalanced data, deep neural networks often struggle to accurately recognize classes with only a few samples. Previous studies in long-tailed recognition have attempted to rebalance biased learning using known sample distributions, primarily addressing different classification difficulties at the class level. However, these approaches often overlook the instance difficulty variation within each class. In this paper, we propose a difficulty-aware balancing margin (DBM) loss, which considers both class imbalance and instance difficulty. DBM loss comprises two components: a class-wise margin to mitigate learning bias caused by imbalanced class frequencies, and an instance-wise margin assigned to hard positive samples based on their individual difficulty. DBM loss improves class discriminativity by assigning larger margins to more difficult samples. Our method seamlessly combines with existing approaches and consistently improves performance across various long-tailed recognition benchmarks.

CVMay 25, 2021
Improving Few-shot Learning with Weakly-supervised Object Localization

Inyong Koo, Minki Jeong, Changick Kim

Few-shot learning often involves metric learning-based classifiers, which predict the image label by comparing the distance between the extracted feature vector and class representations. However, applying global pooling in the backend of the feature extractor may not produce an embedding that correctly focuses on the class object. In this work, we propose a novel framework that generates class representations by extracting features from class-relevant regions of the images. Given only a few exemplary images with image-level labels, our framework first localizes the class objects by spatially decomposing the similarity between the images and their class prototypes. Then, enhanced class representations are achieved from the localization results. We also propose a loss function to enhance distinctions of the refined features. Our method outperforms the baseline few-shot model in miniImageNet and tieredImageNet benchmarks.