Jisong Kim

CV
h-index34
10papers
621citations
Novelty54%
AI Score46

10 Papers

CVJul 17, 2023
RCM-Fusion: Radar-Camera Multi-Level Fusion for 3D Object Detection

Jisong Kim, Minjae Seong, Geonho Bang et al.

While LiDAR sensors have been successfully applied to 3D object detection, the affordability of radar and camera sensors has led to a growing interest in fusing radars and cameras for 3D object detection. However, previous radar-camera fusion models were unable to fully utilize the potential of radar information. In this paper, we propose Radar-Camera Multi-level fusion (RCM-Fusion), which attempts to fuse both modalities at both feature and instance levels. For feature-level fusion, we propose a Radar Guided BEV Encoder which transforms camera features into precise BEV representations using the guidance of radar Bird's-Eye-View (BEV) features and combines the radar and camera BEV features. For instance-level fusion, we propose a Radar Grid Point Refinement module that reduces localization error by accounting for the characteristics of the radar point clouds. The experiments conducted on the public nuScenes dataset demonstrate that our proposed RCM-Fusion achieves state-of-the-art performances among single frame-based radar-camera fusion methods in the nuScenes 3D object detection benchmark. Code will be made publicly available.

CVAug 7, 2024
JARViS: Detecting Actions in Video Using Unified Actor-Scene Context Relation Modeling

Seok Hwan Lee, Taein Son, Soo Won Seo et al.

Video action detection (VAD) is a formidable vision task that involves the localization and classification of actions within the spatial and temporal dimensions of a video clip. Among the myriad VAD architectures, two-stage VAD methods utilize a pre-trained person detector to extract the region of interest features, subsequently employing these features for action detection. However, the performance of two-stage VAD methods has been limited as they depend solely on localized actor features to infer action semantics. In this study, we propose a new two-stage VAD framework called Joint Actor-scene context Relation modeling based on Visual Semantics (JARViS), which effectively consolidates cross-modal action semantics distributed globally across spatial and temporal dimensions using Transformer attention. JARViS employs a person detector to produce densely sampled actor features from a keyframe. Concurrently, it uses a video backbone to create spatio-temporal scene features from a video clip. Finally, the fine-grained interactions between actors and scenes are modeled through a Unified Action-Scene Context Transformer to directly output the final set of actions in parallel. Our experimental results demonstrate that JARViS outperforms existing methods by significant margins and achieves state-of-the-art performance on three popular VAD datasets, including AVA, UCF101-24, and JHMDB51-21.

CVMar 8, 2024
RadarDistill: Boosting Radar-based Object Detection Performance via Knowledge Distillation from LiDAR Features

Geonho Bang, Kwangjin Choi, Jisong Kim et al.

The inherent noisy and sparse characteristics of radar data pose challenges in finding effective representations for 3D object detection. In this paper, we propose RadarDistill, a novel knowledge distillation (KD) method, which can improve the representation of radar data by leveraging LiDAR data. RadarDistill successfully transfers desirable characteristics of LiDAR features into radar features using three key components: Cross-Modality Alignment (CMA), Activation-based Feature Distillation (AFD), and Proposal-based Feature Distillation (PFD). CMA enhances the density of radar features by employing multiple layers of dilation operations, effectively addressing the challenge of inefficient knowledge transfer from LiDAR to radar. AFD selectively transfers knowledge based on regions of the LiDAR features, with a specific focus on areas where activation intensity exceeds a predefined threshold. PFD similarly guides the radar network to selectively mimic features from the LiDAR network within the object proposals. Our comparative analyses conducted on the nuScenes datasets demonstrate that RadarDistill achieves state-of-the-art (SOTA) performance for radar-only object detection task, recording 20.5% in mAP and 43.7% in NDS. Also, RadarDistill significantly improves the performance of the camera-radar fusion model.

CVNov 5, 2024
CRT-Fusion: Camera, Radar, Temporal Fusion Using Motion Information for 3D Object Detection

Jisong Kim, Minjae Seong, Jun Won Choi

Accurate and robust 3D object detection is a critical component in autonomous vehicles and robotics. While recent radar-camera fusion methods have made significant progress by fusing information in the bird's-eye view (BEV) representation, they often struggle to effectively capture the motion of dynamic objects, leading to limited performance in real-world scenarios. In this paper, we introduce CRT-Fusion, a novel framework that integrates temporal information into radar-camera fusion to address this challenge. Our approach comprises three key modules: Multi-View Fusion (MVF), Motion Feature Estimator (MFE), and Motion Guided Temporal Fusion (MGTF). The MVF module fuses radar and image features within both the camera view and bird's-eye view, thereby generating a more precise unified BEV representation. The MFE module conducts two simultaneous tasks: estimation of pixel-wise velocity information and BEV segmentation. Based on the velocity and the occupancy score map obtained from the MFE module, the MGTF module aligns and fuses feature maps across multiple timestamps in a recurrent manner. By considering the motion of dynamic objects, CRT-Fusion can produce robust BEV feature maps, thereby improving detection accuracy and robustness. Extensive evaluations on the challenging nuScenes dataset demonstrate that CRT-Fusion achieves state-of-the-art performance for radar-camera-based 3D object detection. Our approach outperforms the previous best method in terms of NDS by +1.7%, while also surpassing the leading approach in mAP by +1.4%. These significant improvements in both metrics showcase the effectiveness of our proposed fusion strategy in enhancing the reliability and accuracy of 3D object detection.

CVDec 29, 2024
MR-Occ: Efficient Camera-LiDAR 3D Semantic Occupancy Prediction Using Hierarchical Multi-Resolution Voxel Representation

Minjae Seong, Jisong Kim, Geonho Bang et al.

Accurate 3D perception is essential for understanding the environment in autonomous driving. Recent advancements in 3D semantic occupancy prediction have leveraged camera-LiDAR fusion to improve robustness and accuracy. However, current methods allocate computational resources uniformly across all voxels, leading to inefficiency, and they also fail to adequately address occlusions, resulting in reduced accuracy in challenging scenarios. We propose MR-Occ, a novel approach for camera-LiDAR fusion-based 3D semantic occupancy prediction, addressing these challenges through three key components: Hierarchical Voxel Feature Refinement (HVFR), Multi-scale Occupancy Decoder (MOD), and Pixel to Voxel Fusion Network (PVF-Net). HVFR improves performance by enhancing features for critical voxels, reducing computational cost. MOD introduces an `occluded' class to better handle regions obscured from sensor view, improving accuracy. PVF-Net leverages densified LiDAR features to effectively fuse camera and LiDAR data through a deformable attention mechanism. Extensive experiments demonstrate that MR-Occ achieves state-of-the-art performance on the nuScenes-Occupancy dataset, surpassing previous approaches by +5.2% in IoU and +5.3% in mIoU while using fewer parameters and FLOPs. Moreover, MR-Occ demonstrates superior performance on the SemanticKITTI dataset, further validating its effectiveness and generalizability across diverse 3D semantic occupancy benchmarks.

CVMar 4, 2024
PillarGen: Enhancing Radar Point Cloud Density and Quality via Pillar-based Point Generation Network

Jisong Kim, Geonho Bang, Kwangjin Choi et al.

In this paper, we present a novel point generation model, referred to as Pillar-based Point Generation Network (PillarGen), which facilitates the transformation of point clouds from one domain into another. PillarGen can produce synthetic point clouds with enhanced density and quality based on the provided input point clouds. The PillarGen model performs the following three steps: 1) pillar encoding, 2) Occupied Pillar Prediction (OPP), and 3) Pillar to Point Generation (PPG). The input point clouds are encoded using a pillar grid structure to generate pillar features. Then, OPP determines the active pillars used for point generation and predicts the center of points and the number of points to be generated for each active pillar. PPG generates the synthetic points for each active pillar based on the information provided by OPP. We evaluate the performance of PillarGen using our proprietary radar dataset, focusing on enhancing the density and quality of short-range radar data using the long-range radar data as supervision. Our experiments demonstrate that PillarGen outperforms traditional point upsampling methods in quantitative and qualitative measures. We also confirm that when PillarGen is incorporated into bird's eye view object detection, a significant improvement in detection accuracy is achieved.

CVDec 18, 2024
JoVALE: Detecting Human Actions in Video Using Audiovisual and Language Contexts

Taein Son, Soo Won Seo, Jisong Kim et al.

Video Action Detection (VAD) entails localizing and categorizing action instances within videos, which inherently consist of diverse information sources such as audio, visual cues, and surrounding scene contexts. Leveraging this multi-modal information effectively for VAD poses a significant challenge, as the model must identify action-relevant cues with precision. In this study, we introduce a novel multi-modal VAD architecture, referred to as the Joint Actor-centric Visual, Audio, Language Encoder (JoVALE). JoVALE is the first VAD method to integrate audio and visual features with scene descriptive context sourced from large-capacity image captioning models. At the heart of JoVALE is the actor-centric aggregation of audio, visual, and scene descriptive information, enabling adaptive integration of crucial features for recognizing each actor's actions. We have developed a Transformer-based architecture, the Actor-centric Multi-modal Fusion Network, specifically designed to capture the dynamic interactions among actors and their multi-modal contexts. Our evaluation on three prominent VAD benchmarks, including AVA, UCF101-24, and JHMDB51-21, demonstrates that incorporating multi-modal information significantly enhances performance, setting new state-of-the-art performances in the field.

CVSep 22, 2025
RCTDistill: Cross-Modal Knowledge Distillation Framework for Radar-Camera 3D Object Detection with Temporal Fusion

Geonho Bang, Minjae Seong, Jisong Kim et al.

Radar-camera fusion methods have emerged as a cost-effective approach for 3D object detection but still lag behind LiDAR-based methods in performance. Recent works have focused on employing temporal fusion and Knowledge Distillation (KD) strategies to overcome these limitations. However, existing approaches have not sufficiently accounted for uncertainties arising from object motion or sensor-specific errors inherent in radar and camera modalities. In this work, we propose RCTDistill, a novel cross-modal KD method based on temporal fusion, comprising three key modules: Range-Azimuth Knowledge Distillation (RAKD), Temporal Knowledge Distillation (TKD), and Region-Decoupled Knowledge Distillation (RDKD). RAKD is designed to consider the inherent errors in the range and azimuth directions, enabling effective knowledge transfer from LiDAR features to refine inaccurate BEV representations. TKD mitigates temporal misalignment caused by dynamic objects by aligning historical radar-camera BEV features with current LiDAR representations. RDKD enhances feature discrimination by distilling relational knowledge from the teacher model, allowing the student to differentiate foreground and background features. RCTDistill achieves state-of-the-art radar-camera fusion performance on both the nuScenes and View-of-Delft (VoD) datasets, with the fastest inference speed of 26.2 FPS.

CVSep 22, 2025
MAESTRO: Task-Relevant Optimization via Adaptive Feature Enhancement and Suppression for Multi-task 3D Perception

Changwon Kang, Jisong Kim, Hongjae Shin et al.

The goal of multi-task learning is to learn to conduct multiple tasks simultaneously based on a shared data representation. While this approach can improve learning efficiency, it may also cause performance degradation due to task conflicts that arise when optimizing the model for different objectives. To address this challenge, we introduce MAESTRO, a structured framework designed to generate task-specific features and mitigate feature interference in multi-task 3D perception, including 3D object detection, bird's-eye view (BEV) map segmentation, and 3D occupancy prediction. MAESTRO comprises three components: the Class-wise Prototype Generator (CPG), the Task-Specific Feature Generator (TSFG), and the Scene Prototype Aggregator (SPA). CPG groups class categories into foreground and background groups and generates group-wise prototypes. The foreground and background prototypes are assigned to the 3D object detection task and the map segmentation task, respectively, while both are assigned to the 3D occupancy prediction task. TSFG leverages these prototype groups to retain task-relevant features while suppressing irrelevant features, thereby enhancing the performance for each task. SPA enhances the prototype groups assigned for 3D occupancy prediction by utilizing the information produced by the 3D object detection head and the map segmentation head. Extensive experiments on the nuScenes and Occ3D benchmarks demonstrate that MAESTRO consistently outperforms existing methods across 3D object detection, BEV map segmentation, and 3D occupancy prediction tasks.

CVApr 27, 2020
3D-CVF: Generating Joint Camera and LiDAR Features Using Cross-View Spatial Feature Fusion for 3D Object Detection

Jin Hyeok Yoo, Yecheol Kim, Jisong Kim et al.

In this paper, we propose a new deep architecture for fusing camera and LiDAR sensors for 3D object detection. Because the camera and LiDAR sensor signals have different characteristics and distributions, fusing these two modalities is expected to improve both the accuracy and robustness of 3D object detection. One of the challenges presented by the fusion of cameras and LiDAR is that the spatial feature maps obtained from each modality are represented by significantly different views in the camera and world coordinates; hence, it is not an easy task to combine two heterogeneous feature maps without loss of information. To address this problem, we propose a method called 3D-CVF that combines the camera and LiDAR features using the cross-view spatial feature fusion strategy. First, the method employs auto-calibrated projection, to transform the 2D camera features to a smooth spatial feature map with the highest correspondence to the LiDAR features in the bird's eye view (BEV) domain. Then, a gated feature fusion network is applied to use the spatial attention maps to mix the camera and LiDAR features appropriately according to the region. Next, camera-LiDAR feature fusion is also achieved in the subsequent proposal refinement stage. The camera feature is used from the 2D camera-view domain via 3D RoI grid pooling and fused with the BEV feature for proposal refinement. Our evaluations, conducted on the KITTI and nuScenes 3D object detection datasets demonstrate that the camera-LiDAR fusion offers significant performance gain over single modality and that the proposed 3D-CVF achieves state-of-the-art performance in the KITTI benchmark.