Hyung-gun Chi

CV
h-index6
7papers
159citations
Novelty46%
AI Score44

7 Papers

CVMar 28, 2023
Uncovering the Missing Pattern: Unified Framework Towards Trajectory Imputation and Prediction

Yi Xu, Armin Bazarjani, Hyung-gun Chi et al.

Trajectory prediction is a crucial undertaking in understanding entity movement or human behavior from observed sequences. However, current methods often assume that the observed sequences are complete while ignoring the potential for missing values caused by object occlusion, scope limitation, sensor failure, etc. This limitation inevitably hinders the accuracy of trajectory prediction. To address this issue, our paper presents a unified framework, the Graph-based Conditional Variational Recurrent Neural Network (GC-VRNN), which can perform trajectory imputation and prediction simultaneously. Specifically, we introduce a novel Multi-Space Graph Neural Network (MS-GNN) that can extract spatial features from incomplete observations and leverage missing patterns. Additionally, we employ a Conditional VRNN with a specifically designed Temporal Decay (TD) module to capture temporal dependencies and temporal missing patterns in incomplete trajectories. The inclusion of the TD module allows for valuable information to be conveyed through the temporal flow. We also curate and benchmark three practical datasets for the joint problem of trajectory imputation and prediction. Extensive experiments verify the exceptional performance of our proposed method. As far as we know, this is the first work to address the lack of benchmarks and techniques for trajectory imputation and prediction in a unified manner.

CVOct 16, 2023Code
InfoGCN++: Learning Representation by Predicting the Future for Online Human Skeleton-based Action Recognition

Seunggeun Chi, Hyung-gun Chi, Qixing Huang et al.

Skeleton-based action recognition has made significant advancements recently, with models like InfoGCN showcasing remarkable accuracy. However, these models exhibit a key limitation: they necessitate complete action observation prior to classification, which constrains their applicability in real-time situations such as surveillance and robotic systems. To overcome this barrier, we introduce InfoGCN++, an innovative extension of InfoGCN, explicitly developed for online skeleton-based action recognition. InfoGCN++ augments the abilities of the original InfoGCN model by allowing real-time categorization of action types, independent of the observation sequence's length. It transcends conventional approaches by learning from current and anticipated future movements, thereby creating a more thorough representation of the entire sequence. Our approach to prediction is managed as an extrapolation issue, grounded on observed actions. To enable this, InfoGCN++ incorporates Neural Ordinary Differential Equations, a concept that lets it effectively model the continuous evolution of hidden states. Following rigorous evaluations on three skeleton-based action recognition benchmarks, InfoGCN++ demonstrates exceptional performance in online action recognition. It consistently equals or exceeds existing techniques, highlighting its significant potential to reshape the landscape of real-time action recognition applications. Consequently, this work represents a major leap forward from InfoGCN, pushing the limits of what's possible in online, skeleton-based action recognition. The code for InfoGCN++ is publicly available at https://github.com/stnoah1/infogcn2 for further exploration and validation.

CVJul 17, 2024
VisionTrap: Vision-Augmented Trajectory Prediction Guided by Textual Descriptions

Seokha Moon, Hyun Woo, Hongbeen Park et al.

Predicting future trajectories for other road agents is an essential task for autonomous vehicles. Established trajectory prediction methods primarily use agent tracks generated by a detection and tracking system and HD map as inputs. In this work, we propose a novel method that also incorporates visual input from surround-view cameras, allowing the model to utilize visual cues such as human gazes and gestures, road conditions, vehicle turn signals, etc, which are typically hidden from the model in prior methods. Furthermore, we use textual descriptions generated by a Vision-Language Model (VLM) and refined by a Large Language Model (LLM) as supervision during training to guide the model on what to learn from the input data. Despite using these extra inputs, our method achieves a latency of 53 ms, making it feasible for real-time processing, which is significantly faster than that of previous single-agent prediction methods with similar performance. Our experiments show that both the visual inputs and the textual descriptions contribute to improvements in trajectory prediction performance, and our qualitative analysis highlights how the model is able to exploit these additional inputs. Lastly, in this work we create and release the nuScenes-Text dataset, which augments the established nuScenes dataset with rich textual annotations for every scene, demonstrating the positive impact of utilizing VLM on trajectory prediction. Our project page is at https://moonseokha.github.io/VisionTrap/

CVJul 19, 2024
M2D2M: Multi-Motion Generation from Text with Discrete Diffusion Models

Seunggeun Chi, Hyung-gun Chi, Hengbo Ma et al.

We introduce the Multi-Motion Discrete Diffusion Models (M2D2M), a novel approach for human motion generation from textual descriptions of multiple actions, utilizing the strengths of discrete diffusion models. This approach adeptly addresses the challenge of generating multi-motion sequences, ensuring seamless transitions of motions and coherence across a series of actions. The strength of M2D2M lies in its dynamic transition probability within the discrete diffusion model, which adapts transition probabilities based on the proximity between motion tokens, encouraging mixing between different modes. Complemented by a two-phase sampling strategy that includes independent and joint denoising steps, M2D2M effectively generates long-term, smooth, and contextually coherent human motion sequences, utilizing a model trained for single-motion generation. Extensive experiments demonstrate that M2D2M surpasses current state-of-the-art benchmarks for motion generation from text descriptions, showcasing its efficacy in interpreting language semantics and generating dynamic, realistic motions.

CVMay 16
Motion Cues from Image-based Point Tracking for LiDAR Scene Flow Estimation

Youngdong Jang, Gyeongrok Oh, Jong Wook Kim et al.

LiDAR scene flow estimation is essential for autonomous driving, as it provides 3D motion for each point. Self-supervised approaches use static-dynamic classification to mitigate the imbalance between static and dynamic points, deriving targeted supervision. However, existing methods rely on sparse geometric observations for this classification, making them vulnerable to data sparsity and occlusions. The resulting noisy labels provide incorrect motion guidance and degrade scene flow learning. To address this, we introduce TrackCue, a tracking-guided framework for improving dynamic object representation in LiDAR scene flow estimation. In particular, TrackCue repurposes point tracking to obtain dense image-space trajectories anchored to LiDAR points, providing motion cues beyond sparse geometric observations. Furthermore, we present a visually consistent motion compensation strategy that compares the tracked trajectories with ego-induced rigid trajectories in the image plane, effectively isolating true object motion from ego-induced apparent motion. To transfer these isolated motion cues back to the LiDAR domain, we perform visual motion cue lifting, which associates ego-compensated image trajectories with LiDAR points for static-dynamic label refinement. As a result, TrackCue produces more accurate static-dynamic classification and provides more reliable supervision for scene flow learning. Experimental results show that TrackCue significantly improves the precision and F1 score of dynamic labels, leading to performance gains in self-supervised scene flow estimation.

CVFeb 1, 2024
Enhanced fringe-to-phase framework using deep learning

Won-Hoe Kim, Bongjoong Kim, Hyung-Gun Chi et al.

In Fringe Projection Profilometry (FPP), achieving robust and accurate 3D reconstruction with a limited number of fringe patterns remains a challenge in structured light 3D imaging. Conventional methods require a set of fringe images, but using only one or two patterns complicates phase recovery and unwrapping. In this study, we introduce SFNet, a symmetric fusion network that transforms two fringe images into an absolute phase. To enhance output reliability, Our framework predicts refined phases by incorporating information from fringe images of a different frequency than those used as input. This allows us to achieve high accuracy with just two images. Comparative experiments and ablation studies validate the effectiveness of our proposed method. The dataset and code are publicly accessible on our project page https://wonhoe-kim.github.io/SFNet.

CVMar 19, 2025
3D Occupancy Prediction with Low-Resolution Queries via Prototype-aware View Transformation

Gyeongrok Oh, Sungjune Kim, Heeju Ko et al.

The resolution of voxel queries significantly influences the quality of view transformation in camera-based 3D occupancy prediction. However, computational constraints and the practical necessity for real-time deployment require smaller query resolutions, which inevitably leads to an information loss. Therefore, it is essential to encode and preserve rich visual details within limited query sizes while ensuring a comprehensive representation of 3D occupancy. To this end, we introduce ProtoOcc, a novel occupancy network that leverages prototypes of clustered image segments in view transformation to enhance low-resolution context. In particular, the mapping of 2D prototypes onto 3D voxel queries encodes high-level visual geometries and complements the loss of spatial information from reduced query resolutions. Additionally, we design a multi-perspective decoding strategy to efficiently disentangle the densely compressed visual cues into a high-dimensional 3D occupancy scene. Experimental results on both Occ3D and SemanticKITTI benchmarks demonstrate the effectiveness of the proposed method, showing clear improvements over the baselines. More importantly, ProtoOcc achieves competitive performance against the baselines even with 75\% reduced voxel resolution.