Chaehyun Kim

CV
h-index15
6papers
87citations
Novelty53%
AI Score48

6 Papers

CVSep 30, 2024
Towards Open-Vocabulary Semantic Segmentation Without Semantic Labels

Heeseong Shin, Chaehyun Kim, Sunghwan Hong et al.

Large-scale vision-language models like CLIP have demonstrated impressive open-vocabulary capabilities for image-level tasks, excelling in recognizing what objects are present. However, they struggle with pixel-level recognition tasks like semantic segmentation, which additionally require understanding where the objects are located. In this work, we propose a novel method, PixelCLIP, to adapt the CLIP image encoder for pixel-level understanding by guiding the model on where, which is achieved using unlabeled images and masks generated from vision foundation models such as SAM and DINO. To address the challenges of leveraging masks without semantic labels, we devise an online clustering algorithm using learnable class names to acquire general semantic concepts. PixelCLIP shows significant performance improvements over CLIP and competitive results compared to caption-supervised methods in open-vocabulary semantic segmentation. Project page is available at https://cvlab-kaist.github.io/PixelCLIP

CVDec 3, 2025
C3G: Learning Compact 3D Representations with 2K Gaussians

Honggyu An, Jaewoo Jung, Mungyeom Kim et al.

Reconstructing and understanding 3D scenes from unposed sparse views in a feed-forward manner remains as a challenging task in 3D computer vision. Recent approaches use per-pixel 3D Gaussian Splatting for reconstruction, followed by a 2D-to-3D feature lifting stage for scene understanding. However, they generate excessive redundant Gaussians, causing high memory overhead and sub-optimal multi-view feature aggregation, leading to degraded novel view synthesis and scene understanding performance. We propose C3G, a novel feed-forward framework that estimates compact 3D Gaussians only at essential spatial locations, minimizing redundancy while enabling effective feature lifting. We introduce learnable tokens that aggregate multi-view features through self-attention to guide Gaussian generation, ensuring each Gaussian integrates relevant visual features across views. We then exploit the learned attention patterns for Gaussian decoding to efficiently lift features. Extensive experiments on pose-free novel view synthesis, 3D open-vocabulary segmentation, and view-invariant feature aggregation demonstrate our approach's effectiveness. Results show that a compact yet geometrically meaningful representation is sufficient for high-quality scene reconstruction and understanding, achieving superior memory efficiency and feature fidelity compared to existing methods.

CVDec 23, 2025
Repurposing Video Diffusion Transformers for Robust Point Tracking

Soowon Son, Honggyu An, Chaehyun Kim et al.

Point tracking aims to localize corresponding points across video frames, serving as a fundamental task for 4D reconstruction, robotics, and video editing. Existing methods commonly rely on shallow convolutional backbones such as ResNet that process frames independently, lacking temporal coherence and producing unreliable matching costs under challenging conditions. Through systematic analysis, we find that video Diffusion Transformers (DiTs), pre-trained on large-scale real-world videos with spatio-temporal attention, inherently exhibit strong point tracking capability and robustly handle dynamic motions and frequent occlusions. We propose DiTracker, which adapts video DiTs through: (1) query-key attention matching, (2) lightweight LoRA tuning, and (3) cost fusion with a ResNet backbone. Despite training with 8 times smaller batch size, DiTracker achieves state-of-the-art performance on challenging ITTO benchmark and matches or outperforms state-of-the-art models on TAP-Vid benchmarks. Our work validates video DiT features as an effective and efficient foundation for point tracking.

CVApr 8, 2025
D$^2$USt3R: Enhancing 3D Reconstruction for Dynamic Scenes

Jisang Han, Honggyu An, Jaewoo Jung et al.

In this work, we address the task of 3D reconstruction in dynamic scenes, where object motions frequently degrade the quality of previous 3D pointmap regression methods, such as DUSt3R, that are originally designed for static 3D scene reconstruction. Although these methods provide an elegant and powerful solution in static settings, they struggle in the presence of dynamic motions that disrupt alignment based solely on camera poses. To overcome this, we propose $D^2USt3R$ that directly regresses Static-Dynamic Aligned Pointmaps (SDAP) that simultaneiously capture both static and dynamic 3D scene geometry. By explicitly incorporating both spatial and temporal aspects, our approach successfully encapsulates 3D dense correspondence to the proposed pointmaps, enhancing downstream tasks. Extensive experimental evaluations demonstrate that our proposed approach consistently achieves superior 3D reconstruction performance across various datasets featuring complex motions.

CVSep 9, 2025
Visual Representation Alignment for Multimodal Large Language Models

Heeji Yoon, Jaewoo Jung, Junwan Kim et al.

Multimodal large language models (MLLMs) trained with visual instruction tuning have achieved strong performance across diverse tasks, yet they remain limited in vision-centric tasks such as object counting or spatial reasoning. We attribute this gap to the prevailing text-only supervision paradigm, which provides only indirect guidance for the visual pathway and often leads MLLMs to discard fine-grained visual details during training. In this paper, we present VIsual Representation ALignment (VIRAL), a simple yet effective regularization strategy that aligns the internal visual representations of MLLMs with those of pre-trained vision foundation models (VFMs). By explicitly enforcing this alignment, VIRAL enables the model not only to retain critical visual details from the input vision encoder but also to complement additional visual knowledge from VFMs, thereby enhancing its ability to reason over complex visual inputs. Our experiments demonstrate consistent improvements across all tasks on widely adopted multimodal benchmarks. Furthermore, we conduct comprehensive ablation studies to validate the key design choices underlying our framework. We believe this simple finding opens up an important direction for the effective integration of visual information in training MLLMs.

CVSep 22, 2025
Seg4Diff: Unveiling Open-Vocabulary Segmentation in Text-to-Image Diffusion Transformers

Chaehyun Kim, Heeseong Shin, Eunbeen Hong et al.

Text-to-image diffusion models excel at translating language prompts into photorealistic images by implicitly grounding textual concepts through their cross-modal attention mechanisms. Recent multi-modal diffusion transformers extend this by introducing joint self-attention over concatenated image and text tokens, enabling richer and more scalable cross-modal alignment. However, a detailed understanding of how and where these attention maps contribute to image generation remains limited. In this paper, we introduce Seg4Diff (Segmentation for Diffusion), a systematic framework for analyzing the attention structures of MM-DiT, with a focus on how specific layers propagate semantic information from text to image. Through comprehensive analysis, we identify a semantic grounding expert layer, a specific MM-DiT block that consistently aligns text tokens with spatially coherent image regions, naturally producing high-quality semantic segmentation masks. We further demonstrate that applying a lightweight fine-tuning scheme with mask-annotated image data enhances the semantic grouping capabilities of these layers and thereby improves both segmentation performance and generated image fidelity. Our findings demonstrate that semantic grouping is an emergent property of diffusion transformers and can be selectively amplified to advance both segmentation and generation performance, paving the way for unified models that bridge visual perception and generation.