Jungho Lee

CV
h-index16
30papers
384citations
Novelty55%
AI Score57

30 Papers

CVAug 23, 2022
Hierarchically Decomposed Graph Convolutional Networks for Skeleton-Based Action Recognition

Jungho Lee, Minhyeok Lee, Dogyoon Lee et al.

Graph convolutional networks (GCNs) are the most commonly used methods for skeleton-based action recognition and have achieved remarkable performance. Generating adjacency matrices with semantically meaningful edges is particularly important for this task, but extracting such edges is challenging problem. To solve this, we propose a hierarchically decomposed graph convolutional network (HD-GCN) architecture with a novel hierarchically decomposed graph (HD-Graph). The proposed HD-GCN effectively decomposes every joint node into several sets to extract major structurally adjacent and distant edges, and uses them to construct an HD-Graph containing those edges in the same semantic spaces of a human skeleton. In addition, we introduce an attention-guided hierarchy aggregation (A-HA) module to highlight the dominant hierarchical edge sets of the HD-Graph. Furthermore, we apply a new six-way ensemble method, which uses only joint and bone stream without any motion stream. The proposed model is evaluated and achieves state-of-the-art performance on four large, popular datasets. Finally, we demonstrate the effectiveness of our model with various comparative experiments.

CVDec 9, 2022
Leveraging Spatio-Temporal Dependency for Skeleton-Based Action Recognition

Jungho Lee, Minhyeok Lee, Suhwan Cho et al.

Skeleton-based action recognition has attracted considerable attention due to its compact representation of the human body's skeletal sructure. Many recent methods have achieved remarkable performance using graph convolutional networks (GCNs) and convolutional neural networks (CNNs), which extract spatial and temporal features, respectively. Although spatial and temporal dependencies in the human skeleton have been explored separately, spatio-temporal dependency is rarely considered. In this paper, we propose the Spatio-Temporal Curve Network (STC-Net) to effectively leverage the spatio-temporal dependency of the human skeleton. Our proposed network consists of two novel elements: 1) The Spatio-Temporal Curve (STC) module; and 2) Dilated Kernels for Graph Convolution (DK-GC). The STC module dynamically adjusts the receptive field by identifying meaningful node connections between every adjacent frame and generating spatio-temporal curves based on the identified node connections, providing an adaptive spatio-temporal coverage. In addition, we propose DK-GC to consider long-range dependencies, which results in a large receptive field without any additional parameters by applying an extended kernel to the given adjacency matrices of the graph. Our STC-Net combines these two modules and achieves state-of-the-art performance on four skeleton-based action recognition benchmarks.

CVMar 15, 2023
Guided Slot Attention for Unsupervised Video Object Segmentation

Minhyeok Lee, Suhwan Cho, Dogyoon Lee et al.

Unsupervised video object segmentation aims to segment the most prominent object in a video sequence. However, the existence of complex backgrounds and multiple foreground objects make this task challenging. To address this issue, we propose a guided slot attention network to reinforce spatial structural information and obtain better foreground--background separation. The foreground and background slots, which are initialized with query guidance, are iteratively refined based on interactions with template information. Furthermore, to improve slot--template interaction and effectively fuse global and local features in the target and reference frames, K-nearest neighbors filtering and a feature aggregation transformer are introduced. The proposed model achieves state-of-the-art performance on two popular datasets. Additionally, we demonstrate the robustness of the proposed model in challenging scenes through various comparative experiments.

CVNov 22, 2022
Boundary-aware Camouflaged Object Detection via Deformable Point Sampling

Minhyeok Lee, Suhwan Cho, Chaewon Park et al.

The camouflaged object detection (COD) task aims to identify and segment objects that blend into the background due to their similar color or texture. Despite the inherent difficulties of the task, COD has gained considerable attention in several fields, such as medicine, life-saving, and anti-military fields. In this paper, we propose a novel solution called the Deformable Point Sampling network (DPS-Net) to address the challenges associated with COD. The proposed DPS-Net utilizes a Deformable Point Sampling transformer (DPS transformer) that can effectively capture sparse local boundary information of significant object boundaries in COD using a deformable point sampling method. Moreover, the DPS transformer demonstrates robust COD performance by extracting contextual features for target object localization through integrating rough global positional information of objects with boundary local information. We evaluate our method on three prominent datasets and achieve state-of-the-art performance. Our results demonstrate the effectiveness of the proposed method through comparative experiments.

CVAug 21, 2024
Video Diffusion Models are Strong Video Inpainter

Minhyeok Lee, Suhwan Cho, Chajin Shin et al.

Propagation-based video inpainting using optical flow at the pixel or feature level has recently garnered significant attention. However, it has limitations such as the inaccuracy of optical flow prediction and the propagation of noise over time. These issues result in non-uniform noise and time consistency problems throughout the video, which are particularly pronounced when the removed area is large and involves substantial movement. To address these issues, we propose a novel First Frame Filling Video Diffusion Inpainting model (FFF-VDI). We design FFF-VDI inspired by the capabilities of pre-trained image-to-video diffusion models that can transform the first frame image into a highly natural video. To apply this to the video inpainting task, we propagate the noise latent information of future frames to fill the masked areas of the first frame's noise latent code. Next, we fine-tune the pre-trained image-to-video diffusion model to generate the inpainted video. The proposed model addresses the limitations of existing methods that rely on optical flow quality, producing much more natural and temporally consistent videos. This proposed approach is the first to effectively integrate image-to-video diffusion models into video inpainting tasks. Through various comparative experiments, we demonstrate that the proposed model can robustly handle diverse inpainting types with high quality.

CVSep 26, 2023
Treating Motion as Option with Output Selection for Unsupervised Video Object Segmentation

Suhwan Cho, Minhyeok Lee, Jungho Lee et al.

Unsupervised video object segmentation aims to detect the most salient object in a video without any external guidance regarding the object. Salient objects often exhibit distinctive movements compared to the background, and recent methods leverage this by combining motion cues from optical flow maps with appearance cues from RGB images. However, because optical flow maps are often closely correlated with segmentation masks, networks can become overly dependent on motion cues during training, leading to vulnerability when faced with confusing motion cues and resulting in unstable predictions. To address this challenge, we propose a novel motion-as-option network that treats motion cues as an optional component rather than a necessity. During training, we randomly input RGB images into the motion encoder instead of optical flow maps, which implicitly reduces the network's reliance on motion cues. This design ensures that the motion encoder is capable of processing both RGB images and optical flow maps, leading to two distinct predictions depending on the type of input provided. To make the most of this flexibility, we introduce an adaptive output selection algorithm that determines the optimal prediction during testing.

48.4LGMay 24
Learning Treatment Effects during Resource Allocation via Priority-Queue Randomization

JungHo Lee, Johnna Sundberg, Pim Welle et al.

Public service programs often allocate limited resources under uncertainty about their benefits, creating a need for randomization to support credible evaluation. In practice, however, applicants commonly enter waitlists where resources are prioritized toward individuals judged to have higher need through tiered priority queues, making direct randomization difficult. Motivated by this, we develop an experimental design framework for learning treatment effects while treating those most in need where incoming applicants are randomized into priority queues based on their assessed risk scores. Treatments are then provided across queues in priority order and first-in-first-out within queue as budget becomes available. Our contributions are two-fold. First, we characterize what causal effects are identified under this priority-queue allocation. When arrivals are exogenous, treatments are conditionally randomized, and hence standard estimands are identified; when arrivals are endogenous, queue randomization instead provides an instrument for treatment, identifying local treatment effects induced by the queuing process. Second, we develop optimized queue-assignment designs that trade off statistical efficiency against prioritizing higher-need applicants. We show in the process that, despite dependence in treatment assignments induced by the design, usual iid efficiency bounds remain well-justified design objectives. We illustrate the proposed designs using data from a housing allocation program in a large U.S. county.

CVJul 9, 2024
Sparse-DeRF: Deblurred Neural Radiance Fields from Sparse View

Dogyoon Lee, Donghyeong Kim, Jungho Lee et al.

Recent studies construct deblurred neural radiance fields~(DeRF) using dozens of blurry images, which are not practical scenarios if only a limited number of blurry images are available. This paper focuses on constructing DeRF from sparse-view for more pragmatic real-world scenarios. As observed in our experiments, establishing DeRF from sparse views proves to be a more challenging problem due to the inherent complexity arising from the simultaneous optimization of blur kernels and NeRF from sparse view. Sparse-DeRF successfully regularizes the complicated joint optimization, presenting alleviated overfitting artifacts and enhanced quality on radiance fields. The regularization consists of three key components: Surface smoothness, helps the model accurately predict the scene structure utilizing unseen and additional hidden rays derived from the blur kernel based on statistical tendencies of real-world; Modulated gradient scaling, helps the model adjust the amount of the backpropagated gradient according to the arrangements of scene objects; Perceptual distillation improves the perceptual quality by overcoming the ill-posed multi-view inconsistency of image deblurring and distilling the pre-deblurred information, compensating for the lack of clean information in blurry images. We demonstrate the effectiveness of the Sparse-DeRF with extensive quantitative and qualitative experimental results by training DeRF from 2-view, 4-view, and 6-view blurry images.

CVJul 4, 2024
CRiM-GS: Continuous Rigid Motion-Aware Gaussian Splatting from Motion-Blurred Images

Jungho Lee, Donghyeong Kim, Dogyoon Lee et al.

3D Gaussian Splatting (3DGS) has gained significant attention for their high-quality novel view rendering, motivating research to address real-world challenges. A critical issue is the camera motion blur caused by movement during exposure, which hinders accurate 3D scene reconstruction. In this study, we propose CRiM-GS, a \textbf{C}ontinuous \textbf{Ri}gid \textbf{M}otion-aware \textbf{G}aussian \textbf{S}platting that reconstructs precise 3D scenes from motion-blurred images while maintaining real-time rendering speed. Considering the complex motion patterns inherent in real-world camera movements, we predict continuous camera trajectories using neural ordinary differential equations (ODE). To ensure accurate modeling, we employ rigid body transformations with proper regularization, preserving object shape and size. Additionally, we introduce an adaptive distortion-aware transformation to compensate for potential nonlinear distortions, such as rolling shutter effects, and unpredictable camera movements. By revisiting fundamental camera theory and leveraging advanced neural training techniques, we achieve precise modeling of continuous camera trajectories. Extensive experiments demonstrate state-of-the-art performance both quantitatively and qualitatively on benchmark datasets.

CVJul 16, 2024
Improving Unsupervised Video Object Segmentation via Fake Flow Generation

Suhwan Cho, Minhyeok Lee, Jungho Lee et al.

Unsupervised video object segmentation (VOS), also known as video salient object detection, aims to detect the most prominent object in a video at the pixel level. Recently, two-stream approaches that leverage both RGB images and optical flow maps have gained significant attention. However, the limited amount of training data remains a substantial challenge. In this study, we propose a novel data generation method that simulates fake optical flows from single images, thereby creating large-scale training data for stable network learning. Inspired by the observation that optical flow maps are highly dependent on depth maps, we generate fake optical flows by refining and augmenting the estimated depth maps of each image. By incorporating our simulated image-flow pairs, we achieve new state-of-the-art performance on all public benchmark datasets without relying on complex modules. We believe that our data generation method represents a potential breakthrough for future VOS research.

25.0CVApr 16
CMTM: Cross-Modal Token Modulation for Unsupervised Video Object Segmentation

Inseok Jeon, Suhwan Cho, Minhyeok Lee et al.

Recent advances in unsupervised video object segmentation have highlighted the potential of two-stream architectures that integrate appearance and motion cues. However, fully leveraging these complementary sources of information requires effectively modeling their interdependencies. In this paper, we introduce cross-modality token modulation, a novel approach designed to strengthen the interaction between appearance and motion cues. Our method establishes dense connections between tokens from each modality, enabling efficient intra-modal and inter-modal information propagation through relation transformer blocks. To improve learning efficiency, we incorporate a token masking strategy that addresses the limitations of relying solely on increased model complexity. Our approach achieves state-of-the-art performance across all public benchmarks, outperforming existing methods.

25.2CVMar 23
Revisiting Weakly-Supervised Video Scene Graph Generation via Pair Affinity Learning

Minseok Kang, Minhyeok Lee, Minjung Kim et al.

Weakly-supervised video scene graph generation (WS-VSGG) aims to parse video content into structured relational triplets without bounding box annotations and with only sparse temporal labeling, significantly reducing annotation costs. Without ground-truth bounding boxes, these methods rely on off-the-shelf detectors to generate object proposals, yet largely overlook a fundamental discrepancy from fullysupervised pipelines. Fully-supervised detectors implicitly filter out noninteractive objects, while off-the-shelf detectors indiscriminately detect all visible objects, overwhelming relation models with noisy pairs.We address this by introducing a learnable pair affinity that estimates the likelihood of interaction between subject-object pairs. Through Pair Affinity Learning and Scoring (PALS), pair affinity is incorporated into inferencetime ranking and further integrated into contextual reasoning through Pair Affinity Modulation (PAM), enabling the model to suppress noninteractive pairs and focus on relationally meaningful ones. To provide cleaner supervision for pair affinity learning, we further propose Relation- Aware Matching (RAM), which leverages vision-language grounding to resolve class-level ambiguity in pseudo-label generation. Extensive experiments on Action Genome demonstrate that our approach consistently yields substantial improvements across different baselines and backbones, achieving state-of-the-art WS-VSGG performance.

CVNov 11, 2025
MonoCLUE : Object-Aware Clustering Enhances Monocular 3D Object Detection

Sunghun Yang, Minhyeok Lee, Jungho Lee et al.

Monocular 3D object detection offers a cost-effective solution for autonomous driving but suffers from ill-posed depth and limited field of view. These constraints cause a lack of geometric cues and reduced accuracy in occluded or truncated scenes. While recent approaches incorporate additional depth information to address geometric ambiguity, they overlook the visual cues crucial for robust recognition. We propose MonoCLUE, which enhances monocular 3D detection by leveraging both local clustering and generalized scene memory of visual features. First, we perform K-means clustering on visual features to capture distinct object-level appearance parts (e.g., bonnet, car roof), improving detection of partially visible objects. The clustered features are propagated across regions to capture objects with similar appearances. Second, we construct a generalized scene memory by aggregating clustered features across images, providing consistent representations that generalize across scenes. This improves object-level feature consistency, enabling stable detection across varying environments. Lastly, we integrate both local cluster features and generalized scene memory into object queries, guiding attention toward informative regions. Exploiting a unified local clustering and generalized scene memory strategy, MonoCLUE enables robust monocular 3D detection under occlusion and limited visibility, achieving state-of-the-art performance on the KITTI benchmark.

CVNov 29, 2023
Synchronizing Vision and Language: Bidirectional Token-Masking AutoEncoder for Referring Image Segmentation

Minhyeok Lee, Dogyoon Lee, Jungho Lee et al.

Referring Image Segmentation (RIS) aims to segment target objects expressed in natural language within a scene at the pixel level. Various recent RIS models have achieved state-of-the-art performance by generating contextual tokens to model multimodal features from pretrained encoders and effectively fusing them using transformer-based cross-modal attention. While these methods match language features with image features to effectively identify likely target objects, they often struggle to correctly understand contextual information in complex and ambiguous sentences and scenes. To address this issue, we propose a novel bidirectional token-masking autoencoder (BTMAE) inspired by the masked autoencoder (MAE). The proposed model learns the context of image-to-language and language-to-image by reconstructing missing features in both image and language features at the token level. In other words, this approach involves mutually complementing across the features of images and language, with a focus on enabling the network to understand interconnected deep contextual information between the two modalities. This learning method enhances the robustness of RIS performance in complex sentences and scenes. Our BTMAE achieves state-of-the-art performance on three popular datasets, and we demonstrate the effectiveness of the proposed method through various ablation studies.

47.1CVMar 26
MoRGS: Efficient Per-Gaussian Motion Reasoning for Streamable Dynamic 3D Scenes

Wonjoon Lee, Sungmin Woo, Donghyeong Kim et al.

Online reconstruction of dynamic scenes aims to learn from streaming multi-view inputs under low-latency constraints. The fast training and real-time rendering capabilities of 3D Gaussian Splatting have made on-the-fly reconstruction practically feasible, enabling online 4D reconstruction. However, existing online approaches, despite their efficiency and visual quality, fail to learn per-Gaussian motion that reflects true scene dynamics. Without explicit motion cues, appearance and motion are optimized solely under photometric loss, causing per-Gaussian motion to chase pixel residuals rather than true 3D motion. To address this, we propose MoRGS, an efficient online per-Gaussian motion reasoning framework that explicitly models per-Gaussian motion to improve 4D reconstruction quality. Specifically, we leverage optical flow on a sparse set of key views as lightweight motion cues that regularize per-Gaussian motion beyond photometric supervision. To compensate for the sparsity of flow supervision, we learn a per-Gaussian motion offset field that reconciles discrepancies between projected 3D motion and observed flow across views and time. In addition, we introduce a per-Gaussian motion confidence that separates dynamic from static Gaussians and weights Gaussian attribute residual updates, thereby suppressing redundant motion in static regions for better temporal consistency and accelerating the modeling of large motions. Extensive experiments demonstrate that MoRGS achieves state-of-the-art reconstruction quality and motion fidelity among online methods, while maintaining streamable performance.

73.2CVMay 12
OTT-Vid: Optimal Transport Temporal Token Compression for Video Large Language Models

Minseok Kang, Minhyeok Lee, Jungho Lee et al.

As Video Large Language Models (Video-LLMs) scale to longer and more complex videos, their inference cost grows rapidly due to the large volume of visual tokens accumulated across frames. Training-free token compression has emerged as a practical solution to this bottleneck. However, existing temporal compression methods rely primarily on cross-frame token similarity or segmentation heuristics, overlooking each token's semantic role within its frame and failing to adapt compression strength to the compressibility of each frame pair. In this work, we propose OTT-Vid, a transport-derived allocation framework for temporal token compression. Our approach consists of two stages: spatial pruning identifies representative content within each frame, and optimal transport (OT) is then solved between neighboring frames to estimate temporal compressibility. We formulate this OT with non-uniform token mass, which protects semantically important tokens from aggressive compression, and a locality-aware cost that captures both feature and spatial disparities. The resulting transport plan jointly balances token importance and matching cost, while its total cost defines the transport difficulty of each frame pair, which we use to allocate compression budgets dynamically. Experiments on six benchmarks spanning video question answering and temporal grounding show that OTT-Vid preserves 95.8% of VQA and 73.9% of VTG performance while retaining only 10% of tokens, consistently outperforming existing state-of-the-art training-free compression methods.

CVOct 18, 2023
Mesh Represented Recycle Learning for 3D Hand Pose and Mesh Estimation

Bosang Kim, Jonghyun Kim, Hyotae Lee et al.

In general, hand pose estimation aims to improve the robustness of model performance in the real-world scenes. However, it is difficult to enhance the robustness since existing datasets are obtained in restricted environments to annotate 3D information. Although neural networks quantitatively achieve a high estimation accuracy, unsatisfied results can be observed in visual quality. This discrepancy between quantitative results and their visual qualities remains an open issue in the hand pose representation. To this end, we propose a mesh represented recycle learning strategy for 3D hand pose and mesh estimation which reinforces synthesized hand mesh representation in a training phase. To be specific, a hand pose and mesh estimation model first predicts parametric 3D hand annotations (i.e., 3D keypoint positions and vertices for hand mesh) with real-world hand images in the training phase. Second, synthetic hand images are generated with self-estimated hand mesh representations. After that, the synthetic hand images are fed into the same model again. Thus, the proposed learning strategy simultaneously improves quantitative results and visual qualities by reinforcing synthetic mesh representation. To encourage consistency between original model output and its recycled one, we propose self-correlation loss which maximizes the accuracy and reliability of our learning strategy. Consequently, the model effectively conducts self-refinement on hand pose estimation by learning mesh representation from its own output. To demonstrate the effectiveness of our learning strategy, we provide extensive experiments on FreiHAND dataset. Notably, our learning strategy improves the performance on hand pose and mesh estimation without any extra computational burden during the inference.

MLOct 27, 2023
Black-Box Optimization with Implicit Constraints for Public Policy

Wenqian Xing, JungHo Lee, Chong Liu et al.

Black-box optimization (BBO) has become increasingly relevant for tackling complex decision-making problems, especially in public policy domains such as police redistricting. However, its broader application in public policymaking is hindered by the complexity of defining feasible regions and the high-dimensionality of decisions. This paper introduces a novel BBO framework, termed as the Conditional And Generative Black-box Optimization (CageBO). This approach leverages a conditional variational autoencoder to learn the distribution of feasible decisions, enabling a two-way mapping between the original decision space and a simplified, constraint-free latent space. The CageBO efficiently handles the implicit constraints often found in public policy applications, allowing for optimization in the latent space while evaluating objectives in the original space. We validate our method through a case study on large-scale police redistricting problems in Atlanta, Georgia. Our results reveal that our CageBO offers notable improvements in performance and efficiency compared to the baselines.

CVMar 12, 2024
SMURF: Continuous Dynamics for Motion-Deblurring Radiance Fields

Jungho Lee, Dogyoon Lee, Minhyeok Lee et al.

Neural radiance fields (NeRF) has attracted considerable attention for their exceptional ability in synthesizing novel views with high fidelity. However, the presence of motion blur, resulting from slight camera movements during extended shutter exposures, poses a significant challenge, potentially compromising the quality of the reconstructed 3D scenes. To effectively handle this issue, we propose sequential motion understanding radiance fields (SMURF), a novel approach that models continuous camera motion and leverages the explicit volumetric representation method for robustness to motion-blurred input images. The core idea of the SMURF is continuous motion blurring kernel (CMBK), a module designed to model a continuous camera movements for processing blurry inputs. Our model is evaluated against benchmark datasets and demonstrates state-of-the-art performance both quantitatively and qualitatively.

CVNov 22, 2024
Effective SAM Combination for Open-Vocabulary Semantic Segmentation

Minhyeok Lee, Suhwan Cho, Jungho Lee et al.

Open-vocabulary semantic segmentation aims to assign pixel-level labels to images across an unlimited range of classes. Traditional methods address this by sequentially connecting a powerful mask proposal generator, such as the Segment Anything Model (SAM), with a pre-trained vision-language model like CLIP. But these two-stage approaches often suffer from high computational costs, memory inefficiencies. In this paper, we propose ESC-Net, a novel one-stage open-vocabulary segmentation model that leverages the SAM decoder blocks for class-agnostic segmentation within an efficient inference framework. By embedding pseudo prompts generated from image-text correlations into SAM's promptable segmentation framework, ESC-Net achieves refined spatial aggregation for accurate mask predictions. ESC-Net achieves superior performance on standard benchmarks, including ADE20K, PASCAL-VOC, and PASCAL-Context, outperforming prior methods in both efficiency and accuracy. Comprehensive ablation studies further demonstrate its robustness across challenging conditions.

CVMar 5, 2025
Find First, Track Next: Decoupling Identification and Propagation in Referring Video Object Segmentation

Suhwan Cho, Seunghoon Lee, Minhyeok Lee et al.

Referring video object segmentation aims to segment and track a target object in a video using a natural language prompt. Existing methods typically fuse visual and textual features in a highly entangled manner, processing multi-modal information together to generate per-frame masks. However, this approach often struggles with ambiguous target identification, particularly in scenes with multiple similar objects, and fails to ensure consistent mask propagation across frames. To address these limitations, we introduce FindTrack, an efficient decoupled framework that separates target identification from mask propagation. FindTrack first adaptively selects a key frame by balancing segmentation confidence and vision-text alignment, establishing a robust reference for the target object. This reference is then utilized by a dedicated propagation module to track and segment the object across the entire video. By decoupling these processes, FindTrack effectively reduces ambiguities in target association and enhances segmentation consistency. FindTrack significantly outperforms all existing methods on public benchmarks, demonstrating its superiority.

CVApr 21, 2025
GenCLIP: Generalizing CLIP Prompts for Zero-shot Anomaly Detection

Donghyeong Kim, Chaewon Park, Suhwan Cho et al.

Zero-shot anomaly detection (ZSAD) aims to identify anomalies in unseen categories by leveraging CLIP's zero-shot capabilities to match text prompts with visual features. A key challenge in ZSAD is learning general prompts stably and utilizing them effectively, while maintaining both generalizability and category specificity. Although general prompts have been explored in prior works, achieving their stable optimization and effective deployment remains a significant challenge. In this work, we propose GenCLIP, a novel framework that learns and leverages general prompts more effectively through multi-layer prompting and dual-branch inference. Multi-layer prompting integrates category-specific visual cues from different CLIP layers, enriching general prompts with more comprehensive and robust feature representations. By combining general prompts with multi-layer visual features, our method further enhances its generalization capability. To balance specificity and generalization, we introduce a dual-branch inference strategy, where a vision-enhanced branch captures fine-grained category-specific features, while a query-only branch prioritizes generalization. The complementary outputs from both branches improve the stability and reliability of anomaly detection across unseen categories. Additionally, we propose an adaptive text prompt filtering mechanism, which removes irrelevant or atypical class names not encountered during CLIP's training, ensuring that only meaningful textual inputs contribute to the final vision-language alignment.

CVDec 20, 2024
CoCoGaussian: Leveraging Circle of Confusion for Gaussian Splatting from Defocused Images

Jungho Lee, Suhwan Cho, Taeoh Kim et al.

3D Gaussian Splatting (3DGS) has attracted significant attention for its high-quality novel view rendering, inspiring research to address real-world challenges. While conventional methods depend on sharp images for accurate scene reconstruction, real-world scenarios are often affected by defocus blur due to finite depth of field, making it essential to account for realistic 3D scene representation. In this study, we propose CoCoGaussian, a Circle of Confusion-aware Gaussian Splatting that enables precise 3D scene representation using only defocused images. CoCoGaussian addresses the challenge of defocus blur by modeling the Circle of Confusion (CoC) through a physically grounded approach based on the principles of photographic defocus. Exploiting 3D Gaussians, we compute the CoC diameter from depth and learnable aperture information, generating multiple Gaussians to precisely capture the CoC shape. Furthermore, we introduce a learnable scaling factor to enhance robustness and provide more flexibility in handling unreliable depth in scenes with reflective or refractive surfaces. Experiments on both synthetic and real-world datasets demonstrate that CoCoGaussian achieves state-of-the-art performance across multiple benchmarks.

CVDec 2, 2024
STATIC : Surface Temporal Affine for TIme Consistency in Video Monocular Depth Estimation

Sunghun Yang, Minhyeok Lee, Suhwan Cho et al.

Video monocular depth estimation is essential for applications such as autonomous driving, AR/VR, and robotics. Recent transformer-based single-image monocular depth estimation models perform well on single images but struggle with depth consistency across video frames. Traditional methods aim to improve temporal consistency using multi-frame temporal modules or prior information like optical flow and camera parameters. However, these approaches face issues such as high memory use, reduced performance with dynamic or irregular motion, and limited motion understanding. We propose STATIC, a novel model that independently learns temporal consistency in static and dynamic area without additional information. A difference mask from surface normals identifies static and dynamic area by measuring directional variance. For static area, the Masked Static (MS) module enhances temporal consistency by focusing on stable regions. For dynamic area, the Surface Normal Similarity (SNS) module aligns areas and enhances temporal consistency by measuring feature similarity between frames. A final refinement integrates the independently learned static and dynamic area, enabling STATIC to achieve temporal consistency across the entire sequence. Our method achieves state-of-the-art video depth estimation on the KITTI and NYUv2 datasets without additional information.

CVNov 23, 2025
SwiftVGGT: A Scalable Visual Geometry Grounded Transformer for Large-Scale Scenes

Jungho Lee, Minhyeok Lee, Sunghun Yang et al.

3D reconstruction in large-scale scenes is a fundamental task in 3D perception, but the inherent trade-off between accuracy and computational efficiency remains a significant challenge. Existing methods either prioritize speed and produce low-quality results, or achieve high-quality reconstruction at the cost of slow inference times. In this paper, we propose SwiftVGGT, a training-free method that significantly reduce inference time while preserving high-quality dense 3D reconstruction. To maintain global consistency in large-scale scenes, SwiftVGGT performs loop closure without relying on the external Visual Place Recognition (VPR) model. This removes redundant computation and enables accurate reconstruction over kilometer-scale environments. Furthermore, we propose a simple yet effective point sampling method to align neighboring chunks using a single Sim(3)-based Singular Value Decomposition (SVD) step. This eliminates the need for the Iteratively Reweighted Least Squares (IRLS) optimization commonly used in prior work, leading to substantial speed-ups. We evaluate SwiftVGGT on multiple datasets and show that it achieves state-of-the-art reconstruction quality while requiring only 33% of the inference time of recent VGGT-based large-scale reconstruction approaches.

CVJul 26, 2025
DepthFlow: Exploiting Depth-Flow Structural Correlations for Unsupervised Video Object Segmentation

Suhwan Cho, Minhyeok Lee, Jungho Lee et al.

Unsupervised video object segmentation (VOS) aims to detect the most prominent object in a video. Recently, two-stream approaches that leverage both RGB images and optical flow have gained significant attention, but their performance is fundamentally constrained by the scarcity of training data. To address this, we propose DepthFlow, a novel data generation method that synthesizes optical flow from single images. Our approach is driven by the key insight that VOS models depend more on structural information embedded in flow maps than on their geometric accuracy, and that this structure is highly correlated with depth. We first estimate a depth map from a source image and then convert it into a synthetic flow field that preserves essential structural cues. This process enables the transformation of large-scale image-mask pairs into image-flow-mask training pairs, dramatically expanding the data available for network training. By training a simple encoder-decoder architecture with our synthesized data, we achieve new state-of-the-art performance on all public VOS benchmarks, demonstrating a scalable and effective solution to the data scarcity problem.

CVJul 26, 2025
TransFlow: Motion Knowledge Transfer from Video Diffusion Models to Video Salient Object Detection

Suhwan Cho, Minhyeok Lee, Jungho Lee et al.

Video salient object detection (SOD) relies on motion cues to distinguish salient objects from backgrounds, but training such models is limited by scarce video datasets compared to abundant image datasets. Existing approaches that use spatial transformations to create video sequences from static images fail for motion-guided tasks, as these transformations produce unrealistic optical flows that lack semantic understanding of motion. We present TransFlow, which transfers motion knowledge from pre-trained video diffusion models to generate realistic training data for video SOD. Video diffusion models have learned rich semantic motion priors from large-scale video data, understanding how different objects naturally move in real scenes. TransFlow leverages this knowledge to generate semantically-aware optical flows from static images, where objects exhibit natural motion patterns while preserving spatial boundaries and temporal coherence. Our method achieves improved performance across multiple benchmarks, demonstrating effective motion knowledge transfer.

CVMar 7, 2025
CoMoGaussian: Continuous Motion-Aware Gaussian Splatting from Motion-Blurred Images

Jungho Lee, Donghyeong Kim, Dogyoon Lee et al.

3D Gaussian Splatting (3DGS) has gained significant attention due to its high-quality novel view rendering, motivating research to address real-world challenges. A critical issue is the camera motion blur caused by movement during exposure, which hinders accurate 3D scene reconstruction. In this study, we propose CoMoGaussian, a Continuous Motion-Aware Gaussian Splatting that reconstructs precise 3D scenes from motion-blurred images while maintaining real-time rendering speed. Considering the complex motion patterns inherent in real-world camera movements, we predict continuous camera trajectories using neural ordinary differential equations (ODEs). To ensure accurate modeling, we employ rigid body transformations, preserving the shape and size of the object but rely on the discrete integration of sampled frames. To better approximate the continuous nature of motion blur, we introduce a continuous motion refinement (CMR) transformation that refines rigid transformations by incorporating additional learnable parameters. By revisiting fundamental camera theory and leveraging advanced neural ODE techniques, we achieve precise modeling of continuous camera trajectories, leading to improved reconstruction accuracy. Extensive experiments demonstrate state-of-the-art performance both quantitatively and qualitatively on benchmark datasets, which include a wide range of motion blur scenarios, from moderate to extreme blur.

CVNov 21, 2024
Transforming Static Images Using Generative Models for Video Salient Object Detection

Suhwan Cho, Minhyeok Lee, Jungho Lee et al.

In many video processing tasks, leveraging large-scale image datasets is a common strategy, as image data is more abundant and facilitates comprehensive knowledge transfer. A typical approach for simulating video from static images involves applying spatial transformations, such as affine transformations and spline warping, to create sequences that mimic temporal progression. However, in tasks like video salient object detection, where both appearance and motion cues are critical, these basic image-to-video techniques fail to produce realistic optical flows that capture the independent motion properties of each object. In this study, we show that image-to-video diffusion models can generate realistic transformations of static images while understanding the contextual relationships between image components. This ability allows the model to generate plausible optical flows, preserving semantic integrity while reflecting the independent motion of scene elements. By augmenting individual images in this way, we create large-scale image-flow pairs that significantly enhance model training. Our approach achieves state-of-the-art performance across all public benchmark datasets, outperforming existing approaches.

CVMay 2, 2023
Hybrid model for Single-Stage Multi-Person Pose Estimation

Jonghyun Kim, Bosang Kim, Hyotae Lee et al.

In general, human pose estimation methods are categorized into two approaches according to their architectures: regression (i.e., heatmap-free) and heatmap-based methods. The former one directly estimates precise coordinates of each keypoint using convolutional and fully-connected layers. Although this approach is able to detect overlapped and dense keypoints, unexpected results can be obtained by non-existent keypoints in a scene. On the other hand, the latter one is able to filter the non-existent ones out by utilizing predicted heatmaps for each keypoint. Nevertheless, it suffers from quantization error when obtaining the keypoint coordinates from its heatmaps. In addition, unlike the regression one, it is difficult to distinguish densely placed keypoints in an image. To this end, we propose a hybrid model for single-stage multi-person pose estimation, named HybridPose, which mutually overcomes each drawback of both approaches by maximizing their strengths. Furthermore, we introduce self-correlation loss to inject spatial dependencies between keypoint coordinates and their visibility. Therefore, HybridPose is capable of not only detecting densely placed keypoints, but also filtering the non-existent keypoints in an image. Experimental results demonstrate that proposed HybridPose exhibits the keypoints visibility without performance degradation in terms of the pose estimation accuracy.