CVApr 13, 2023
Soundini: Sound-Guided Diffusion for Natural Video EditingSeung Hyun Lee, Sieun Kim, Innfarn Yoo et al.
We propose a method for adding sound-guided visual effects to specific regions of videos with a zero-shot setting. Animating the appearance of the visual effect is challenging because each frame of the edited video should have visual changes while maintaining temporal consistency. Moreover, existing video editing solutions focus on temporal consistency across frames, ignoring the visual style variations over time, e.g., thunderstorm, wave, fire crackling. To overcome this limitation, we utilize temporal sound features for the dynamic style. Specifically, we guide denoising diffusion probabilistic models with an audio latent representation in the audio-visual latent space. To the best of our knowledge, our work is the first to explore sound-guided natural video editing from various sound sources with sound-specialized properties, such as intensity, timbre, and volume. Additionally, we design optical flow-based guidance to generate temporally consistent video frames, capturing the pixel-wise relationship between adjacent frames. Experimental results show that our method outperforms existing video editing techniques, producing more realistic visual effects that reflect the properties of sound. Please visit our page: https://kuai-lab.github.io/soundini-gallery/.
CVNov 14, 2022
IFQA: Interpretable Face Quality AssessmentByungho Jo, Donghyeon Cho, In Kyu Park et al.
Existing face restoration models have relied on general assessment metrics that do not consider the characteristics of facial regions. Recent works have therefore assessed their methods using human studies, which is not scalable and involves significant effort. This paper proposes a novel face-centric metric based on an adversarial framework where a generator simulates face restoration and a discriminator assesses image quality. Specifically, our per-pixel discriminator enables interpretable evaluation that cannot be provided by traditional metrics. Moreover, our metric emphasizes facial primary regions considering that even minor changes to the eyes, nose, and mouth significantly affect human cognition. Our face-oriented metric consistently surpasses existing general or facial image quality assessment metrics by impressive margins. We demonstrate the generalizability of the proposed strategy in various architectural designs and challenging scenarios. Interestingly, we find that our IFQA can lead to performance improvement as an objective function.
CVSep 13, 2022
Weakly-Supervised Stitching Network for Real-World Panoramic Image GenerationDae-Young Song, Geonsoo Lee, HeeKyung Lee et al.
Recently, there has been growing attention on an end-to-end deep learning-based stitching model. However, the most challenging point in deep learning-based stitching is to obtain pairs of input images with a narrow field of view and ground truth images with a wide field of view captured from real-world scenes. To overcome this difficulty, we develop a weakly-supervised learning mechanism to train the stitching model without requiring genuine ground truth images. In addition, we propose a stitching model that takes multiple real-world fisheye images as inputs and creates a 360 output image in an equirectangular projection format. In particular, our model consists of color consistency corrections, warping, and blending, and is trained by perceptual and SSIM losses. The effectiveness of the proposed algorithm is verified on two real-world stitching datasets.
CVOct 14, 2022
Lightweight Alpha Matting Network Using Distillation-Based Channel PruningDonggeun Yoon, Jinsun Park, Donghyeon Cho
Recently, alpha matting has received a lot of attention because of its usefulness in mobile applications such as selfies. Therefore, there has been a demand for a lightweight alpha matting model due to the limited computational resources of commercial portable devices. To this end, we suggest a distillation-based channel pruning method for the alpha matting networks. In the pruning step, we remove channels of a student network having fewer impacts on mimicking the knowledge of a teacher network. Then, the pruned lightweight student network is trained by the same distillation loss. A lightweight alpha matting model from the proposed method outperforms existing lightweight methods. To show superiority of our algorithm, we provide various quantitative and qualitative experiments with in-depth analyses. Furthermore, we demonstrate the versatility of the proposed distillation-based channel pruning method by applying it to semantic segmentation.
CVDec 5, 2023Code
Deterministic Guidance Diffusion Model for Probabilistic Weather ForecastingDonggeun Yoon, Minseok Seo, Doyi Kim et al.
Weather forecasting requires not only accuracy but also the ability to perform probabilistic prediction. However, deterministic weather forecasting methods do not support probabilistic predictions, and conversely, probabilistic models tend to be less accurate. To address these challenges, in this paper, we introduce the \textbf{\textit{D}}eterministic \textbf{\textit{G}}uidance \textbf{\textit{D}}iffusion \textbf{\textit{M}}odel (DGDM) for probabilistic weather forecasting, integrating benefits of both deterministic and probabilistic approaches. During the forward process, both the deterministic and probabilistic models are trained end-to-end. In the reverse process, weather forecasting leverages the predicted result from the deterministic model, using as an intermediate starting point for the probabilistic model. By fusing deterministic models with probabilistic models in this manner, DGDM is capable of providing accurate forecasts while also offering probabilistic predictions. To evaluate DGDM, we assess it on the global weather forecasting dataset (WeatherBench) and the common video frame prediction benchmark (Moving MNIST). We also introduce and evaluate the Pacific Northwest Windstorm (PNW)-Typhoon weather satellite dataset to verify the effectiveness of DGDM in high-resolution regional forecasting. As a result of our experiments, DGDM achieves state-of-the-art results not only in global forecasting but also in regional forecasting. The code is available at: \url{https://github.com/DongGeun-Yoon/DGDM}.
CVSep 19, 2025Code
Enriched Feature Representation and Motion Prediction Module for MOSEv2 Track of 7th LSVOS Challenge: 3rd Place SolutionChang Soo Lim, Joonyoung Moon, Donghyeon Cho
Video object segmentation (VOS) is a challenging task with wide applications such as video editing and autonomous driving. While Cutie provides strong query-based segmentation and SAM2 offers enriched representations via a pretrained ViT encoder, each has limitations in feature capacity and temporal modeling. In this report, we propose a framework that integrates their complementary strengths by replacing the encoder of Cutie with the ViT encoder of SAM2 and introducing a motion prediction module for temporal stability. We further adopt an ensemble strategy combining Cutie, SAM2, and our variant, achieving 3rd place in the MOSEv2 track of the 7th LSVOS Challenge. We refer to our final model as SCOPE (SAM2-CUTIE Object Prediction Ensemble). This demonstrates the effectiveness of enriched feature representation and motion prediction for robust video object segmentation. The code is available at https://github.com/2025-LSVOS-3rd-place/MOSEv2_3rd_place.
CVJul 3, 2020Code
Domain Adaptation without Source DataYoungeun Kim, Donghyeon Cho, Kyeongtak Han et al.
Domain adaptation assumes that samples from source and target domains are freely accessible during a training phase. However, such an assumption is rarely plausible in the real-world and possibly causes data-privacy issues, especially when the label of the source domain can be a sensitive attribute as an identifier. To avoid accessing source data that may contain sensitive information, we introduce Source data-Free Domain Adaptation (SFDA). Our key idea is to leverage a pre-trained model from the source domain and progressively update the target model in a self-learning manner. We observe that target samples with lower self-entropy measured by the pre-trained source model are more likely to be classified correctly. From this, we select the reliable samples with the self-entropy criterion and define these as class prototypes. We then assign pseudo labels for every target sample based on the similarity score with class prototypes. Furthermore, to reduce the uncertainty from the pseudo labeling process, we propose set-to-set distance-based filtering which does not require any tunable hyperparameters. Finally, we train the target model with the filtered pseudo labels with regularization from the pre-trained source model. Surprisingly, without direct usage of labeled source samples, our PrDA outperforms conventional domain adaptation methods on benchmark datasets. Our code is publicly available at https://github.com/youngryan1993/SFDA-SourceFreeDA
CVOct 13, 2025
LSVOS 2025 Challenge Report: Recent Advances in Complex Video Object SegmentationChang Liu, Henghui Ding, Kaining Ying et al.
This report presents an overview of the 7th Large-scale Video Object Segmentation (LSVOS) Challenge held in conjunction with ICCV 2025. Besides the two traditional tracks of LSVOS that jointly target robustness in realistic video scenarios: Classic VOS (VOS), and Referring VOS (RVOS), the 2025 edition features a newly introduced track, Complex VOS (MOSEv2). Building upon prior insights, MOSEv2 substantially increases difficulty, introducing more challenging but realistic scenarios including denser small objects, frequent disappear/reappear events, severe occlusions, adverse weather and lighting, etc., pushing long-term consistency and generalization beyond curated benchmarks. The challenge retains standard ${J}$, $F$, and ${J\&F}$ metrics for VOS and RVOS, while MOSEv2 adopts ${J\&\dot{F}}$ as the primary ranking metric to better evaluate objects across scales and disappearance cases. We summarize datasets and protocols, highlight top-performing solutions, and distill emerging trends, such as the growing role of LLM/MLLM components and memory-aware propagation, aiming to chart future directions for resilient, language-aware video segmentation in the wild.
CLSep 25, 2025
ReviewScore: Misinformed Peer Review Detection with Large Language ModelsHyun Ryu, Doohyuk Jang, Hyemin S. Lee et al.
Peer review serves as a backbone of academic research, but in most AI conferences, the review quality is degrading as the number of submissions explodes. To reliably detect low-quality reviews, we define misinformed review points as either "weaknesses" in a review that contain incorrect premises, or "questions" in a review that can be already answered by the paper. We verify that 15.2% of weaknesses and 26.4% of questions are misinformed and introduce ReviewScore indicating if a review point is misinformed. To evaluate the factuality of each premise of weaknesses, we propose an automated engine that reconstructs every explicit and implicit premise from a weakness. We build a human expert-annotated ReviewScore dataset to check the ability of LLMs to automate ReviewScore evaluation. Then, we measure human-model agreements on ReviewScore using eight current state-of-the-art LLMs and verify moderate agreements. We also prove that evaluating premise-level factuality shows significantly higher agreements than evaluating weakness-level factuality. A thorough disagreement analysis further supports a potential of fully automated ReviewScore evaluation.
CVApr 18, 2025
Early Timestep Zero-Shot Candidate Selection for Instruction-Guided Image EditingJoowon Kim, Ziseok Lee, Donghyeon Cho et al.
Despite recent advances in diffusion models, achieving reliable image generation and editing remains challenging due to the inherent diversity induced by stochastic noise in the sampling process. Instruction-guided image editing with diffusion models offers user-friendly capabilities, yet editing failures, such as background distortion, frequently occur. Users often resort to trial and error, adjusting seeds or prompts to achieve satisfactory results, which is inefficient. While seed selection methods exist for Text-to-Image (T2I) generation, they depend on external verifiers, limiting applicability, and evaluating multiple seeds increases computational complexity. To address this, we first establish a multiple-seed-based image editing baseline using background consistency scores, achieving Best-of-N performance without supervision. Building on this, we introduce ELECT (Early-timestep Latent Evaluation for Candidate Selection), a zero-shot framework that selects reliable seeds by estimating background mismatches at early diffusion timesteps, identifying the seed that retains the background while modifying only the foreground. ELECT ranks seed candidates by a background inconsistency score, filtering unsuitable samples early based on background consistency while preserving editability. Beyond standalone seed selection, ELECT integrates into instruction-guided editing pipelines and extends to Multimodal Large-Language Models (MLLMs) for joint seed and prompt selection, further improving results when seed selection alone is insufficient. Experiments show that ELECT reduces computational costs (by 41 percent on average and up to 61 percent) while improving background consistency and instruction adherence, achieving around 40 percent success rates in previously failed cases - without any external supervision or training.
CVMar 9, 2020
Restore from Restored: Video Restoration with Pseudo Clean VideoSeunghwan Lee, Donghyeon Cho, Jiwon Kim et al.
In this study, we propose a self-supervised video denoising method called "restore-from-restored." This method fine-tunes a pre-trained network by using a pseudo clean video during the test phase. The pseudo clean video is obtained by applying a noisy video to the baseline network. By adopting a fully convolutional neural network (FCN) as the baseline, we can improve video denoising performance without accurate optical flow estimation and registration steps, in contrast to many conventional video restoration methods, due to the translation equivariant property of the FCN. Specifically, the proposed method can take advantage of plentiful similar patches existing across multiple consecutive frames (i.e., patch-recurrence); these patches can boost the performance of the baseline network by a large margin. We analyze the restoration performance of the fine-tuned video denoising networks with the proposed self-supervision-based learning algorithm, and demonstrate that the FCN can utilize recurring patches without requiring accurate registration among adjacent frames. In our experiments, we apply the proposed method to state-of-the-art denoisers and show that our fine-tuned networks achieve a considerable improvement in denoising performance.
CVMar 9, 2020
Restore from Restored: Single Image Denoising with Pseudo Clean ImageSeunghwan Lee, Dongkyu Lee, Donghyeon Cho et al.
In this study, we propose a simple and effective fine-tuning algorithm called "restore-from-restored", which can greatly enhance the performance of fully pre-trained image denoising networks. Many supervised denoising approaches can produce satisfactory results using large external training datasets. However, these methods have limitations in using internal information available in a given test image. By contrast, recent self-supervised approaches can remove noise in the input image by utilizing information from the specific test input. However, such methods show relatively lower performance on known noise types such as Gaussian noise compared to supervised methods. Thus, to combine external and internal information, we fine-tune the fully pre-trained denoiser using pseudo training set at test time. By exploiting internal self-similar patches (i.e., patch-recurrence), the baseline network can be adapted to the given specific input image. We demonstrate that our method can be easily employed on top of the state-of-the-art denoising networks and further improve the performance on numerous denoising benchmark datasets including real noisy images.
CVJan 9, 2020
Fast Adaptation to Super-Resolution Networks via Meta-LearningSeobin Park, Jinsu Yoo, Donghyeon Cho et al.
Conventional supervised super-resolution (SR) approaches are trained with massive external SR datasets but fail to exploit desirable properties of the given test image. On the other hand, self-supervised SR approaches utilize the internal information within a test image but suffer from computational complexity in run-time. In this work, we observe the opportunity for further improvement of the performance of SISR without changing the architecture of conventional SR networks by practically exploiting additional information given from the input image. In the training stage, we train the network via meta-learning; thus, the network can quickly adapt to any input image at test time. Then, in the test stage, parameters of this meta-learned network are rapidly fine-tuned with only a few iterations by only using the given low-resolution image. The adaptation at the test time takes full advantage of patch-recurrence property observed in natural images. Our method effectively handles unknown SR kernels and can be applied to any existing model. We demonstrate that the proposed model-agnostic approach consistently improves the performance of conventional SR networks on various benchmark SR datasets.
CVJan 9, 2020
Self-Supervised Fast Adaptation for Denoising via Meta-LearningSeunghwan Lee, Donghyeon Cho, Jiwon Kim et al.
Under certain statistical assumptions of noise, recent self-supervised approaches for denoising have been introduced to learn network parameters without true clean images, and these methods can restore an image by exploiting information available from the given input (i.e., internal statistics) at test time. However, self-supervised methods are not yet combined with conventional supervised denoising methods which train the denoising networks with a large number of external training samples. Thus, we propose a new denoising approach that can greatly outperform the state-of-the-art supervised denoising methods by adapting their network parameters to the given input through selfsupervision without changing the networks architectures. Moreover, we propose a meta-learning algorithm to enable quick adaptation of parameters to the specific input at test time. We demonstrate that the proposed method can be easily employed with state-of-the-art denoising networks without additional parameters, and achieve state-of-the-art performance on numerous benchmark datasets.
CVAug 21, 2019
Preserving Semantic and Temporal Consistency for Unpaired Video-to-Video TranslationKwanyong Park, Sanghyun Woo, Dahun Kim et al.
In this paper, we investigate the problem of unpaired video-to-video translation. Given a video in the source domain, we aim to learn the conditional distribution of the corresponding video in the target domain, without seeing any pairs of corresponding videos. While significant progress has been made in the unpaired translation of images, directly applying these methods to an input video leads to low visual quality due to the additional time dimension. In particular, previous methods suffer from semantic inconsistency (i.e., semantic label flipping) and temporal flickering artifacts. To alleviate these issues, we propose a new framework that is composed of carefully-designed generators and discriminators, coupled with two core objective functions: 1) content preserving loss and 2) temporal consistency loss. Extensive qualitative and quantitative evaluations demonstrate the superior performance of the proposed method against previous approaches. We further apply our framework to a domain adaptation task and achieve favorable results.
CVJun 18, 2019
Key Instance Selection for Unsupervised Video Object SegmentationDonghyeon Cho, Sungeun Hong, Sungil Kang et al.
This paper proposes key instance selection based on video saliency covering objectness and dynamics for unsupervised video object segmentation (UVOS). Our method takes frames sequentially and extracts object proposals with corresponding masks for each frame. We link objects according to their similarity until the M-th frame and then assign them unique IDs (i.e., instances). Similarity measure takes into account multiple properties such as ReID descriptor, expected trajectory, and semantic co-segmentation result. After M-th frame, we select K IDs based on video saliency and frequency of appearance; then only these key IDs are tracked through the remaining frames. Thanks to these technical contributions, our results are ranked third on the leaderboard of UVOS DAVIS challenge.
CVNov 24, 2018
Self-Supervised Video Representation Learning with Space-Time Cubic PuzzlesDahun Kim, Donghyeon Cho, In So Kweon
Self-supervised tasks such as colorization, inpainting and zigsaw puzzle have been utilized for visual representation learning for still images, when the number of labeled images is limited or absent at all. Recently, this worthwhile stream of study extends to video domain where the cost of human labeling is even more expensive. However, the most of existing methods are still based on 2D CNN architectures that can not directly capture spatio-temporal information for video applications. In this paper, we introduce a new self-supervised task called as \textit{Space-Time Cubic Puzzles} to train 3D CNNs using large scale video dataset. This task requires a network to arrange permuted 3D spatio-temporal crops. By completing \textit{Space-Time Cubic Puzzles}, the network learns both spatial appearance and temporal relation of video frames, which is our final goal. In experiments, we demonstrate that our learned 3D representation is well transferred to action recognition tasks, and outperforms state-of-the-art 2D CNN-based competitors on UCF101 and HMDB51 datasets.
CVNov 24, 2018
Discriminative Feature Learning for Unsupervised Video SummarizationYunjae Jung, Donghyeon Cho, Dahun Kim et al.
In this paper, we address the problem of unsupervised video summarization that automatically extracts key-shots from an input video. Specifically, we tackle two critical issues based on our empirical observations: (i) Ineffective feature learning due to flat distributions of output importance scores for each frame, and (ii) training difficulty when dealing with long-length video inputs. To alleviate the first problem, we propose a simple yet effective regularization loss term called variance loss. The proposed variance loss allows a network to predict output scores for each frame with high discrepancy which enables effective feature learning and significantly improves model performance. For the second problem, we design a novel two-stream network named Chunk and Stride Network (CSNet) that utilizes local (chunk) and global (stride) temporal view on the video features. Our CSNet gives better summarization results for long-length videos compared to the existing methods. In addition, we introduce an attention mechanism to handle the dynamic information in videos. We demonstrate the effectiveness of the proposed methods by conducting extensive ablation studies and show that our final model achieves new state-of-the-art results on two benchmark datasets.
CVNov 15, 2018
LinkNet: Relational Embedding for Scene GraphSanghyun Woo, Dahun Kim, Donghyeon Cho et al.
Objects and their relationships are critical contents for image understanding. A scene graph provides a structured description that captures these properties of an image. However, reasoning about the relationships between objects is very challenging and only a few recent works have attempted to solve the problem of generating a scene graph from an image. In this paper, we present a method that improves scene graph generation by explicitly modeling inter-dependency among the entire object instances. We design a simple and effective relational embedding module that enables our model to jointly represent connections among all related objects, rather than focus on an object in isolation. Our method significantly benefits the main part of the scene graph generation task: relationship classification. Using it on top of a basic Faster R-CNN, our model achieves state-of-the-art results on the Visual Genome benchmark. We further push the performance by introducing global context encoding module and geometrical layout encoding module. We validate our final model, LinkNet, through extensive ablation studies, demonstrating its efficacy in scene graph generation.
CVFeb 6, 2018
Learning Image Representations by Completing Damaged Jigsaw PuzzlesDahun Kim, Donghyeon Cho, Donggeun Yoo et al.
In this paper, we explore methods of complicating self-supervised tasks for representation learning. That is, we do severe damage to data and encourage a network to recover them. First, we complicate each of three powerful self-supervised task candidates: jigsaw puzzle, inpainting, and colorization. In addition, we introduce a novel complicated self-supervised task called "Completing damaged jigsaw puzzles" which is puzzles with one piece missing and the other pieces without color. We train a convolutional neural network not only to solve the puzzles, but also generate the missing content and colorize the puzzles. The recovery of the aforementioned damage pushes the network to obtain robust and general-purpose representations. We demonstrate that complicating the self-supervised tasks improves their original versions and that our final task learns more robust and transferable representations compared to the previous methods, as well as the simple combination of our candidate tasks. Our approach achieves state-of-the-art performance in transfer learning on PASCAL classification and semantic segmentation.
CVAug 9, 2017
Weakly- and Self-Supervised Learning for Content-Aware Deep Image RetargetingDonghyeon Cho, Jinsun Park, Tae-Hyun Oh et al.
This paper proposes a weakly- and self-supervised deep convolutional neural network (WSSDCNN) for content-aware image retargeting. Our network takes a source image and a target aspect ratio, and then directly outputs a retargeted image. Retargeting is performed through a shift map, which is a pixel-wise mapping from the source to the target grid. Our method implicitly learns an attention map, which leads to a content-aware shift map for image retargeting. As a result, discriminative parts in an image are preserved, while background regions are adjusted seamlessly. In the training phase, pairs of an image and its image-level annotation are used to compute content and structure losses. We demonstrate the effectiveness of our proposed method for a retargeting application with insightful analyses.
CVAug 7, 2017
Two-Phase Learning for Weakly Supervised Object LocalizationDahun Kim, Donghyeon Cho, Donggeun Yoo et al.
Weakly supervised semantic segmentation and localiza- tion have a problem of focusing only on the most important parts of an image since they use only image-level annota- tions. In this paper, we solve this problem fundamentally via two-phase learning. Our networks are trained in two steps. In the first step, a conventional fully convolutional network (FCN) is trained to find the most discriminative parts of an image. In the second step, the activations on the most salient parts are suppressed by inference conditional feedback, and then the second learning is performed to find the area of the next most important parts. By combining the activations of both phases, the entire portion of the tar- get object can be captured. Our proposed training scheme is novel and can be utilized in well-designed techniques for weakly supervised semantic segmentation, salient region detection, and object location prediction. Detailed experi- ments demonstrate the effectiveness of our two-phase learn- ing in each task.
CVApr 28, 2017
A Unified Approach of Multi-scale Deep and Hand-crafted Features for Defocus EstimationJinsun Park, Yu-Wing Tai, Donghyeon Cho et al.
In this paper, we introduce robust and synergetic hand-crafted features and a simple but efficient deep feature from a convolutional neural network (CNN) architecture for defocus estimation. This paper systematically analyzes the effectiveness of different features, and shows how each feature can compensate for the weaknesses of other features when they are concatenated. For a full defocus map estimation, we extract image patches on strong edges sparsely, after which we use them for deep and hand-crafted feature extraction. In order to reduce the degree of patch-scale dependency, we also propose a multi-scale patch extraction strategy. A sparse defocus map is generated using a neural network classifier followed by a probability-joint bilateral filter. The final defocus map is obtained from the sparse defocus map with guidance from an edge-preserving filtered input image. Experimental results show that our algorithm is superior to state-of-the-art algorithms in terms of defocus estimation. Our work can be used for applications such as segmentation, blur magnification, all-in-focus image generation, and 3-D estimation.