Seunghoon Lee

CV
h-index: 34
20 papers
226 citations
Novelty: 51%
AI Score: 54

20 Papers

CV Sep 8, 2022
Unsupervised Video Object Segmentation via Prototype Memory Network

Minhyeok Lee, Suhwan Cho, Seunghoon Lee et al.

Unsupervised video object segmentation aims to segment a target object in the video without a ground truth mask in the initial frame. This challenging task requires extracting features for the most salient common objects within a video sequence. This difficulty can be solved by using motion information such as optical flow, but using only the information between adjacent frames results in poor connectivity between distant frames and poor performance. To solve this problem, we propose a novel prototype memory network architecture. The proposed model effectively extracts the RGB and motion information by extracting superpixel-based component prototypes from the input RGB images and optical flow maps. In addition, the model scores the usefulness of the component prototypes in each frame based on a self-learning algorithm and adaptively stores the most useful prototypes in memory and discards obsolete prototypes. We use the prototypes in the memory bank to predict the next query frame's mask, which enhances the association between distant frames to help with accurate mask prediction. Our method is evaluated on three datasets, achieving state-of-the-art performance. We prove the effectiveness of the proposed model with various ablation studies.

CV Sep 4, 2022
Treating Motion as Option to Reduce Motion Dependency in Unsupervised Video Object Segmentation

Suhwan Cho, Minhyeok Lee, Seunghoon Lee et al.

Unsupervised video object segmentation (VOS) aims to detect the most salient object in a video sequence at the pixel level. In unsupervised VOS, most state-of-the-art methods leverage motion cues obtained from optical flow maps in addition to appearance cues to exploit the property that salient objects usually have distinctive movements compared to the background. However, as they are overly dependent on motion cues, which may be unreliable in some cases, they cannot achieve stable prediction. To reduce this motion dependency of existing two-stream VOS methods, we propose a novel motion-as-option network that optionally utilizes motion cues. Additionally, to fully exploit the property of the proposed network that motion is not always required, we introduce a collaborative network learning strategy. On all the public benchmark datasets, our proposed network affords state-of-the-art performance with real-time inference speed.

CV Nov 22, 2022
Dual Prototype Attention for Unsupervised Video Object Segmentation

Suhwan Cho, Minhyeok Lee, Seunghoon Lee et al.

Unsupervised video object segmentation (VOS) aims to detect and segment the most salient object in videos. The primary techniques used in unsupervised VOS are 1) the collaboration of appearance and motion information; and 2) temporal fusion between different frames. This paper proposes two novel prototype-based attention mechanisms, inter-modality attention (IMA) and inter-frame attention (IFA), to incorporate these techniques via dense propagation across different modalities and frames. IMA densely integrates context information from different modalities based on a mutual refinement. IFA injects global context of a video to the query frame, enabling a full utilization of useful properties from multiple frames. Experimental results on public benchmark datasets demonstrate that our proposed approach outperforms all existing methods by a substantial margin. The proposed two components are also thoroughly validated via ablative study.

SY Oct 19, 2022
RT-MOT: Confidence-Aware Real-Time Scheduling Framework for Multi-Object Tracking Tasks

Donghwa Kang, Seunghoon Lee, Hoon Sung Chwa et al.

Different from existing MOT (Multi-Object Tracking) techniques that usually aim at improving tracking accuracy and average FPS, real-time systems such as autonomous vehicles necessitate new requirements of MOT under limited computing resources: (R1) guarantee of timely execution and (R2) high tracking accuracy. In this paper, we propose RT-MOT, a novel system design for multiple MOT tasks, which addresses R1 and R2. Focusing on multiple choices of a workload pair of detection and association, which are two main components of the tracking-by-detection approach for MOT, we tailor a measure of object confidence for RT-MOT and develop a method to estimate the measure for the next frame of each MOT task. By utilizing the estimation, we make it possible to predict tracking accuracy variation according to different workload pairs to be applied to the next frame of an MOT task. Next, we develop a novel confidence-aware real-time scheduling framework, which offers an offline timing guarantee for a set of MOT tasks based on non-preemptive fixed-priority scheduling with the smallest workload pair. At run-time, the framework checks the feasibility of a priority-inversion associated with a larger workload pair, which does not compromise the timing guarantee of every task, and then chooses a feasible scenario that yields the largest tracking accuracy improvement based on the proposed prediction. Our experimental results demonstrate that RT-MOT significantly improves overall tracking accuracy by up to 1.5x, compared to existing popular tracking-by-detection approaches, while guaranteeing timely execution of all MOT tasks.

CV Mar 8, 2023
TSANet: Temporal and Scale Alignment for Unsupervised Video Object Segmentation

Seunghoon Lee, Suhwan Cho, Dogyoon Lee et al.

Unsupervised Video Object Segmentation (UVOS) refers to the challenging task of segmenting the prominent object in videos without manual guidance. Recent works on UVOS fall into two groups, appearance-based and appearance-motion-based methods, each of which has limitations. Appearance-based methods do not consider the motion of the target object because they exploit the correlation information between randomly paired frames. Appearance-motion-based methods fuse appearance with motion and are therefore dominated by their dependency on optical flow. In this paper, we propose a novel framework for UVOS that can address the aforementioned limitations of the two approaches in terms of both time and scale. Temporal Alignment Fusion aligns the saliency information of adjacent frames with the target frame to leverage the information of adjacent frames. Scale Alignment Decoder predicts the target object mask by aggregating multi-scale feature maps via continuous mapping with implicit neural representation. We present experimental results on public benchmark datasets, DAVIS 2016 and FBMS, which demonstrate the effectiveness of our method. Furthermore, we outperform the state-of-the-art methods on DAVIS 2016.

CV Sep 4, 2022
Pixel-Level Equalized Matching for Video Object Segmentation

Suhwan Cho, Woo Jin Kim, MyeongAh Cho et al.

Feature similarity matching, which transfers the information of the reference frame to the query frame, is a key component in semi-supervised video object segmentation. If surjective matching is adopted, background distractors can easily occur and degrade the performance. Bijective matching mechanisms try to prevent this by restricting the amount of information being transferred to the query frame, but have two limitations: 1) surjective matching cannot be fully leveraged as it is transformed to bijective matching at test time; and 2) test-time manual tuning is required for searching the optimal hyper-parameters. To overcome these limitations while ensuring reliable information transfer, we introduce an equalized matching mechanism. To prevent the reference frame information from being overly referenced, the potential contribution to the query frame is equalized by simply applying a softmax operation along the query dimension. On public benchmark datasets, our proposed approach achieves performance comparable to state-of-the-art methods.
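
A minimal sketch of the equalization idea described above, assuming the softmax is taken over the query axis of a reference-to-query similarity matrix, so that every reference pixel distributes a total weight of 1 across the query pixels. The function name `equalized_matching`, the cosine-similarity choice, and the toy feature shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def equalized_matching(ref_feats, qry_feats):
    """Return raw and equalized similarity, both of shape (n_ref, n_qry).

    ref_feats: (n_ref, c) reference-frame pixel features (hypothetical layout)
    qry_feats: (n_qry, c) query-frame pixel features (hypothetical layout)
    """
    ref = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    qry = qry_feats / np.linalg.norm(qry_feats, axis=1, keepdims=True)
    sim = ref @ qry.T                 # cosine similarity, (n_ref, n_qry)
    equalized = softmax(sim, axis=1)  # assumption: softmax over the query axis
    return sim, equalized


rng = np.random.default_rng(0)
ref = rng.normal(size=(5, 8))   # 5 reference pixels, 8 channels
qry = rng.normal(size=(7, 8))   # 7 query pixels, 8 channels
sim, eq = equalized_matching(ref, qry)

# Per-query readout: strongest equalized response over all reference pixels.
matched = eq.max(axis=0)
print(eq.sum(axis=1))  # each reference pixel's total contribution
print(matched.shape)
```

Because each reference row sums to 1 in this sketch, no single reference pixel can contribute strongly to many query pixels at once, which is the effect the abstract attributes to equalization.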

CV Jul 9, 2024
Sparse-DeRF: Deblurred Neural Radiance Fields from Sparse View

Dogyoon Lee, Donghyeong Kim, Jungho Lee et al.

Recent studies construct deblurred neural radiance fields (DeRF) using dozens of blurry images, which is not practical when only a limited number of blurry images are available. This paper focuses on constructing DeRF from sparse views for more pragmatic real-world scenarios. As observed in our experiments, establishing DeRF from sparse views proves to be a more challenging problem due to the inherent complexity arising from the simultaneous optimization of blur kernels and NeRF from sparse views. Sparse-DeRF successfully regularizes the complicated joint optimization, presenting alleviated overfitting artifacts and enhanced quality on radiance fields. The regularization consists of three key components: surface smoothness, which helps the model accurately predict the scene structure utilizing unseen and additional hidden rays derived from the blur kernel based on statistical tendencies of the real world; modulated gradient scaling, which helps the model adjust the amount of the backpropagated gradient according to the arrangements of scene objects; and perceptual distillation, which improves the perceptual quality by overcoming the ill-posed multi-view inconsistency of image deblurring and distilling the pre-deblurred information, compensating for the lack of clean information in blurry images. We demonstrate the effectiveness of Sparse-DeRF with extensive quantitative and qualitative experimental results by training DeRF from 2-view, 4-view, and 6-view blurry images.

CV Jul 16, 2024
Improving Unsupervised Video Object Segmentation via Fake Flow Generation

Suhwan Cho, Minhyeok Lee, Jungho Lee et al.

Unsupervised video object segmentation (VOS), also known as video salient object detection, aims to detect the most prominent object in a video at the pixel level. Recently, two-stream approaches that leverage both RGB images and optical flow maps have gained significant attention. However, the limited amount of training data remains a substantial challenge. In this study, we propose a novel data generation method that simulates fake optical flows from single images, thereby creating large-scale training data for stable network learning. Inspired by the observation that optical flow maps are highly dependent on depth maps, we generate fake optical flows by refining and augmenting the estimated depth maps of each image. By incorporating our simulated image-flow pairs, we achieve new state-of-the-art performance on all public benchmark datasets without relying on complex modules. We believe that our data generation method represents a potential breakthrough for future VOS research.

33.5 CV Apr 16
CMTM: Cross-Modal Token Modulation for Unsupervised Video Object Segmentation

Inseok Jeon, Suhwan Cho, Minhyeok Lee et al.

Recent advances in unsupervised video object segmentation have highlighted the potential of two-stream architectures that integrate appearance and motion cues. However, fully leveraging these complementary sources of information requires effectively modeling their interdependencies. In this paper, we introduce cross-modality token modulation, a novel approach designed to strengthen the interaction between appearance and motion cues. Our method establishes dense connections between tokens from each modality, enabling efficient intra-modal and inter-modal information propagation through relation transformer blocks. To improve learning efficiency, we incorporate a token masking strategy that addresses the limitations of relying solely on increased model complexity. Our approach achieves state-of-the-art performance across all public benchmarks, outperforming existing methods.

56.6 CV Apr 16
Seen-to-Scene: Keep the Seen, Generate the Unseen for Video Outpainting

Inseok Jeon, Minhyeok Lee, Seunghoon Lee et al.

Video outpainting aims to expand the visible content of a video beyond the original frame boundaries while preserving spatial fidelity and temporal coherence across frames. Existing methods primarily rely on large-scale generative models, such as diffusion models. However, generation-based approaches suffer from implicit temporal modeling and limited spatial context. These limitations lead to intra-frame and inter-frame inconsistencies, which become particularly pronounced in dynamic scenes and large outpainting scenarios. To overcome these challenges, we propose Seen-to-Scene, a novel framework that unifies propagation-based and generation-based paradigms for video outpainting. Specifically, Seen-to-Scene leverages flow-based propagation with a flow completion network pre-trained for video inpainting, which is fine-tuned in an end-to-end manner to bridge the domain gap and reconstruct coherent motion fields. To further improve the efficiency and reliability of propagation, we introduce a reference-guided latent propagation that effectively propagates source content across frames. Extensive experiments demonstrate that our method achieves superior temporal coherence and visual realism with efficient inference, surpassing even prior state-of-the-art methods that require input-specific adaptation.

LG Feb 25
Easy to Learn, Yet Hard to Forget: Towards Robust Unlearning Under Bias

JuneHyoung Kwon, MiHyeon Kim, Eunju Lee et al.

Machine unlearning, which enables a model to forget specific data, is crucial for ensuring data privacy and model reliability. However, its effectiveness can be severely undermined in real-world scenarios where models learn unintended biases from spurious correlations within the data. This paper investigates the unique challenges of unlearning from such biased models. We identify a novel phenomenon we term "shortcut unlearning," where models exhibit an "easy to learn, yet hard to forget" tendency. Specifically, models struggle to forget easily-learned, bias-aligned samples; instead of forgetting the class attribute, they unlearn the bias attribute, which can paradoxically improve accuracy on the class intended to be forgotten. To address this, we propose CUPID, a new unlearning framework inspired by the observation that samples with different biases exhibit distinct loss landscape sharpness. Our method first partitions the forget set into causal- and bias-approximated subsets based on sample sharpness, then disentangles model parameters into causal and bias pathways, and finally performs a targeted update by routing refined causal and bias gradients to their respective pathways. Extensive experiments on biased datasets including Waterbirds, BAR, and Biased NICO++ demonstrate that our method achieves state-of-the-art forgetting performance and effectively mitigates the shortcut unlearning problem.

19.5 CV May 11
A Comparative Study of Machine Learning and Deep Learning for Out-of-Distribution Detection

Jihyeon Baek, Seunghoon Lee, Gitaek Kwon et al.

Out-of-distribution (OOD) detection is essential for building reliable AI systems, as models that produce outputs for invalid inputs cannot be trusted. Although deep learning (DL) is often assumed to outperform traditional machine learning (ML), medical imaging data are typically acquired under standardized protocols, leading to relatively constrained image variability in OOD detection tasks. This motivates a direct comparison between ML and DL approaches in this setting. The two approaches are evaluated on open datasets comprising over 60,000 fundus and non-fundus images across multiple resolutions. Both approaches achieved an AUROC of 1.000 and accuracies between 0.999 and 1.000 on internal and external validation sets, showing comparable detection performance. The ML approach, however, exhibited substantially lower end-to-end latency while maintaining equivalent accuracy, indicating greater computational efficiency. These results suggest that for OOD detection tasks of limited visual complexity, lightweight ML approaches can achieve DL-level performance with significantly reduced computational cost, supporting practical real-world deployment.

CV Mar 5, 2025
Find First, Track Next: Decoupling Identification and Propagation in Referring Video Object Segmentation

Suhwan Cho, Seunghoon Lee, Minhyeok Lee et al.

Referring video object segmentation aims to segment and track a target object in a video using a natural language prompt. Existing methods typically fuse visual and textual features in a highly entangled manner, processing multi-modal information together to generate per-frame masks. However, this approach often struggles with ambiguous target identification, particularly in scenes with multiple similar objects, and fails to ensure consistent mask propagation across frames. To address these limitations, we introduce FindTrack, an efficient decoupled framework that separates target identification from mask propagation. FindTrack first adaptively selects a key frame by balancing segmentation confidence and vision-text alignment, establishing a robust reference for the target object. This reference is then utilized by a dedicated propagation module to track and segment the object across the entire video. By decoupling these processes, FindTrack effectively reduces ambiguities in target association and enhances segmentation consistency. FindTrack significantly outperforms all existing methods on public benchmarks, demonstrating its superiority.

CV Oct 23, 2025
AccuQuant: Simulating Multiple Denoising Steps for Quantizing Diffusion Models

Seunghoon Lee, Jeongwoo Choi, Byunggwan Son et al.

We present in this paper a novel post-training quantization (PTQ) method, dubbed AccuQuant, for diffusion models. We show analytically and empirically that quantization errors for diffusion models are accumulated over denoising steps in a sampling process. To alleviate the error accumulation problem, AccuQuant minimizes the discrepancies between outputs of a full-precision diffusion model and its quantized version within a couple of denoising steps. That is, it simulates multiple denoising steps of a diffusion sampling process explicitly for quantization, accounting for the accumulated errors over multiple denoising steps, which is in contrast to previous approaches that imitate the training process of diffusion models, namely, minimizing the discrepancies independently for each step. We also present an efficient implementation technique for AccuQuant, together with a novel objective, which reduces the memory complexity significantly from $\mathcal{O}(n)$ to $\mathcal{O}(1)$, where $n$ is the number of denoising steps. We demonstrate the efficacy and efficiency of AccuQuant across various tasks and diffusion models on standard benchmarks.

QUANT-PH Oct 8, 2021
The Parallel Reversible Pebbling Game: Analyzing the Post-Quantum Security of iMHFs

Jeremiah Blocki, Blake Holman, Seunghoon Lee

The classical (parallel) black pebbling game is a useful abstraction which allows us to analyze the resources (space, space-time, cumulative space) necessary to evaluate a function $f$ with a static data-dependency graph $G$. Of particular interest in the field of cryptography are data-independent memory-hard functions $f_{G,H}$ which are defined by a directed acyclic graph (DAG) $G$ and a cryptographic hash function $H$. The pebbling complexity of the graph $G$ characterizes the amortized cost of evaluating $f_{G,H}$ multiple times as well as the total cost to run a brute-force preimage attack over a fixed domain $\mathcal{X}$, i.e., given $y \in \{0,1\}^*$ find $x \in \mathcal{X}$ such that $f_{G,H}(x)=y$. While a classical attacker will need to evaluate the function $f_{G,H}$ at least $m=|\mathcal{X}|$ times, a quantum attacker running Grover's algorithm requires only $\mathcal{O}(\sqrt{m})$ blackbox calls to a quantum circuit $C_{G,H}$ evaluating the function $f_{G,H}$. Thus, to analyze the cost of a quantum attack it is crucial to understand the space-time cost (equivalently width times depth) of the quantum circuit $C_{G,H}$. We first observe that a legal black pebbling strategy for the graph $G$ does not necessarily imply the existence of a quantum circuit with comparable complexity -- in contrast to the classical setting where any efficient pebbling strategy for $G$ corresponds to an algorithm with comparable complexity evaluating $f_{G,H}$. Motivated by this observation, we introduce a new parallel reversible pebbling game which captures additional restrictions imposed by the No-Deletion Theorem in Quantum Computing. We apply our new reversible pebbling game to analyze the reversible space-time complexity of several important graphs: Line Graphs, Argon2i-A, Argon2i-B, and DRSample. (See the paper for the full abstract.)

DS Oct 8, 2021
On Explicit Constructions of Extremely Depth Robust Graphs

Jeremiah Blocki, Mike Cinkoske, Seunghoon Lee et al.

A directed acyclic graph $G=(V,E)$ is said to be $(e,d)$-depth-robust if for every subset $S \subseteq V$ of $|S| \leq e$ nodes the graph $G-S$ still contains a directed path of length $d$. If the graph is $(e,d)$-depth-robust for any $e,d$ such that $e+d \leq (1-\epsilon)|V|$ then the graph is said to be $\epsilon$-extreme depth-robust. In the field of cryptography, (extremely) depth-robust graphs with low indegree have found numerous applications including the design of side-channel resistant Memory-Hard Functions, Proofs of Space and Replication, and in the design of Computationally Relaxed Locally Correctable Codes. In these applications, it is desirable to ensure the graphs are locally navigable, i.e., there is an efficient algorithm $\mathsf{GetParents}$ running in time $\mathrm{polylog} |V|$ which takes as input a node $v \in V$ and returns the set of $v$'s parents. We give the first explicit construction of locally navigable $\epsilon$-extreme depth-robust graphs with indegree $O(\log |V|)$. Previous constructions of $\epsilon$-extreme depth-robust graphs either had indegree $\tilde{\omega}(\log^2 |V|)$ or were not explicit.
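
The depth-robustness definition above can be checked directly on tiny graphs. The following is a brute-force sketch, not the paper's construction: it enumerates every removal set of size at most $e$, so it is exponential and only suitable for illustration. It assumes that path length is counted in nodes and that nodes are numbered in topological order; the helper names `longest_path_nodes` and `is_depth_robust` are hypothetical.

```python
from itertools import combinations


def longest_path_nodes(n, edges, removed):
    """Number of nodes on the longest directed path in G - removed.

    Assumes nodes are 0..n-1 and every edge (u, v) has u < v,
    so the natural order 0, 1, ..., n-1 is a topological order.
    """
    parents = [[] for _ in range(n)]
    for u, v in edges:
        parents[v].append(u)
    depth = [0] * n  # depth[v] = nodes on the longest surviving path ending at v
    for v in range(n):
        if v in removed:
            continue
        depth[v] = 1 + max(
            (depth[u] for u in parents[v] if u not in removed), default=0
        )
    return max(depth, default=0)


def is_depth_robust(n, edges, e, d):
    """True if removing any set of at most e nodes leaves a path with d nodes."""
    for k in range(e + 1):
        for subset in combinations(range(n), k):
            if longest_path_nodes(n, edges, set(subset)) < d:
                return False
    return True


line = [(0, 1), (1, 2), (2, 3)]                                  # path on 4 nodes
complete = [(u, v) for u in range(4) for v in range(u + 1, 4)]   # complete DAG

print(is_depth_robust(4, line, 1, 2))      # removing a middle node leaves 2 nodes
print(is_depth_robust(4, line, 1, 3))
print(is_depth_robust(4, complete, 2, 2))
```

The line graph loses most of its depth when a single middle node is removed, while the complete DAG keeps a path through all surviving nodes; the complete DAG, however, has indegree linear in $|V|$, which is why low-indegree constructions are the interesting case.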

SP Sep 14, 2021
Bayesian AirComp with Sign-Alignment Precoding for Wireless Federated Learning

Chanho Park, Seunghoon Lee, Namyoon Lee

In this paper, we consider the problem of wireless federated learning based on the sign stochastic gradient descent (signSGD) algorithm via a multiple access channel. When sending the sign information of locally computed gradients, each mobile device must apply precoding to circumvent wireless fading effects. In practice, however, acquiring perfect knowledge of channel state information (CSI) at all mobile devices is infeasible. In this paper, we present a simple yet effective precoding method with limited channel knowledge, called sign-alignment precoding. The idea of sign-alignment precoding is to protect against sign-flipping errors caused by wireless fading. Under the Gaussian prior assumption on the local gradients, we also derive the mean squared error (MSE)-optimal aggregation function, called Bayesian over-the-air computation (BayAirComp). Our key finding is that one-bit precoding with BayAirComp aggregation can provide better learning performance than the existing precoding method, even when the latter uses perfect CSI with AirComp aggregation.

SP Dec 31, 2020
Bayesian Federated Learning over Wireless Networks

Seunghoon Lee, Chanho Park, Song-Nam Hong et al.

Federated learning is a privacy-preserving and distributed training method using heterogeneous data sets stored at local devices. Federated learning over wireless networks requires aggregating locally computed gradients at a server where the mobile devices send statistically distinct gradient information over heterogeneous communication links. This paper proposes a Bayesian federated learning (BFL) algorithm to aggregate the heterogeneous quantized gradient information optimally in the sense of minimizing the mean-squared error (MSE). The idea of BFL is to aggregate the one-bit quantized local gradients at the server by jointly exploiting i) the prior distributions of the local gradients, ii) the gradient quantizer function, and iii) channel distributions. Implementing BFL requires high communication and computational costs as the number of mobile devices increases. To address this challenge, we also present an efficient modified BFL algorithm called scalable-BFL (SBFL). In SBFL, we assume a simplified distribution on the local gradient. Each mobile device sends its one-bit quantized local gradient together with two scalar parameters representing this distribution. The server then aggregates the noisy and faded quantized gradients to minimize the MSE. We provide a convergence analysis of SBFL for a class of non-convex loss functions. Our analysis elucidates how the parameters of communication channels and the gradient priors affect convergence. From simulations, we demonstrate that SBFL considerably outperforms the conventional sign stochastic gradient descent algorithm when training and testing neural networks using MNIST data sets over heterogeneous wireless networks.
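
For orientation, the sketch below shows one aggregation step of the conventional one-bit baseline that the abstract compares against, under the assumption that the server combines the device signs by majority vote over an ideal, noiseless link. It is not BFL or SBFL: it uses no gradient priors, channel model, or MSE-optimal estimator. The function name `signsgd_majority_step` and the toy gradients are illustrative.

```python
import numpy as np


def signsgd_majority_step(weights, local_grads, lr):
    """One signSGD step with majority-vote aggregation (assumed baseline).

    weights:     (p,) current model parameters
    local_grads: (k, p) gradients computed at k devices
    lr:          step size
    """
    one_bit = np.sign(local_grads)     # each device transmits only the signs
    vote = np.sign(one_bit.sum(axis=0))  # server takes the elementwise majority
    return weights - lr * vote


w = np.zeros(2)
grads = np.array([
    [0.5, -1.0],   # device 1
    [0.2, -0.3],   # device 2
    [-0.1, -2.0],  # device 3
])
w_new = signsgd_majority_step(w, grads, lr=0.1)
print(w_new)
```

Note that the majority vote ignores gradient magnitudes entirely: device 3's large second coordinate counts exactly as much as device 2's small one. The abstract's Bayesian aggregation is motivated by exploiting prior and channel information that this baseline discards.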

CR Jun 19, 2020
On the Security of Proofs of Sequential Work in a Post-Quantum World

Jeremiah Blocki, Seunghoon Lee, Samson Zhou

A Proof of Sequential Work (PoSW) allows a prover to convince a resource-bounded verifier that the prover invested a substantial amount of sequential time to perform some underlying computation. PoSWs have many applications including time-stamping, blockchain design, and universally verifiable CPU benchmarks. Mahmoody, Moran, and Vadhan (ITCS 2013) gave the first construction of a PoSW in the random oracle model though the construction relied on expensive depth-robust graphs. In a recent breakthrough, Cohen and Pietrzak (EUROCRYPT 2018) gave an efficient PoSW construction that does not require expensive depth-robust graphs. In the classical parallel random oracle model, it is straightforward to argue that any successful PoSW attacker must produce a long $\mathcal{H}$-sequence and that any malicious party running in sequential time $T-1$ will fail to produce an $\mathcal{H}$-sequence of length $T$ except with negligible probability. In this paper, we prove that any quantum attacker running in sequential time $T-1$ will fail to produce an $\mathcal{H}$-sequence except with negligible probability -- even if the attacker submits a large batch of quantum queries in each round. The proof is substantially more challenging and highlights the power of Zhandry's recent compressed oracle technique (CRYPTO 2019). We further extend this result to establish post-quantum security of a non-interactive PoSW obtained by applying the Fiat-Shamir transform to Cohen and Pietrzak's efficient construction (EUROCRYPT 2018).

CC Apr 17, 2019
Approximating Cumulative Pebbling Cost is Unique Games Hard

Jeremiah Blocki, Seunghoon Lee, Samson Zhou

The cumulative pebbling complexity of a directed acyclic graph $G$ is defined as $\mathsf{cc}(G) = \min_P \sum_i |P_i|$, where the minimum is taken over all legal (parallel) black pebblings of $G$ and $|P_i|$ denotes the number of pebbles on the graph during round $i$. Intuitively, $\mathsf{cc}(G)$ captures the amortized Space-Time complexity of pebbling $m$ copies of $G$ in parallel. The cumulative pebbling complexity of a graph $G$ is of particular interest in the field of cryptography as $\mathsf{cc}(G)$ is tightly related to the amortized Area-Time complexity of the Data-Independent Memory-Hard Function (iMHF) $f_{G,H}$ [AS15] defined using a constant indegree directed acyclic graph (DAG) $G$ and a random oracle $H(\cdot)$. A secure iMHF should have amortized Space-Time complexity as high as possible, e.g., to deter a brute-force password attacker who wants to find $x$ such that $f_{G,H}(x) = h$. Thus, to analyze the (in)security of a candidate iMHF $f_{G,H}$, it is crucial to estimate the value $\mathsf{cc}(G)$ but currently, upper and lower bounds for leading iMHF candidates differ by several orders of magnitude. Blocki and Zhou recently showed that it is $\mathsf{NP}$-Hard to compute $\mathsf{cc}(G)$, but their techniques do not even rule out an efficient $(1+\varepsilon)$-approximation algorithm for any constant $\varepsilon>0$. We show that for any constant $c > 0$, it is Unique Games hard to approximate $\mathsf{cc}(G)$ to within a factor of $c$. (See the paper for the full abstract.)
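
To make the quantity $\sum_i |P_i|$ concrete, the sketch below evaluates the cost of one given pebbling and checks that it is legal. It assumes the usual rules of the parallel black pebbling game: the graph starts empty, a new pebble may be placed on a node only if all of its parents carried pebbles in the previous round, pebbles may be removed freely, and every sink must be pebbled in some round. The cost of any single legal pebbling is only an upper bound on $\mathsf{cc}(G)$; finding the minimum is the hard problem the abstract discusses. The function name `pebbling_cost` and the dictionary encoding of the graph are illustrative.

```python
def pebbling_cost(parents, sinks, rounds):
    """Return sum_i |P_i| for a legal parallel black pebbling, else raise.

    parents: dict mapping each node to an iterable of its parents
    sinks:   set of nodes that must be pebbled at some point
    rounds:  list of sets P_1, P_2, ..., with P_0 assumed empty
    """
    prev = set()
    pebbled_sinks = set()
    for i, current in enumerate(rounds, start=1):
        current = set(current)
        for v in current - prev:  # pebbles newly placed in round i
            if not set(parents.get(v, ())) <= prev:
                raise ValueError(f"round {i}: node {v} placed without its parents")
        pebbled_sinks |= current & set(sinks)
        prev = current
    if pebbled_sinks != set(sinks):
        raise ValueError("some sink was never pebbled")
    return sum(len(p) for p in rounds)


# Line graph 0 -> 1 -> 2 with sink 2.
line_parents = {0: [], 1: [0], 2: [1]}

frugal = [{0}, {1}, {2}]              # drop each pebble as soon as possible
greedy = [{0}, {0, 1}, {0, 1, 2}]     # never remove a pebble

print(pebbling_cost(line_parents, {2}, frugal))
print(pebbling_cost(line_parents, {2}, greedy))
```

Both strategies pebble the sink in three rounds, but the frugal one keeps a single pebble on the graph at a time, which illustrates why cumulative cost, rather than round count alone, is the measure of interest.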