CVMar 21, 2023Code
Fix the Noise: Disentangling Source Feature for Controllable Domain TranslationDongyeun Lee, Jae Young Lee, Doyeon Kim et al.
Recent studies show strong generative performance in domain translation especially by using transfer learning techniques on the unconditional generator. However, the control between different domain features using a single model is still challenging. Existing methods often require additional models, which is computationally demanding and leads to unsatisfactory visual quality. In addition, they have restricted control steps, which prevents a smooth transition. In this paper, we propose a new approach for high-quality domain translation with better controllability. The key idea is to preserve source features within a disentangled subspace of a target feature space. This allows our method to smoothly control the degree to which it preserves source features while generating images from an entirely new domain using only a single model. Our extensive experiments show that the proposed method can produce more consistent and realistic images than previous works and maintain precise controllability over different levels of transformation. The code is available at https://github.com/LeeDongYeun/FixNoise.
CLApr 1, 2022Code
Feature Structure Distillation with Centered Kernel Alignment in BERT TransferringHee-Jun Jung, Doyeon Kim, Seung-Hoon Na et al.
Knowledge distillation is an approach to transfer information on representations from a teacher to a student by reducing their difference. A challenge of this approach is to reduce the flexibility of the student's representations inducing inaccurate learning of the teacher's knowledge. To resolve it in transferring, we investigate distillation of structures of representations specified to three types: intra-feature, local inter-feature, global inter-feature structures. To transfer them, we introduce feature structure distillation methods based on the Centered Kernel Alignment, which assigns a consistent value to similar features structures and reveals more informative relations. In particular, a memory-augmented transfer method with clustering is implemented for the global structures. The methods are empirically analyzed on the nine tasks for language understanding of the GLUE dataset with Bidirectional Encoder Representations from Transformers (BERT), which is a representative neural language model. In the results, the proposed methods effectively transfer the three types of structures and improve performance compared to state-of-the-art distillation methods. Indeed, the code for the methods is available in https://github.com/maroo-sky/FSD.
ASJun 30, 2022
Learning Audio-Text Agreement for Open-vocabulary Keyword SpottingHyeon-Kyeong Shin, Hyewon Han, Doyeon Kim et al.
In this paper, we propose a novel end-to-end user-defined keyword spotting method that utilizes linguistically corresponding patterns between speech and text sequences. Unlike previous approaches requiring speech keyword enrollment, our method compares input queries with an enrolled text keyword sequence. To place the audio and text representations within a common latent space, we adopt an attention-based cross-modal matching approach that is trained in an end-to-end manner with monotonic matching loss and keyword classification loss. We also utilize a de-noising loss for the acoustic embedding network to improve robustness in noisy environments. Additionally, we introduce the LibriPhrase dataset, a new short-phrase dataset based on LibriSpeech for efficiently training keyword spotting models. Our proposed method achieves competitive results on various evaluation sets compared to other single-modal and cross-modal baselines.
CVApr 29, 2022
Fix the Noise: Disentangling Source Feature for Transfer Learning of StyleGANDongyeun Lee, Jae Young Lee, Doyeon Kim et al.
Transfer learning of StyleGAN has recently shown great potential to solve diverse tasks, especially in domain translation. Previous methods utilized a source model by swapping or freezing weights during transfer learning, however, they have limitations on visual quality and controlling source features. In other words, they require additional models that are computationally demanding and have restricted control steps that prevent a smooth transition. In this paper, we propose a new approach to overcome these limitations. Instead of swapping or freezing, we introduce a simple feature matching loss to improve generation quality. In addition, to control the degree of source features, we train a target model with the proposed strategy, FixNoise, to preserve the source features only in a disentangled subspace of a target feature space. Owing to the disentangled feature space, our method can smoothly control the degree of the source features in a single model. Extensive experiments demonstrate that the proposed method can generate more consistent and realistic images than previous works.
CVApr 23, 2021Code
TricubeNet: 2D Kernel-Based Object Representation for Weakly-Occluded Oriented Object DetectionBeomyoung Kim, Janghyeon Lee, Sihaeng Lee et al.
We present a novel approach for oriented object detection, named TricubeNet, which localizes oriented objects using visual cues ($i.e.,$ heatmap) instead of oriented box offsets regression. We represent each object as a 2D Tricube kernel and extract bounding boxes using simple image-processing algorithms. Our approach is able to (1) obtain well-arranged boxes from visual cues, (2) solve the angle discontinuity problem, and (3) can save computational complexity due to our anchor-free modeling. To further boost the performance, we propose some effective techniques for size-invariant loss, reducing false detections, extracting rotation-invariant features, and heatmap refinement. To demonstrate the effectiveness of our TricubeNet, we experiment on various tasks for weakly-occluded oriented object detection: detection in an aerial image, densely packed object image, and text image. The extensive experimental results show that our TricubeNet is quite effective for oriented object detection. Code is available at https://github.com/qjadud1994/TricubeNet.
AIFeb 19
IntentCUA: Learning Intent-level Representations for Skill Abstraction and Multi-Agent Planning in Computer-Use AgentsSeoyoung Lee, Seobin Yoon, Seongbeen Lee et al.
Computer-use agents operate over long horizons under noisy perception, multi-window contexts, evolving environment states. Existing approaches, from RL-based planners to trajectory retrieval, often drift from user intent and repeatedly solve routine subproblems, leading to error accumulation and inefficiency. We present IntentCUA, a multi-agent computer-use framework designed to stabilize long-horizon execution through intent-aligned plan memory. A Planner, Plan-Optimizer, and Critic coordinate over shared memory that abstracts raw interaction traces into multi-view intent representations and reusable skills. At runtime, intent prototypes retrieve subgroup-aligned skills and inject them into partial plans, reducing redundant re-planning and mitigating error propagation across desktop applications. In end-to-end evaluations, IntentCUA achieved a 74.83% task success rate with a Step Efficiency Ratio of 0.91, outperforming RL-based and trajectory-centric baselines. Ablations show that multi-view intent abstraction and shared plan memory jointly improve execution stability, with the cooperative multi-agent loop providing the largest gains on long-horizon tasks. These results highlight that system-level intent abstraction and memory-grounded coordination are key to reliable and efficient desktop automation in large, dynamic environments.
LGFeb 15
MAGE: All-[MASK] Block Already Knows Where to Look in Diffusion LLMOmin Kwon, Yeonjae Kim, Doyeon Kim et al.
Block diffusion LLMs are emerging as a promising next paradigm for language generation, but their use of KV caching makes memory access a dominant bottleneck in long-context settings. While dynamic sparse attention has been actively explored, existing methods designed for autoregressive LLMs rely on approximate importance estimation and perform poorly when adapted to block diffusion. This work identifies a key opportunity unique to block diffusion: attention at the first All-[MASK] denoising step reliably predicts important KV entries and budget requirements, enabling MAGE to perform a single exact attention pass per block and reuse it for training-free sparse denoising. Across long-context benchmarks including LongBench and Needle-in-a-Haystack, MAGE achieves near-lossless accuracy with a fraction of the KV budget while delivering up to 3-4x end-to-end speedup, consistently outperforming AR-oriented sparse attention baselines. A lightweight fine-tuning strategy further strengthens [MASK]-guided patterns with minimal cost, requiring only a few hours of training on a single NVIDIA H100 GPU for both 1.5B and 7B models.
AIJun 17, 2025
Doppelganger Method: Breaking Role Consistency in LLM Agent via Prompt-based Transferable Adversarial AttackDaewon Kang, YeongHwan Shin, Doyeon Kim et al.
Since the advent of large language models, prompt engineering now enables the rapid, low-effort creation of diverse autonomous agents that are already in widespread use. Yet this convenience raises urgent concerns about the safety, robustness, and behavioral consistency of the underlying prompts, along with the pressing challenge of preventing those prompts from being exposed to user's attempts. In this paper, we propose the ''Doppelganger method'' to demonstrate the risk of an agent being hijacked, thereby exposing system instructions and internal information. Next, we define the ''Prompt Alignment Collapse under Adversarial Transfer (PACAT)'' level to evaluate the vulnerability to this adversarial transfer attack. We also propose a ''Caution for Adversarial Transfer (CAT)'' prompt to counter the Doppelganger method. The experimental results demonstrate that the Doppelganger method can compromise the agent's consistency and expose its internal information. In contrast, CAT prompts enable effective defense against this adversarial attack.
CVMay 30, 2023
Context-Preserving Two-Stage Video Domain Translation for Portrait StylizationDoyeon Kim, Eunji Ko, Hyunsu Kim et al.
Portrait stylization, which translates a real human face image into an artistically stylized image, has attracted considerable interest and many prior works have shown impressive quality in recent years. However, despite their remarkable performances in the image-level translation tasks, prior methods show unsatisfactory results when they are applied to the video domain. To address the issue, we propose a novel two-stage video translation framework with an objective function which enforces a model to generate a temporally coherent stylized video while preserving context in the source video. Furthermore, our model runs in real-time with the latency of 0.011 seconds per frame and requires only 5.6M parameters, and thus is widely applicable to practical real-world applications.
SDFeb 24, 2022
Phase Continuity: Learning Derivatives of Phase Spectrum for Speech EnhancementDoyeon Kim, Hyewon Han, Hyeon-Kyeong Shin et al.
Modern neural speech enhancement models usually include various forms of phase information in their training loss terms, either explicitly or implicitly. However, these loss terms are typically designed to reduce the distortion of phase spectrum values at specific frequencies, which ensures they do not significantly affect the quality of the enhanced speech. In this paper, we propose an effective phase reconstruction strategy for neural speech enhancement that can operate in noisy environments. Specifically, we introduce a phase continuity loss that considers relative phase variations across the time and frequency axes. By including this phase continuity loss in a state-of-the-art neural speech enhancement system trained with reconstruction loss and a number of magnitude spectral losses, we show that our proposed method further improves the quality of enhanced speech signals over the baseline, especially when training is done jointly with a magnitude spectrum loss.
CVFeb 7, 2022
FrePGAN: Robust Deepfake Detection Using Frequency-level PerturbationsYonghyun Jeong, Doyeon Kim, Youngmin Ro et al.
Various deepfake detectors have been proposed, but challenges still exist to detect images of unknown categories or GAN models outside of the training settings. Such issues arise from the overfitting issue, which we discover from our own analysis and the previous studies to originate from the frequency-level artifacts in generated images. We find that ignoring the frequency-level artifacts can improve the detector's generalization across various GAN models, but it can reduce the model's performance for the trained GAN models. Thus, we design a framework to generalize the deepfake detector for both the known and unseen GAN models. Our framework generates the frequency-level perturbation maps to make the generated images indistinguishable from the real images. By updating the deepfake detector along with the training of the perturbation generator, our model is trained to detect the frequency-level artifacts at the initial iterations and consider the image-level irregularities at the last iterations. For experiments, we design new test scenarios varying from the training settings in GAN models, color manipulations, and object categories. Numerous experiments validate the state-of-the-art performance of our deepfake detector.
CVJan 19, 2022
Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepthDoyeon Kim, Woonghyun Ka, Pyungwhan Ahn et al.
Depth estimation from a single image is an important task that can be applied to various fields in computer vision, and has grown rapidly with the development of convolutional neural networks. In this paper, we propose a novel structure and training strategy for monocular depth estimation to further improve the prediction accuracy of the network. We deploy a hierarchical transformer encoder to capture and convey the global context, and design a lightweight yet powerful decoder to generate an estimated depth map while considering local connectivity. By constructing connected paths between multi-scale local features and the global decoding stream with our proposed selective feature fusion module, the network can integrate both representations and recover fine details. In addition, the proposed decoder shows better performance than the previously proposed decoders, with considerably less computational complexity. Furthermore, we improve the depth-specific augmentation method by utilizing an important observation in depth estimation to enhance the model. Our network achieves state-of-the-art performance over the challenging depth dataset NYU Depth V2. Extensive experiments have been conducted to validate and show the effectiveness of the proposed approach. Finally, our model shows better generalisation ability and robustness than other comparative models.
CVDec 9, 2021
Progressive Seed Generation Auto-encoder for Unsupervised Point Cloud LearningJuyoung Yang, Pyunghwan Ahn, Doyeon Kim et al.
With the development of 3D scanning technologies, 3D vision tasks have become a popular research area. Owing to the large amount of data acquired by sensors, unsupervised learning is essential for understanding and utilizing point clouds without an expensive annotation process. In this paper, we propose a novel framework and an effective auto-encoder architecture named "PSG-Net" for reconstruction-based learning of point clouds. Unlike existing studies that used fixed or random 2D points, our framework generates input-dependent point-wise features for the latent point set. PSG-Net uses the encoded input to produce point-wise features through the seed generation module and extracts richer features in multiple stages with gradually increasing resolution by applying the seed feature propagation module progressively. We prove the effectiveness of PSG-Net experimentally; PSG-Net shows state-of-the-art performances in point cloud reconstruction and unsupervised classification, and achieves comparable performance to counterpart methods in supervised completion.
HCNov 19, 2021
A Worker-Task Specialization Model for Crowdsourcing: Efficient Inference and Fundamental LimitsDoyeon Kim, Jeonghwan Lee, Hye Won Chung
Crowdsourcing system has emerged as an effective platform for labeling data with relatively low cost by using non-expert workers. Inferring correct labels from multiple noisy answers on data, however, has been a challenging problem, since the quality of the answers varies widely across tasks and workers. Many existing works have assumed that there is a fixed ordering of workers in terms of their skill levels, and focused on estimating worker skills to aggregate the answers from workers with different weights. In practice, however, the worker skill changes widely across tasks, especially when the tasks are heterogeneous. In this paper, we consider a new model, called $d$-type specialization model, in which each task and worker has its own (unknown) type and the reliability of each worker can vary in the type of a given task and that of a worker. We allow that the number $d$ of types can scale in the number of tasks. In this model, we characterize the optimal sample complexity to correctly infer the labels within any given accuracy, and propose label inference algorithms achieving the order-wise optimal limit even when the types of tasks or those of workers are unknown. We conduct experiments both on synthetic and real datasets, and show that our algorithm outperforms the existing algorithms developed based on more strict model assumptions.
CVNov 12, 2021
Self-supervised GAN DetectorYonghyun Jeong, Doyeon Kim, Pyounggeon Kim et al.
Although the recent advancement in generative models brings diverse advantages to society, it can also be abused with malicious purposes, such as fraud, defamation, and fake news. To prevent such cases, vigorous research is conducted to distinguish the generated images from the real images, but challenges still remain to distinguish the unseen generated images outside of the training settings. Such limitations occur due to data dependency arising from the model's overfitting issue to the training data generated by specific GANs. To overcome this issue, we adopt a self-supervised scheme to propose a novel framework. Our proposed method is composed of the artificial fingerprint generator reconstructing the high-quality artificial fingerprints of GAN images for detailed analysis, and the GAN detector distinguishing GAN images by learning the reconstructed artificial fingerprints. To improve the generalization of the artificial fingerprint generator, we build multiple autoencoders with different numbers of upconvolution layers. With numerous ablation studies, the robust generalization of our method is validated by outperforming the generalization of the previous state-of-the-art algorithms, even without utilizing the GAN images of the training dataset.
CVOct 6, 2021
MToFNet: Object Anti-Spoofing with Mobile Time-of-Flight DataYonghyun Jeong, Doyeon Kim, Jaehyeon Lee et al.
In online markets, sellers can maliciously recapture others' images on display screens to utilize as spoof images, which can be challenging to distinguish in human eyes. To prevent such harm, we propose an anti-spoofing method using the paired rgb images and depth maps provided by the mobile camera with a Time-of-Fight sensor. When images are recaptured on display screens, various patterns differing by the screens as known as the moiré patterns can be also captured in spoof images. These patterns lead the anti-spoofing model to be overfitted and unable to detect spoof images recaptured on unseen media. To avoid the issue, we build a novel representation model composed of two embedding models, which can be trained without considering the recaptured images. Also, we newly introduce mToF dataset, the largest and most diverse object anti-spoofing dataset, and the first to utilize ToF data. Experimental results confirm that our model achieves robust generalization even across unseen domains.
CVOct 2, 2021
FICGAN: Facial Identity Controllable GAN for De-identificationYonghyun Jeong, Jooyoung Choi, Sungwon Kim et al.
In this work, we present Facial Identity Controllable GAN (FICGAN) for not only generating high-quality de-identified face images with ensured privacy protection, but also detailed controllability on attribute preservation for enhanced data utility. We tackle the less-explored yet desired functionality in face de-identification based on the two factors. First, we focus on the challenging issue to obtain a high level of privacy protection in the de-identification task while uncompromising the image quality. Second, we analyze the facial attributes related to identity and non-identity and explore the trade-off between the degree of face de-identification and preservation of the source attributes for enhanced data utility. Based on the analysis, we develop Facial Identity Controllable GAN (FICGAN), an autoencoder-based conditional generative model that learns to disentangle the identity attributes from non-identity attributes on a face image. By applying the manifold k-same algorithm to satisfy k-anonymity for strengthened security, our method achieves enhanced privacy protection in de-identified face images. Numerous experiments demonstrate that our model outperforms others in various scenarios of face de-identification.
CVAug 16, 2021
BiHPF: Bilateral High-Pass Filters for Robust Deepfake DetectionYonghyun Jeong, Doyeon Kim, Seungjai Min et al.
The advancement in numerous generative models has a two-fold effect: a simple and easy generation of realistic synthesized images, but also an increased risk of malicious abuse of those images. Thus, it is important to develop a generalized detector for synthesized images of any GAN model or object category, including those unseen during the training phase. However, the conventional methods heavily depend on the training settings, which cause a dramatic decline in performance when tested with unknown domains. To resolve the issue and obtain a generalized detection ability, we propose Bilateral High-Pass Filters (BiHPF), which amplify the effect of the frequency-level artifacts that are known to be found in the synthesized images of generative models. Numerous experimental results validate that our method outperforms other state-of-the-art methods, even when tested with unseen domains.
CVSep 4, 2020
TiVGAN: Text to Image to Video Generation with Step-by-Step Evolutionary GeneratorDoyeon Kim, Donggyu Joo, Junmo Kim
Advances in technology have led to the development of methods that can create desired visual multimedia. In particular, image generation using deep learning has been extensively studied across diverse fields. In comparison, video generation, especially on conditional inputs, remains a challenging and less explored area. To narrow this gap, we aim to train our model to produce a video corresponding to a given text description. We propose a novel training framework, Text-to-Image-to-Video Generative Adversarial Network (TiVGAN), which evolves frame-by-frame and finally produces a full-length video. In the first phase, we focus on creating a high-quality single video frame while learning the relationship between the text and an image. As the steps proceed, our model is trained gradually on more number of consecutive frames.This step-by-step learning process helps stabilize the training and enables the creation of high-resolution video based on conditional text descriptions. Qualitative and quantitative experimental results on various datasets demonstrate the effectiveness of the proposed method.
GEO-PHJul 18, 2020
Sequencing seismograms: A panoptic view of scattering in the core-mantle boundary regionDoyeon Kim, Vedran Lekic, Brice Ménard et al.
Scattering of seismic waves can reveal subsurface structures but usually in a piecemeal way focused on specific target areas. We used a manifold learning algorithm called "the Sequencer" to simultaneously analyze thousands of seismograms of waves diffracting along the core-mantle boundary and obtain a panoptic view of scattering across the Pacific region. In nearly half of the diffracting waveforms, we detected seismic waves scattered by three-dimensional structures near the core-mantle boundary. The prevalence of these scattered arrivals shows that the region hosts pervasive lateral heterogeneity. Our analysis revealed loud signals due to a plume root beneath Hawaii and a previously unrecognized ultralow-velocity zone beneath the Marquesas Islands. These observations illustrate how approaches flexible enough to detect robust patterns with little to no user supervision can reveal distinctive insights into the deep Earth.
HCMar 21, 2020
Crowdsourced Labeling for Worker-Task Specialization ModelDoyeon Kim, Hye Won Chung
We consider crowdsourced labeling under a $d$-type worker-task specialization model, where each worker and task is associated with one particular type among a finite set of types and a worker provides a more reliable answer to tasks of the matched type than to tasks of unmatched types. We design an inference algorithm that recovers binary task labels (up to any given recovery accuracy) by using worker clustering, worker skill estimation and weighted majority voting. The designed inference algorithm does not require any information about worker/task types, and achieves any targeted recovery accuracy with the best known performance (minimum number of queries per task).
ITSep 4, 2018
Parity Queries for Binary ClassificationHye Won Chung, Ji Oon Lee, Doyeon Kim et al.
Consider a query-based data acquisition problem that aims to recover the values of $k$ binary variables from parity (XOR) measurements of chosen subsets of the variables. Assume the response model where only a randomly selected subset of the measurements is received. We propose a method for designing a sequence of queries so that the variables can be identified with high probability using as few ($n$) measurements as possible. We define the query difficulty $\bar{d}$ as the average size of the query subsets and the sample complexity $n$ as the minimum number of measurements required to attain a given recovery accuracy. We obtain fundamental trade-offs between recovery accuracy, query difficulty, and sample complexity. In particular, the necessary and sufficient sample complexity required for recovering all $k$ variables with high probability is $n = c_0 \max\{k, (k \log k)/\bar{d}\}$ and the sample complexity for recovering a fixed proportion $(1-δ)k$ of the variables for $δ=o(1)$ is $n = c_1\max\{k, (k \log(1/δ))/\bar{d}\}$, where $c_0, c_1>0$.
CVApr 20, 2018
Generating a Fusion Image: One's Identity and Another's ShapeDonggyu Joo, Doyeon Kim, Junmo Kim
Generating a novel image by manipulating two input images is an interesting research problem in the study of generative adversarial networks (GANs). We propose a new GAN-based network that generates a fusion image with the identity of input image x and the shape of input image y. Our network can simultaneously train on more than two image datasets in an unsupervised manner. We define an identity loss LI to catch the identity of image x and a shape loss LS to get the shape of y. In addition, we propose a novel training method called Min-Patch training to focus the generator on crucial parts of an image, rather than its entirety. We show qualitative results on the VGG Youtube Pose dataset, Eye dataset (MPIIGaze and UnityEyes), and the Photo-Sketch-Cartoon dataset.