Dongwook Lee

CV
h-index6
16papers
722citations
Novelty53%
AI Score61

16 Papers

COMP-PHMar 27, 2013
A Solution Accurate, Efficient and Stable Unsplit Staggered Mesh Scheme for Three Dimensional Magnetohydrodynamics

Dongwook Lee

In this paper, we extend the unsplit staggered mesh scheme (USM) for 2D magnetohydrodynamics (MHD) (Lee and Deane, 2009) to a full 3D MHD scheme. The scheme is a finite-volume Godunov method consisting of a constrained transport (CT) method and an efficient and accurate single-step, directionally unsplit multidimensional data reconstruction-evolution algorithm, which extends the original 2D corner transport upwind (CTU) method (Colella, 1990). We present two types of data reconstruction-evolution algorithms for 3D: (1) a reduced CTU scheme and (2) a full CTU scheme. The reduced 3D CTU scheme is a variant of a simple 3D extension of the 2D CTU method by Colella (1990) and is considered as a direct extension from the 2D USM scheme. The full 3D CTU scheme is our primary 3D solver which includes all multidimensional cross-derivative terms for stability. The latter method is logically analogous to the 3D unsplit CTU method by Saltzman. The major novelties in our algorithms are twofold. First, we extend the reduced CTU scheme to the full CTU scheme which is able to run with CFL numbers close to unity. Both methods utilize the transverse update technique developed in the 2D USM algorithm to account for transverse fluxes without solving intermediate Riemann problems, which in turn gives cost-effective 3D methods by reducing the total number of Riemann solves. The proposed algorithms are simple and efficient especially when including multidimensional MHD terms that maintain in-plane magnetic field dynamics. Second, we introduce a new CT scheme that makes use of proper upwind information in taking averages of electric fields.

CVAug 21, 2023
BackTrack: Robust template update via Backward Tracking of candidate template

Dongwook Lee, Wonjun Choi, Seohyung Lee et al.

Variations of target appearance such as deformations, illumination variance, occlusion, etc., are the major challenges of visual object tracking that negatively impact the performance of a tracker. An effective method to tackle these challenges is template update, which updates the template to reflect the change of appearance in the target object during tracking. However, with template updates, inadequate quality of new templates or inappropriate timing of updates may induce a model drift problem, which severely degrades the tracking performance. Here, we propose BackTrack, a robust and reliable method to quantify the confidence of the candidate template by backward tracking it on the past frames. Based on the confidence score of candidates from BackTrack, we can update the template with a reliable candidate at the right time while rejecting unreliable candidates. BackTrack is a generic template update scheme and is applicable to any template-based trackers. Extensive experiments on various tracking benchmarks verify the effectiveness of BackTrack over existing template update algorithms, as it achieves SOTA performance on various tracking benchmarks.

41.4CVMay 11Code
Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization

Yeongtak Oh, Dongwook Lee, Sangkwon Park et al.

While multimodal large language models have advanced across text, image, and audio, personalization research has remained primarily vision-language, with unified omnimodal benchmarking that jointly covers text, image, and audio still limited, and lacking the methodological rigor to account for absent-persona scenarios or systematic grounding studies. We introduce Omni-Persona, the first comprehensive benchmark for omnimodal personalization. We formalize the task as cross-modal routing over the \emph{Persona Modality Graph}, encompassing 4 task groups and 18 fine-grained tasks across ${\sim}750$ items. To rigorously diagnose grounding behavior, we propose \emph{Calibrated Accuracy ($\mathrm{Cal}$)}, which jointly rewards correct grounding and appropriate abstention, incorporating absent-persona queries within a unified evaluation framework. On our dedicated experiments, three diagnostic findings emerge: (i) open-source models show a consistent audio-vs-visual grounding gap that RLVR partially narrows via dense rule-based supervision; (ii) answerable recall and parameter scale are incomplete diagnostics, since strong recall can coexist with absent-persona hallucination and larger models do not always achieve higher $\mathrm{Cal}$, exposing calibration as a separate evaluation axis; and (iii) SFT is bounded by the difficulty of constructing annotated ground-truth supervision at scale, while RLVR generalizes more consistently through outcome-level verifiable feedback yet drifts toward conservative behavior and lower generation quality under our reward design. Omni-Persona thus serves as a diagnostic framework that surfaces the pitfalls of omnimodal personalization, guiding future post-training and reward design.

23.2CLApr 19
Still Between Us? Evaluating and Improving Voice Assistant Robustness to Third-Party Interruptions

Dongwook Lee, Eunwoo Song, Che Hyun Lee et al.

While recent Spoken Language Models (SLMs) have been actively deployed in real-world scenarios, they lack the capability to discern Third-Party Interruptions (TPI) from the primary user's ongoing flow, leaving them vulnerable to contextual failures. To bridge this gap, we introduce TPI-Train, a dataset of 88K instances designed with speaker-aware hard negatives to enforce acoustic cue prioritization for interruption handling, and TPI-Bench, a comprehensive evaluation framework designed to rigorously measure the interruption-handling strategy and precise speaker discrimination in deceptive contexts. Experiments demonstrate that our dataset design mitigates semantic shortcut learning-a critical pitfall where models exploit semantic context while neglecting acoustic signals essential for discerning speaker changes. We believe our work establishes a foundational resource for overcoming text-dominated unimodal reliance in SLMs, paving the way for more robust multi-party spoken interaction. The code for the framework is publicly available at https://tpi-va.github.io

LGApr 6, 2024Code
Compositional Conservatism: A Transductive Approach in Offline Reinforcement Learning

Yeda Song, Dongwook Lee, Gunhee Kim

Offline reinforcement learning (RL) is a compelling framework for learning optimal policies from past experiences without additional interaction with the environment. Nevertheless, offline RL inevitably faces the problem of distributional shifts, where the states and actions encountered during policy execution may not be in the training dataset distribution. A common solution involves incorporating conservatism into the policy or the value function to safeguard against uncertainties and unknowns. In this work, we focus on achieving the same objectives of conservatism but from a different perspective. We propose COmpositional COnservatism with Anchor-seeking (COCOA) for offline RL, an approach that pursues conservatism in a compositional manner on top of the transductive reparameterization (Netanyahu et al., 2023), which decomposes the input variable (the state in our case) into an anchor and its difference from the original input. Our COCOA seeks both in-distribution anchors and differences by utilizing the learned reverse dynamics model, encouraging conservatism in the compositional input space for the policy or value function. Such compositional conservatism is independent of and agnostic to the prevalent behavioral conservatism in offline RL. We apply COCOA to four state-of-the-art offline RL algorithms and evaluate them on the D4RL benchmark, where COCOA generally improves the performance of each algorithm. The code is available at https://github.com/runamu/compositional-conservatism.

CVNov 3, 2025
Diffusion Transformer meets Multi-level Wavelet Spectrum for Single Image Super-Resolution

Peng Du, Hui Li, Han Xu et al.

Discrete Wavelet Transform (DWT) has been widely explored to enhance the performance of image superresolution (SR). Despite some DWT-based methods improving SR by capturing fine-grained frequency signals, most existing approaches neglect the interrelations among multiscale frequency sub-bands, resulting in inconsistencies and unnatural artifacts in the reconstructed images. To address this challenge, we propose a Diffusion Transformer model based on image Wavelet spectra for SR (DTWSR). DTWSR incorporates the superiority of diffusion models and transformers to capture the interrelations among multiscale frequency sub-bands, leading to a more consistence and realistic SR image. Specifically, we use a Multi-level Discrete Wavelet Transform to decompose images into wavelet spectra. A pyramid tokenization method is proposed which embeds the spectra into a sequence of tokens for transformer model, facilitating to capture features from both spatial and frequency domain. A dual-decoder is designed elaborately to handle the distinct variances in low-frequency and high-frequency sub-bands, without omitting their alignment in image generation. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness of our method, with high performance on both perception quality and fidelity.

CVMar 6, 2024
CMDA: Cross-Modal and Domain Adversarial Adaptation for LiDAR-Based 3D Object Detection

Gyusam Chang, Wonseok Roh, Sujin Jang et al.

Recent LiDAR-based 3D Object Detection (3DOD) methods show promising results, but they often do not generalize well to target domains outside the source (or training) data distribution. To reduce such domain gaps and thus to make 3DOD models more generalizable, we introduce a novel unsupervised domain adaptation (UDA) method, called CMDA, which (i) leverages visual semantic cues from an image modality (i.e., camera images) as an effective semantic bridge to close the domain gap in the cross-modal Bird's Eye View (BEV) representations. Further, (ii) we also introduce a self-training-based learning strategy, wherein a model is adversarially trained to generate domain-invariant features, which disrupt the discrimination of whether a feature instance comes from a source or an unseen target domain. Overall, our CMDA framework guides the 3DOD model to generate highly informative and domain-adaptive features for novel data distributions. In our extensive experiments with large-scale benchmarks, such as nuScenes, Waymo, and KITTI, those mentioned above provide significant performance gains for UDA tasks, achieving state-of-the-art performance.

CVOct 29, 2024
Unified Domain Generalization and Adaptation for Multi-View 3D Object Detection

Gyusam Chang, Jiwon Lee, Donghyun Kim et al.

Recent advances in 3D object detection leveraging multi-view cameras have demonstrated their practical and economical value in various challenging vision tasks. However, typical supervised learning approaches face challenges in achieving satisfactory adaptation toward unseen and unlabeled target datasets (\ie, direct transfer) due to the inevitable geometric misalignment between the source and target domains. In practice, we also encounter constraints on resources for training models and collecting annotations for the successful deployment of 3D object detectors. In this paper, we propose Unified Domain Generalization and Adaptation (UDGA), a practical solution to mitigate those drawbacks. We first propose Multi-view Overlap Depth Constraint that leverages the strong association between multi-view, significantly alleviating geometric gaps due to perspective view changes. Then, we present a Label-Efficient Domain Adaptation approach to handle unfamiliar targets with significantly fewer amounts of labels (\ie, 1$\%$ and 5$\%)$, while preserving well-defined source knowledge for training efficiency. Overall, UDGA framework enables stable detection performance in both source and target domains, effectively bridging inevitable domain gaps, while demanding fewer annotations. We demonstrate the robustness of UDGA with large-scale benchmarks: nuScenes, Lyft, and Waymo, where our framework outperforms the current state-of-the-art methods.

CVMar 19, 2025
3D Occupancy Prediction with Low-Resolution Queries via Prototype-aware View Transformation

Gyeongrok Oh, Sungjune Kim, Heeju Ko et al.

The resolution of voxel queries significantly influences the quality of view transformation in camera-based 3D occupancy prediction. However, computational constraints and the practical necessity for real-time deployment require smaller query resolutions, which inevitably leads to an information loss. Therefore, it is essential to encode and preserve rich visual details within limited query sizes while ensuring a comprehensive representation of 3D occupancy. To this end, we introduce ProtoOcc, a novel occupancy network that leverages prototypes of clustered image segments in view transformation to enhance low-resolution context. In particular, the mapping of 2D prototypes onto 3D voxel queries encodes high-level visual geometries and complements the loss of spatial information from reduced query resolutions. Additionally, we design a multi-perspective decoding strategy to efficiently disentangle the densely compressed visual cues into a high-dimensional 3D occupancy scene. Experimental results on both Occ3D and SemanticKITTI benchmarks demonstrate the effectiveness of the proposed method, showing clear improvements over the baselines. More importantly, ProtoOcc achieves competitive performance against the baselines even with 75\% reduced voxel resolution.

CVOct 16, 2025
CALM-Net: Curvature-Aware LiDAR Point Cloud-based Multi-Branch Neural Network for Vehicle Re-Identification

Dongwook Lee, Sol Han, Jinwhan Kim

This paper presents CALM-Net, a curvature-aware LiDAR point cloud-based multi-branch neural network for vehicle re-identification. The proposed model addresses the challenge of learning discriminative and complementary features from three-dimensional point clouds to distinguish between vehicles. CALM-Net employs a multi-branch architecture that integrates edge convolution, point attention, and a curvature embedding that characterizes local surface variation in point clouds. By combining these mechanisms, the model learns richer geometric and contextual features that are well suited for the re-identification task. Experimental evaluation on the large-scale nuScenes dataset demonstrates that CALM-Net achieves a mean re-identification accuracy improvement of approximately 1.97\% points compared with the strongest baseline in our study. The results confirms the effectiveness of incorporating curvature information into deep learning architectures and highlight the benefit of multi-branch feature learning for LiDAR point cloud-based vehicle re-identification.

AIJul 10, 2025
MoSE: Skill-by-Skill Mixture-of-Experts Learning for Embodied Autonomous Machines

Lu Xu, Jiaqian Yu, Xiongfeng Peng et al.

To meet the growing demand for smarter, faster, and more efficient embodied AI solutions, we introduce a novel Mixture-of-Expert (MoE) method that significantly boosts reasoning and learning efficiency for embodied autonomous systems. General MoE models demand extensive training data and complex optimization, which limits their applicability in embodied AI such as autonomous driving (AD) and robotic manipulation. In this work, we propose a skill-oriented MoE called MoSE, which mimics the human learning and reasoning process skill-by-skill, step-by-step. We introduce a skill-oriented routing mechanism that begins with defining and annotating specific skills, enabling experts to identify the necessary competencies for various scenarios and reasoning tasks, thereby facilitating skill-by-skill learning. To better align with multi-step planning in human reasoning and in end-to-end driving models, we build a hierarchical skill dataset and pretrain the router to encourage the model to think step-by-step. Unlike other multi-round dialogues, MoSE integrates valuable auxiliary tasks (e.g. perception-prediction-planning for AD, and high-level and low-level planning for robots) in one single forward process without introducing any extra computational cost. With less than 3B sparsely activated parameters, our model effectively grows more diverse expertise and outperforms models on both AD corner-case reasoning tasks and robot reasoning tasks with less than 40% of the parameters.

CVMay 7, 2020
A Gaussian Process Upsampling Model for Improvements in Optical Character Recognition

Steven I Reeves, Dongwook Lee, Anurag Singh et al.

Optical Character Recognition and extraction is a key tool in the automatic evaluation of documents in a financial context. However, the image data provided to automated systems can have unreliable quality, and can be inherently low-resolution or downsampled and compressed by a transmitting program. In this paper, we illustrate the efficacy of a Gaussian Process upsampling model for the purposes of improving OCR and extraction through upsampling low resolution documents.

IVMay 10, 2019
Which Contrast Does Matter? Towards a Deep Understanding of MR Contrast using Collaborative GAN

Dongwook Lee, Won-Jin Moon, Jong Chul Ye

Thanks to the recent success of generative adversarial network (GAN) for image synthesis, there are many exciting GAN approaches that successfully synthesize MR image contrast from other images with different contrasts. These approaches are potentially important for image imputation problems, where complete set of data is often difficult to obtain and image synthesis is one of the key solutions for handling the missing data problem. Unfortunately, the lack of the scalability of the existing GAN-based image translation approaches poses a fundamental challenge to understand the nature of the MR contrast imputation problem: which contrast does matter? Here, we present a systematic approach using Collaborative Generative Adversarial Networks (CollaGAN), which enable the learning of the joint image manifold of multiple MR contrasts to investigate which contrasts are essential. Our experimental results showed that the exogenous contrast from contrast agents is not replaceable, but other endogenous contrast such as T1, T2, etc can be synthesized from other contrast. These findings may give important guidance to the acquisition protocol design for MR in real clinical environment.

CVJan 28, 2019
CollaGAN : Collaborative GAN for Missing Image Data Imputation

Dongwook Lee, Junyoung Kim, Won-Jin Moon et al.

In many applications requiring multiple inputs to obtain a desired output, if any of the input data is missing, it often introduces large amounts of bias. Although many techniques have been developed for imputing missing data, the image imputation is still difficult due to complicated nature of natural images. To address this problem, here we proposed a novel framework for missing image data imputation, called Collaborative Generative Adversarial Network (CollaGAN). CollaGAN converts an image imputation problem to a multi-domain images-to-image translation task so that a single generator and discriminator network can successfully estimate the missing data using the remaining clean data set. We demonstrate that CollaGAN produces the images with a higher visual quality compared to the existing competing approaches in various image imputation tasks.

CVApr 2, 2018
Deep Residual Learning for Accelerated MRI using Magnitude and Phase Networks

Dongwook Lee, Jaejun Yoo, Sungho Tak et al.

Accelerated magnetic resonance (MR) scan acquisition with compressed sensing (CS) and parallel imaging is a powerful method to reduce MR imaging scan time. However, many reconstruction algorithms have high computational costs. To address this, we investigate deep residual learning networks to remove aliasing artifacts from artifact corrupted images. The proposed deep residual learning networks are composed of magnitude and phase networks that are separately trained. If both phase and magnitude information are available, the proposed algorithm can work as an iterative k-space interpolation algorithm using framelet representation. When only magnitude data is available, the proposed approach works as an image domain post-processing algorithm. Even with strong coherent aliasing artifacts, the proposed network successfully learned and removed the aliasing artifacts, whereas current parallel and CS reconstruction methods were unable to remove these artifacts. Comparisons using single and multiple coil show that the proposed residual network provides good reconstruction results with orders of magnitude faster computational time than existing compressed sensing methods. The proposed deep learning framework may have a great potential for accelerated MR reconstruction by generating accurate results immediately.

CVMar 3, 2017
Deep artifact learning for compressed sensing and parallel MRI

Dongwook Lee, Jaejun Yoo, Jong Chul Ye

Purpose: Compressed sensing MRI (CS-MRI) from single and parallel coils is one of the powerful ways to reduce the scan time of MR imaging with performance guarantee. However, the computational costs are usually expensive. This paper aims to propose a computationally fast and accurate deep learning algorithm for the reconstruction of MR images from highly down-sampled k-space data. Theory: Based on the topological analysis, we show that the data manifold of the aliasing artifact is easier to learn from a uniform subsampling pattern with additional low-frequency k-space data. Thus, we develop deep aliasing artifact learning networks for the magnitude and phase images to estimate and remove the aliasing artifacts from highly accelerated MR acquisition. Methods: The aliasing artifacts are directly estimated from the distorted magnitude and phase images reconstructed from subsampled k-space data so that we can get an aliasing-free images by subtracting the estimated aliasing artifact from corrupted inputs. Moreover, to deal with the globally distributed aliasing artifact, we develop a multi-scale deep neural network with a large receptive field. Results: The experimental results confirm that the proposed deep artifact learning network effectively estimates and removes the aliasing artifacts. Compared to existing CS methods from single and multi-coli data, the proposed network shows minimal errors by removing the coherent aliasing artifacts. Furthermore, the computational time is by order of magnitude faster. Conclusion: As the proposed deep artifact learning network immediately generates accurate reconstruction, it has great potential for clinical applications.