Dongki Jung

CV
h-index11
15papers
85citations
Novelty51%
AI Score54

15 Papers

CVMar 27, 2023
TMO: Textured Mesh Acquisition of Objects with a Mobile Device by using Differentiable Rendering

Jaehoon Choi, Dongki Jung, Taejae Lee et al.

We present a new pipeline for acquiring a textured mesh in the wild with a single smartphone which offers access to images, depth maps, and valid poses. Our method first introduces an RGBD-aided structure from motion, which can yield filtered depth maps and refines camera poses guided by corresponding depth. Then, we adopt the neural implicit surface reconstruction method, which allows for high-quality mesh and develops a new training process for applying a regularization provided by classical multi-view stereo methods. Moreover, we apply a differentiable rendering to fine-tune incomplete texture maps and generate textures which are perceptually closer to the original scene. Our pipeline can be applied to any common objects in the real world without the need for either in-the-lab environments or accurate mask images. We demonstrate results of captured objects with complex shapes and validate our method numerically against existing 3D reconstruction and texture mapping methods.

CVMar 10, 2022
SelfTune: Metrically Scaled Monocular Depth Estimation through Self-Supervised Learning

Jaehoon Choi, Dongki Jung, Yonghan Lee et al.

Monocular depth estimation in the wild inherently predicts depth up to an unknown scale. To resolve scale ambiguity issue, we present a learning algorithm that leverages monocular simultaneous localization and mapping (SLAM) with proprioceptive sensors. Such monocular SLAM systems can provide metrically scaled camera poses. Given these metric poses and monocular sequences, we propose a self-supervised learning method for the pre-trained supervised monocular depth networks to enable metrically scaled depth estimation. Our approach is based on a teacher-student formulation which guides our network to predict high-quality depths. We demonstrate that our approach is useful for various applications such as mobile robot navigation and is applicable to diverse environments. Our full system shows improvements over recent self-supervised depth estimation and completion methods on EuRoC, OpenLORIS, and ScanNet datasets.

74.1CVMay 13
PanoPlane: Plane-Aware Panoramic Completion for Sparse-View Indoor 3D Gaussian Splatting

Adil Qureshi, Dongki Jung, Jaehoon Choi et al.

We present PanoPlane, an approach for high-fidelity sparse-view indoor novel view synthesis that reconstructs closed room geometry via panoramic scene completion. Unlike perspective-based methods that generate training views from limited fields of view, PanoPlane leverages $360^{\circ}$ panoramic completion to condition the generative process on the full spatial layout. We propose Layout Anchored Attention Steering, a training-free mechanism that steers attention within the diffusion model's internal representation toward scene's detected planar surfaces at inference time. By directing each unobserved region's attention toward geometrically consistent observed content, our method replaces unconstrained hallucination with grounded surface extrapolation. The resulting panoramic completions provide supervision for 3D Gaussian Splatting, enabling accurate novel-view synthesis across unobserved regions from as few as three input views. Experiments on Replica, ScanNet++, and Matterport3D demonstrate state-of-the-art novel view synthesis quality across 3, 6, and 9 input views, achieving up to $+17.8\%$ improvement in PSNR over the current state-of-the-art baseline without any training or fine-tuning of the diffusion model.

CVFeb 5
Wid3R: Wide Field-of-View 3D Reconstruction via Camera Model Conditioning

Dongki Jung, Jaehoon Choi, Adil Qureshi et al.

We present Wid3R, a feed-forward neural network for visual geometry reconstruction that supports wide field-of-view camera models. Prior methods typically assume that input images are rectified or captured with pinhole cameras, since both their architectures and training datasets are tailored to perspective images only. These assumptions limit their applicability in real-world scenarios that use fisheye or panoramic cameras and often require careful calibration and undistortion. In contrast, Wid3R is a generalizable multi-view 3D estimation method that can model wide field-of-view camera types. Our approach leverages a ray representation with spherical harmonics and a novel camera model token within the network, enabling distortion-aware 3D reconstruction. Furthermore, Wid3R is the first multi-view foundation model to support feed-forward 3D reconstruction directly from 360 imagery. It demonstrates strong zero-shot robustness and consistently outperforms prior methods, achieving improvements of up to +77.33 on Stanford2D3D.

CVFeb 28, 2025
EDM: Equirectangular Projection-Oriented Dense Kernelized Feature Matching

Dongki Jung, Jaehoon Choi, Yonghan Lee et al.

We introduce the first learning-based dense matching algorithm, termed Equirectangular Projection-Oriented Dense Kernelized Feature Matching (EDM), specifically designed for omnidirectional images. Equirectangular projection (ERP) images, with their large fields of view, are particularly suited for dense matching techniques that aim to establish comprehensive correspondences across images. However, ERP images are subject to significant distortions, which we address by leveraging the spherical camera model and geodesic flow refinement in the dense matching method. To further mitigate these distortions, we propose spherical positional embeddings based on 3D Cartesian coordinates of the feature grid. Additionally, our method incorporates bidirectional transformations between spherical and Cartesian coordinate systems during refinement, utilizing a unit sphere to improve matching performance. We demonstrate that our proposed method achieves notable performance enhancements, with improvements of +26.72 and +42.62 in AUC@5° on the Matterport3D and Stanford2D3D datasets.

CVApr 2, 2025
UAVTwin: Neural Digital Twins for UAVs using Gaussian Splatting

Jaehoon Choi, Dongki Jung, Yonghan Lee et al.

We present UAVTwin, a method for creating digital twins from real-world environments and facilitating data augmentation for training downstream models embedded in unmanned aerial vehicles (UAVs). Specifically, our approach focuses on synthesizing foreground components, such as various human instances in motion within complex scene backgrounds, from UAV perspectives. This is achieved by integrating 3D Gaussian Splatting (3DGS) for reconstructing backgrounds along with controllable synthetic human models that display diverse appearances and actions in multiple poses. To the best of our knowledge, UAVTwin is the first approach for UAV-based perception that is capable of generating high-fidelity digital twins based on 3DGS. The proposed work significantly enhances downstream models through data augmentation for real-world environments with multiple dynamic objects and significant appearance variations-both of which typically introduce artifacts in 3DGS-based modeling. To tackle these challenges, we propose a novel appearance modeling strategy and a mask refinement module to enhance the training of 3D Gaussian Splatting. We demonstrate the high quality of neural rendering by achieving a 1.23 dB improvement in PSNR compared to recent methods. Furthermore, we validate the effectiveness of data augmentation by showing a 2.5% to 13.7% improvement in mAP for the human detection task.

CVOct 8, 2025
MoRe: Monocular Geometry Refinement via Graph Optimization for Cross-View Consistency

Dongki Jung, Jaehoon Choi, Yonghan Lee et al.

Monocular 3D foundation models offer an extensible solution for perception tasks, making them attractive for broader 3D vision applications. In this paper, we propose MoRe, a training-free Monocular Geometry Refinement method designed to improve cross-view consistency and achieve scale alignment. To induce inter-frame relationships, our method employs feature matching between frames to establish correspondences. Rather than applying simple least squares optimization on these matched points, we formulate a graph-based optimization framework that performs local planar approximation using the estimated 3D points and surface normals estimated by monocular foundation models. This formulation addresses the scale ambiguity inherent in monocular geometric priors while preserving the underlying 3D structure. We further demonstrate that MoRe not only enhances 3D reconstruction but also improves novel view synthesis, particularly in sparse view rendering scenarios.

CVSep 28, 2025
RPG360: Robust 360 Depth Estimation with Perspective Foundation Models and Graph Optimization

Dongki Jung, Jaehoon Choi, Yonghan Lee et al.

The increasing use of 360 images across various domains has emphasized the need for robust depth estimation techniques tailored for omnidirectional images. However, obtaining large-scale labeled datasets for 360 depth estimation remains a significant challenge. In this paper, we propose RPG360, a training-free robust 360 monocular depth estimation method that leverages perspective foundation models and graph optimization. Our approach converts 360 images into six-face cubemap representations, where a perspective foundation model is employed to estimate depth and surface normals. To address depth scale inconsistencies across different faces of the cubemap, we introduce a novel depth scale alignment technique using graph-based optimization, which parameterizes the predicted depth and normal maps while incorporating an additional per-face scale parameter. This optimization ensures depth scale consistency across the six-face cubemap while preserving 3D structural integrity. Furthermore, as foundation models exhibit inherent robustness in zero-shot settings, our method achieves superior performance across diverse datasets, including Matterport3D, Stanford2D3D, and 360Loc. We also demonstrate the versatility of our depth estimation approach by validating its benefits in downstream tasks such as feature matching 3.2 ~ 5.4% and Structure from Motion 0.2 ~ 9.7% in AUC@5.

CVJun 5, 2025
UAV4D: Dynamic Neural Rendering of Human-Centric UAV Imagery using Gaussian Splatting

Jaehoon Choi, Dongki Jung, Christopher Maxey et al.

Despite significant advancements in dynamic neural rendering, existing methods fail to address the unique challenges posed by UAV-captured scenarios, particularly those involving monocular camera setups, top-down perspective, and multiple small, moving humans, which are not adequately represented in existing datasets. In this work, we introduce UAV4D, a framework for enabling photorealistic rendering for dynamic real-world scenes captured by UAVs. Specifically, we address the challenge of reconstructing dynamic scenes with multiple moving pedestrians from monocular video data without the need for additional sensors. We use a combination of a 3D foundation model and a human mesh reconstruction model to reconstruct both the scene background and humans. We propose a novel approach to resolve the scene scale ambiguity and place both humans and the scene in world coordinates by identifying human-scene contact points. Additionally, we exploit the SMPL model and background mesh to initialize Gaussian splats, enabling holistic scene rendering. We evaluated our method on three complex UAV-captured datasets: VisDrone, Manipal-UAV, and Okutama-Action, each with distinct characteristics and 10~50 humans. Our results demonstrate the benefits of our approach over existing methods in novel view synthesis, achieving a 1.5 dB PSNR improvement and superior visual sharpness.

CVFeb 18, 2025
IM360: Large-scale Indoor Mapping with 360 Cameras

Dongki Jung, Jaehoon Choi, Yonghan Lee et al.

We present a novel 3D mapping pipeline for large-scale indoor environments. To address the significant challenges in large-scale indoor scenes, such as prevalent occlusions and textureless regions, we propose IM360, a novel approach that leverages the wide field of view of omnidirectional images and integrates the spherical camera model into the Structure-from-Motion (SfM) pipeline. Our SfM utilizes dense matching features specifically designed for 360 images, demonstrating superior capability in image registration. Furthermore, with the aid of mesh-based neural rendering techniques, we introduce a texture optimization method that refines texture maps and accurately captures view-dependent properties by combining diffuse and specular components. We evaluate our pipeline on large-scale indoor scenes, demonstrating its effectiveness in real-world scenarios. In practice, IM360 demonstrates superior performance, achieving a 3.5 PSNR increase in textured mesh reconstruction. We attain state-of-the-art performance in terms of camera localization and registration on Matterport3D and Stanford2D3D.

CVAug 12, 2021
DnD: Dense Depth Estimation in Crowded Dynamic Indoor Scenes

Dongki Jung, Jaehoon Choi, Yonghan Lee et al.

We present a novel approach for estimating depth from a monocular camera as it moves through complex and crowded indoor environments, e.g., a department store or a metro station. Our approach predicts absolute scale depth maps over the entire scene consisting of a static background and multiple moving people, by training on dynamic scenes. Since it is difficult to collect dense depth maps from crowded indoor environments, we design our training framework without requiring depths produced from depth sensing devices. Our network leverages RGB images and sparse depth maps generated from traditional 3D reconstruction methods to estimate dense depth maps. We use two constraints to handle depth for non-rigidly moving people without tracking their motion explicitly. We demonstrate that our approach offers consistent improvements over recent depth estimation methods on the NAVERLABS dataset, which includes complex and crowded scenes.

CVNov 10, 2020
SelfDeco: Self-Supervised Monocular Depth Completion in Challenging Indoor Environments

Jaehoon Choi, Dongki Jung, Yonghan Lee et al.

We present a novel algorithm for self-supervised monocular depth completion. Our approach is based on training a neural network that requires only sparse depth measurements and corresponding monocular video sequences without dense depth labels. Our self-supervised algorithm is designed for challenging indoor environments with textureless regions, glossy and transparent surface, non-Lambertian surfaces, moving people, longer and diverse depth ranges and scenes captured by complex ego-motions. Our novel architecture leverages both deep stacks of sparse convolution blocks to extract sparse depth features and pixel-adaptive convolutions to fuse image and depth features. We compare with existing approaches in NYUv2, KITTI, and NAVERLABS indoor datasets, and observe 5-34 % improvements in root-means-square error (RMSE) reduction.

CVOct 6, 2020
SAFENet: Self-Supervised Monocular Depth Estimation with Semantic-Aware Feature Extraction

Jaehoon Choi, Dongki Jung, Donghwan Lee et al.

Self-supervised monocular depth estimation has emerged as a promising method because it does not require groundtruth depth maps during training. As an alternative for the groundtruth depth map, the photometric loss enables to provide self-supervision on depth prediction by matching the input image frames. However, the photometric loss causes various problems, resulting in less accurate depth values compared with supervised approaches. In this paper, we propose SAFENet that is designed to leverage semantic information to overcome the limitations of the photometric loss. Our key idea is to exploit semantic-aware depth features that integrate the semantic and geometric knowledge. Therefore, we introduce multi-task learning schemes to incorporate semantic-awareness into the representation of depth features. Experiments on KITTI dataset demonstrate that our methods compete or even outperform the state-of-the-art methods. Furthermore, extensive experiments on different datasets show its better generalization ability and robustness to various conditions, such as low-light or adverse weather.

CVOct 6, 2020
Arbitrary Style Transfer using Graph Instance Normalization

Dongki Jung, Seunghan Yang, Jaehoon Choi et al.

Style transfer is the image synthesis task, which applies a style of one image to another while preserving the content. In statistical methods, the adaptive instance normalization (AdaIN) whitens the source images and applies the style of target images through normalizing the mean and variance of features. However, computing feature statistics for each instance would neglect the inherent relationship between features, so it is hard to learn global styles while fitting to the individual training dataset. In this paper, we present a novel learnable normalization technique for style transfer using graph convolutional networks, termed Graph Instance Normalization (GrIN). This algorithm makes the style transfer approach more robust by taking into account similar information shared between instances. Besides, this simple module is also applicable to other tasks like image-to-image translation or domain adaptation.

CVMay 16, 2020
Partial Domain Adaptation Using Graph Convolutional Networks

Seunghan Yang, Youngeun Kim, Dongki Jung et al.

Partial domain adaptation (PDA), in which we assume the target label space is included in the source label space, is a general version of standard domain adaptation. Since the target label space is unknown, the main challenge of PDA is to reduce the learning impact of irrelevant source samples, named outliers, which do not belong to the target label space. Although existing partial domain adaptation methods effectively down-weigh outliers' importance, they do not consider data structure of each domain and do not directly align the feature distributions of the same class in the source and target domains, which may lead to misalignment of category-level distributions. To overcome these problems, we propose a graph partial domain adaptation (GPDA) network, which exploits Graph Convolutional Networks for jointly considering data structure and the feature distribution of each class. Specifically, we propose a label relational graph to align the distributions of the same category in two domains and introduce moving average centroid separation for learning networks from the label relational graph. We demonstrate that considering data structure and the distribution of each category is effective for PDA and our GPDA network achieves state-of-the-art performance on the Digit and Office-31 datasets.