Dong-Hee Kim

STAT-MECH
h-index1
8papers
6citations
Novelty51%
AI Score39

8 Papers

67.8CVMay 25
Diversity Over Frequency: Rethinking Tool Use in Visual Chain-of-Thought Agents

Dong-Hee Kim, Reuben Tan, Donghyun Kim

Visual agents employ external visual tools within visual chains of thought to incorporate fine-grained evidence. While prior work has mainly studied these tools in visual search tasks, their role in more complex visual reasoning remains underexplored. In this paper, we move beyond simple visual search tasks to investigate more challenging tasks, including 3D spatial reasoning and medical visual question answering, where agents must integrate tool-acquired local evidence with the global context. We identify a {tool-use collapse phenomenon: models progressively stop using tools while still achieving higher task accuracy. Moreover, we observe a clear asymmetry: (i) completely eliminating tool use degrades performance, whereas (ii) incentivizing tool use yields only marginal gains despite substantially increasing usage. We find that vanilla training and tool-use encouragement both reduce rollout diversity, explaining why higher tool use does not yield stronger reasoning performance. Motivated by these findings, we add an entropy regularization term to encourage diverse rollout exploration, achieving the best performance despite gradually declining tool usage. % We further observe similar dynamics on medical VQA, suggesting that tool-use collapse is not limited to 3D spatial reasoning. Overall, our findings suggest a training-time view of tools as scaffolding, where broader exploration over language generation and visual tool invocation improves reasoning despite tool-use collapse. Project page: https://scaffolded-exploration.github.io

97.0ROMar 13
Spatially Grounded Long-Horizon Task Planning in the Wild

Sehun Jung, HyunJee Song, Dong-Hee Kim et al.

Recent advances in robot manipulation increasingly leverage Vision-Language Models (VLMs) for high-level reasoning, such as decomposing task instructions into sequential action plans expressed in natural language that guide downstream low-level motor execution. However, current benchmarks do not assess whether these plans are spatially executable, particularly in specifying the exact spatial locations where the robot should interact to execute the plan, limiting evaluation of real-world manipulation capability. To bridge this gap, we define a novel task of grounded planning and introduce GroundedPlanBench, a newly curated benchmark for spatially grounded long-horizon action planning in the wild. GroundedPlanBench jointly evaluates hierarchical sub-action planning and spatial action grounding (where to act), enabling systematic assessment of whether generated sub-actions are spatially executable for robot manipulation. We further introduce Video-to-Spatially Grounded Planning (V2GP), an automated data generation framework that leverages real-world robot video demonstrations to improve spatially grounded long-horizon planning. Our evaluations reveal that spatially grounded long-horizon planning remains a major bottleneck for current VLMs. Our results demonstrate that V2GP provides a promising approach for improving both action planning and spatial grounding performance, validated on our benchmark as well as through real-world robot manipulation experiments, advancing progress toward spatially actionable planning.

LGMay 23, 2025Code
SynRES: Towards Referring Expression Segmentation in the Wild via Synthetic Data

Dong-Hee Kim, Hyunjee Song, Donghyun Kim

Despite the advances in Referring Expression Segmentation (RES) benchmarks, their evaluation protocols remain constrained, primarily focusing on either single targets with short queries (containing minimal attributes) or multiple targets from distinctly different queries on a single domain. This limitation significantly hinders the assessment of more complex reasoning capabilities in RES models. We introduce WildRES, a novel benchmark that incorporates long queries with diverse attributes and non-distinctive queries for multiple targets. This benchmark spans diverse application domains, including autonomous driving environments and robotic manipulation scenarios, thus enabling more rigorous evaluation of complex reasoning capabilities in real-world settings. Our analysis reveals that current RES models demonstrate substantial performance deterioration when evaluated on WildRES. To address this challenge, we introduce SynRES, an automated pipeline generating densely paired compositional synthetic training data through three innovations: (1) a dense caption-driven synthesis for attribute-rich image-mask-expression triplets, (2) reliable semantic alignment mechanisms rectifying caption-pseudo mask inconsistencies via Image-Text Aligned Grouping, and (3) domain-aware augmentations incorporating mosaic composition and superclass replacement to emphasize generalization ability and distinguishing attributes over object categories. Experimental results demonstrate that models trained with SynRES achieve state-of-the-art performance, improving gIoU by 2.0% on WildRES-ID and 3.8% on WildRES-DS. Code and datasets are available at https://github.com/UTLLab/SynRES.

CVJan 26, 2024Code
CNG-SFDA:Clean-and-Noisy Region Guided Online-Offline Source-Free Domain Adaptation

Hyeonwoo Cho, Chanmin Park, Dong-Hee Kim et al.

Domain shift occurs when training (source) and test (target) data diverge in their distribution. Source-Free Domain Adaptation (SFDA) addresses this domain shift problem, aiming to adopt a trained model on the source domain to the target domain in a scenario where only a well-trained source model and unlabeled target data are available. In this scenario, handling false labels in the target domain is crucial because they negatively impact the model performance. To deal with this problem, we propose to update cluster prototypes (i.e., centroid of each sample cluster) and their structure in the target domain formulated by the source model in online manners. In the feature space, samples in different regions have different pseudo-label distribution characteristics affected by the cluster prototypes, and we adopt distinct training strategies for these samples by defining clean and noisy regions: we selectively train the target with clean pseudo-labels in the clean region, whereas we introduce mix-up inputs representing intermediate features between clean and noisy regions to increase the compactness of the cluster. We conducted extensive experiments on multiple datasets in online/offline SFDA settings, whose results demonstrate that our method, CNG-SFDA, achieves state-of-the-art for most cases. Code is available at https://github.com/hyeonwoocho7/CNG-SFDA.

STAT-MECHAug 18, 2023
Neural-network quantum state study of the long-range antiferromagnetic Ising chain

Jicheol Kim, Dongkyu Kim, Dong-Hee Kim

We investigate quantum phase transitions in the transverse field Ising chain with algebraically decaying long-range (LR) antiferromagnetic interactions using the variational Monte Carlo method with the restricted Boltzmann machine employed as a trial wave function ansatz. First, we measure the critical exponents and the central charge through the finite-size scaling analysis, verifying the contrasting observations in the previous tensor network studies. The correlation function exponent and the central charge deviate from the short-range (SR) Ising values at a small decay exponent $α_\mathrm{LR}$, while the other critical exponents examined are very close to the SR Ising exponents regardless of $α_\mathrm{LR}$ examined. However, in the further test of the critical Binder ratio, we find that the universal ratio of the SR limit does not hold for $α_\mathrm{LR} < 2$, implying a deviation in the criticality. On the other hand, we find evidence of the conformal invariance breakdown in the conformal field theory (CFT) test of the correlation function. The deviation from the CFT description becomes more pronounced as $α_\mathrm{LR}$ decreases, although a precise breakdown threshold is yet to be determined.

CVJul 15, 2024
Joint-Embedding Predictive Architecture for Self-Supervised Learning of Mask Classification Architecture

Dong-Hee Kim, Sungduk Cho, Hyeonwoo Cho et al.

In this work, we introduce Mask-JEPA, a self-supervised learning framework tailored for mask classification architectures (MCA), to overcome the traditional constraints associated with training segmentation models. Mask-JEPA combines a Joint Embedding Predictive Architecture with MCA to adeptly capture intricate semantics and precise object boundaries. Our approach addresses two critical challenges in self-supervised learning: 1) extracting comprehensive representations for universal image segmentation from a pixel decoder, and 2) effectively training the transformer decoder. The use of the transformer decoder as a predictor within the JEPA framework allows proficient training in universal image segmentation tasks. Through rigorous evaluations on datasets such as ADE20K, Cityscapes and COCO, Mask-JEPA demonstrates not only competitive results but also exceptional adaptability and robustness across various training scenarios. The architecture-agnostic nature of Mask-JEPA further underscores its versatility, allowing seamless adaptation to various mask classification family.

STAT-MECHOct 1, 2020
Emergence of a finite-size-scaling function in the supervised learning of the Ising phase transition

Dongkyu Kim, Dong-Hee Kim

We investigate the connection between the supervised learning of the binary phase classification in the ferromagnetic Ising model and the standard finite-size-scaling theory of the second-order phase transition. Proposing a minimal one-free-parameter neural network model, we analytically formulate the supervised learning problem for the canonical ensemble being used as a training data set. We show that just one free parameter is capable enough to describe the data-driven emergence of the universal finite-size-scaling function in the network output that is observed in a large neural network, theoretically validating its critical point prediction for unseen test data from different underlying lattices yet in the same universality class of the Ising criticality. We also numerically demonstrate the interpretation with the proposed one-parameter model by providing an example of finding a critical point with the learning of the Landau mean-field free energy being applied to the real data set from the uncorrelated random scale-free graph with a large degree exponent.

LGJun 28, 2020
A Confidence-Calibrated MOBA Game Winner Predictor

Dong-Hee Kim, Changwoo Lee, Ki-Seok Chung

In this paper, we propose a confidence-calibration method for predicting the winner of a famous multiplayer online battle arena (MOBA) game, League of Legends. In MOBA games, the dataset may contain a large amount of input-dependent noise; not all of such noise is observable. Hence, it is desirable to attempt a confidence-calibrated prediction. Unfortunately, most existing confidence calibration methods are pertaining to image and document classification tasks where consideration on uncertainty is not crucial. In this paper, we propose a novel calibration method that takes data uncertainty into consideration. The proposed method achieves an outstanding expected calibration error (ECE) (0.57%) mainly owing to data uncertainty consideration, compared to a conventional temperature scaling method of which ECE value is 1.11%.