Dohoon Kim

CL
h-index8
8papers
26citations
Novelty46%
AI Score57

8 Papers

CVDec 11, 2023Code
MAFA: Managing False Negatives for Vision-Language Pre-training

Jaeseok Byun, Dohoon Kim, Taesup Moon

We consider a critical issue of false negatives in Vision-Language Pre-training (VLP), a challenge that arises from the inherent many-to-many correspondence of image-text pairs in large-scale web-crawled datasets. The presence of false negatives can impede achieving optimal performance and even lead to a significant performance drop. To address this challenge, we propose MAFA (MAnaging FAlse negatives), which consists of two pivotal components building upon the recently developed GRouped mIni-baTch sampling (GRIT) strategy: 1) an efficient connection mining process that identifies and converts false negatives into positives, and 2) label smoothing for the image-text contrastive (ITC) loss. Our comprehensive experiments verify the effectiveness of MAFA across multiple downstream tasks, emphasizing the crucial role of addressing false negatives in VLP, potentially even surpassing the importance of addressing false positives. In addition, the compatibility of MAFA with the recent BLIP-family model is also demonstrated. Code is available at https://github.com/jaeseokbyun/MAFA.

CLJan 5
K-EXAONE Technical Report

Eunbi Choi, Kibong Choi, Seokhee Hong et al.

This technical report presents K-EXAONE, a large-scale multilingual language model developed by LG AI Research. K-EXAONE is built on a Mixture-of-Experts architecture with 236B total parameters, activating 23B parameters during inference. It supports a 256K-token context window and covers six languages: Korean, English, Spanish, German, Japanese, and Vietnamese. We evaluate K-EXAONE on a comprehensive benchmark suite spanning reasoning, agentic, general, Korean, and multilingual abilities. Across these evaluations, K-EXAONE demonstrates performance comparable to open-weight models of similar size. K-EXAONE, designed to advance AI for a better life, is positioned as a powerful proprietary AI foundation model for a wide range of industrial and research applications.

DBApr 3
LitMOF: An LLM Multi-Agent for Literature-Validated Metal-Organic Frameworks Database Correction and Expansion

Honghui Kim, Dohoon Kim, Jihan Kim

Metal-organic framework (MOF) databases have grown rapidly through experimental deposition and large-scale literature extraction, but recent analyses show that nearly half of their entries contain substantial structural errors. These inaccuracies propagate through high-throughput screening and machine-learning workflows, limiting the reliability of data-driven MOF discovery. Correcting such errors is exceptionally difficult because true repairs require integrating crystallographic files, synthesis descriptions, and contextual evidence scattered across the literature. Here we introduce LitMOF, a large language model-driven multi-agent framework that validates crystallographic information directly from the original literature and cross-validates it with database entries to repair structural errors. Applying LitMOF to the experimental MOF database (the CSD MOF Subset), we constructed LitMOF-DB, a curated set of 186,773 computation-ready structures, including the successful repair of 8,771 invalid entries, which accounts for 65.3% of the not-computation-ready MOFs in the latest CoRE MOF database. Additionally, the system uncovered 12,646 experimentally reported MOFs absent from existing resources, substantially expanding the known experimental design space. Using direct air capture screening as a case study, we demonstrate that structural errors severely distort predicted adsorption energies and CO2/H2O selectivity, leading to systematic misranking of materials, false positives, and the omission of high-performance candidates. This work establishes a scalable pathway toward self-correcting scientific databases and a generalizable paradigm for LLM-driven curation in materials science.

CLApr 9
EXAONE 4.5 Technical Report

Eunbi Choi, Kibong Choi, Sehyun Chun et al.

This technical report introduces EXAONE 4.5, the first open-weight vision language model released by LG AI Research. EXAONE 4.5 is architected by integrating a dedicated visual encoder into the existing EXAONE 4.0 framework, enabling native multimodal pretraining over both visual and textual modalities. The model is trained on large-scale data with careful curation, particularly emphasizing document-centric corpora that align with LG's strategic application domains. This targeted data design enables substantial performance gains in document understanding and related tasks, while also delivering broad improvements across general language capabilities. EXAONE 4.5 extends context length up to 256K tokens, facilitating long-context reasoning and enterprise-scale use cases. Comparative evaluations demonstrate that EXAONE 4.5 achieves competitive performance in general benchmarks while outperforming state-of-the-art models of similar scale in document understanding and Korean contextual reasoning. As part of LG's ongoing effort toward practical industrial deployment, EXAONE 4.5 is designed to be continuously extended with additional domains and application scenarios to advance AI for a better life.

CLJul 3, 2025Code
DoMIX: An Efficient Framework for Exploiting Domain Knowledge in Fine-Tuning

Dohoon Kim, Donghun Kang, Taesup Moon

Domain-Adaptive Pre-training (DAP) has recently gained attention for its effectiveness in fine-tuning pre-trained models. Building on this, continual DAP has been explored to develop pre-trained models capable of incrementally incorporating different domain datasets. However, existing continual DAP methods face several limitations: (1) high computational cost and GPU memory usage during training; (2) sensitivity to incremental data order; and (3) providing a single, generalized model for all end tasks, which contradicts the essence of DAP. In this paper, we propose DoMIX, a novel approach that addresses these challenges by leveraging LoRA modules, a representative parameter-efficient fine-tuning (PEFT) method. Our approach enables efficient and parallel domain-adaptive pre-training that is robust to domain order and effectively utilizes accumulated knowledge to provide tailored pre-trained models for specific tasks. We also demonstrate that our method can be extended beyond the DAP setting to standard LLM fine-tuning scenarios. Code is available at https://github.com/dohoonkim-ai/DoMIX.

CVMay 2, 2023Code
PU-EdgeFormer: Edge Transformer for Dense Prediction in Point Cloud Upsampling

Dohoon Kim, Minwoo Shin, Joonki Paik

Despite the recent development of deep learning-based point cloud upsampling, most MLP-based point cloud upsampling methods have limitations in that it is difficult to train the local and global structure of the point cloud at the same time. To solve this problem, we present a combined graph convolution and transformer for point cloud upsampling, denoted by PU-EdgeFormer. The proposed method constructs EdgeFormer unit that consists of graph convolution and multi-head self-attention modules. We employ graph convolution using EdgeConv, which learns the local geometry and global structure of point cloud better than existing point-to-feature method. Through in-depth experiments, we confirmed that the proposed method has better point cloud upsampling performance than the existing state-of-the-art method in both subjective and objective aspects. The code is available at https://github.com/dohoon2045/PU-EdgeFormer.

COApr 21
Lions and Contamination: Trees and General Graphs

Dohoon Kim, Eungyu Woo, Donghoon Shin

This paper investigates a special variant of a pursuit-evasion game called lions and contamination. In a graph where all vertices are initially contaminated, a set of lions traverses the graph, clearing the contamination from every vertex they visit. However, the contamination simultaneously spreads to any adjacent vertex not occupied by a lion. We analyze the relationships among the lion number $\mathcal{L}(G)$, monotone lion number $\mathcal{L}^m(G)$, and the graph's pathwidth $\operatorname{pw}(G)$. Our main results are as follows: (a) We prove a monotonicity property: for any graph $G$ and its isometric subgraph $H$, $\mathcal{L}(H)\le \mathcal{L}(G)$. (b) For trees $T$, we show that the lion number is tightly characterized by pathwidth, satisfying $\operatorname{pw}(T)\le \mathcal{L}(T)\le \operatorname{pw}(T)+1$. (c) We provide a counterexample showing that the monotonicity property fails for arbitrary subgraphs. (d) We show that, in contrast to the tree case, pathwidth does not yield a general lower bound on $\mathcal{L}(G)$ for arbitrary graphs. (e) For any connected graph $G$, we prove the general upper bound $\mathcal{L}(G)\le \operatorname{pw}(G)+1$. (f) For the monotone variant, we establish the general lower bound $\operatorname{pw}(G)\le \mathcal{L}^m(G)$. (g) Conversely, we show that $\mathcal{L}^m(G)\le 2\operatorname{pw}(G)+2$ holds for all connected graphs, which is best possible up to a small additive constant.

AIMay 22, 2025
MADCluster: Model-agnostic Anomaly Detection with Self-supervised Clustering Network

Sangyong Lee, Subo Hwang, Dohoon Kim

In this paper, we propose MADCluster, a novel model-agnostic anomaly detection framework utilizing self-supervised clustering. MADCluster is applicable to various deep learning architectures and addresses the 'hypersphere collapse' problem inherent in existing deep learning-based anomaly detection methods. The core idea is to cluster normal pattern data into a 'single cluster' while simultaneously learning the cluster center and mapping data close to this center. Also, to improve expressiveness and enable effective single clustering, we propose a new 'One-directed Adaptive loss'. The optimization of this loss is mathematically proven. MADCluster consists of three main components: Base Embedder capturing high-dimensional temporal dynamics, Cluster Distance Mapping, and Sequence-wise Clustering for continuous center updates. Its model-agnostic characteristics are achieved by applying various architectures to the Base Embedder. Experiments on four time series benchmark datasets demonstrate that applying MADCluster improves the overall performance of comparative models. In conclusion, the compatibility of MADCluster shows potential for enhancing model performance across various architectures.