Youngsook Song

CL
h-index: 1
8 papers
1,013 citations
Novelty: 38%
AI Score: 55

8 Papers

66.2 · CV · Apr 13
What Users Leave Unsaid: Under-Specified Queries Limit Vision-Language Models

Dasol Choi, Guijin Son, Hanwool Lee et al.

Current vision-language benchmarks predominantly feature well-structured questions with clear, explicit prompts. However, real user queries are often informal and underspecified. Users naturally leave much unsaid, relying on images to convey context. We introduce HAERAE-Vision, a benchmark of 653 real-world visual questions from Korean online communities (0.76% survival from 86K candidates), each paired with an explicit rewrite, yielding 1,306 query variants in total. Evaluating 39 VLMs, we find that even state-of-the-art models (GPT-5, Gemini 2.5 Pro) achieve under 50% on the original queries. Crucially, query explicitation alone yields 8 to 22 point improvements, with smaller models benefiting most. We further show that even with web search, under-specified queries underperform explicit queries without search, revealing that current retrieval cannot compensate for what users leave unsaid. Our findings demonstrate that a substantial portion of VLM difficulty stems from natural query under-specification rather than from model capability, highlighting a critical gap between benchmark evaluation and real-world deployment.

61.4 · DC · May 10
From Detection to Recovery: Operational Analysis on LLM Pre-training with 504 GPUs

Daemyung Kang, Eunjin Hwang, Hanjeong Lee et al.

Large-scale AI training is now fundamentally a distributed systems problem, and hardware failures have become routine operating conditions rather than rare exceptions. Public operational evidence from production training clusters, however, remains scarce. This technical report presents an empirical analysis of a 63-node NVIDIA B200 production cluster (504 GPUs), using 55 days of Prometheus time-series data and 73 days of operational logs covering 224 multi-node training sessions. The cluster operates within a cross-organizational environment in which five parties (SKT, Upstage, Lablup, NVIDIA Korea, and VAST Data) share a unified monitoring pipeline. This arrangement enabled joint diagnosis of a 60-node-scale storage I/O bottleneck that did not appear at 2-4-node scale, a production-scale phenomenon no single team could isolate alone. Drawing on a months-long pre-training campaign, we perform three quantitative analyses yielding four findings. First, statistical analysis over 751 Prometheus metrics and 10 XID-identified GPU failures achieves a 10/10 detection rate (2/10 pre-XID) at ~0.84 false positives per day. No single metric is consistently dominant across failure types, motivating a multi-signal detection strategy. Second, profiling 523 checkpoint events along the GPU VRAM to NFS path attributes the "bandwidth paradox" (1.4-10.4% utilization of 200 Gbps RoCE) to saturation of the 128-slot NFS RPC layer. Third, multi-node failure response shows concentrated exclusions (top 3 of 63 nodes account for >50% of all exclusions) and an auto-retry chain success rate of 33.3% over 12 chains (73 attempts), 2.7x the 12.5% manual recovery rate; the median retry interval is 11 min (IQR 10-11). All analyses are grounded in production infrastructure providing session-level workload management, GPU-centric scheduling, and unified observability.
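The headline ratios in this abstract can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the count of 4 successful auto-retry chains is inferred from "33.3% over 12 chains" and is not stated in the abstract, and the function names are hypothetical.

```python
def recovery_speedup(auto_successes: int, auto_chains: int, manual_rate: float) -> float:
    """Ratio of the auto-retry chain success rate to the manual recovery rate."""
    auto_rate = auto_successes / auto_chains
    return auto_rate / manual_rate


def utilized_gbps(link_gbps: float, utilization_pct: float) -> float:
    """Throughput actually used on a link at the given utilization percentage."""
    return link_gbps * utilization_pct / 100.0


# Assumption: 33.3% of 12 chains corresponds to 4 successful chains.
speedup = recovery_speedup(auto_successes=4, auto_chains=12, manual_rate=0.125)

# 1.4-10.4% utilization of a 200 Gbps RoCE link, as quoted in the abstract.
low_gbps = utilized_gbps(200, 1.4)
high_gbps = utilized_gbps(200, 10.4)

print(f"auto-retry vs manual: {speedup:.1f}x")
print(f"checkpoint traffic: {low_gbps:.1f}-{high_gbps:.1f} Gbps of 200 Gbps")
```

Under these assumptions the ratio rounds to the reported 2.7x, and the quoted utilization range corresponds to roughly 2.8 to 20.8 Gbps of the 200 Gbps link.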

CL · Dec 31, 2020 · Code
Open Korean Corpora: A Practical Report

Won Ik Cho, Sangwhan Moon, Youngsook Song

Korean is often referred to as a low-resource language in the research community. While this claim is partially true, it also arises because the availability of resources is inadequately advertised and curated. This work curates and reviews a list of Korean corpora, first describing institution-level resource development, then iterating through a list of current open datasets for different types of tasks. We then propose a direction on how open-source dataset construction and releases should be done for less-resourced languages to promote research.

CV · May 21, 2025
Better Safe Than Sorry? Overreaction Problem of Vision Language Models in Visual Emergency Recognition

Dasol Choi, Seunghyun Lee, Youngsook Song

Vision-Language Models (VLMs) have shown capabilities in interpreting visual content, but their reliability in safety-critical scenarios remains insufficiently explored. We introduce VERI, a diagnostic benchmark comprising 200 synthetic images (100 contrastive pairs) and an additional 50 real-world images (25 pairs) for validation. Each emergency scene is paired with a visually similar but safe counterpart through human verification. Using a two-stage evaluation protocol (risk identification and emergency response), we assess 17 VLMs across medical emergencies, accidents, and natural disasters. Our analysis reveals an "overreaction problem": models achieve high recall (70-100%) but suffer from low precision, misclassifying 31-96% of safe situations as dangerous. Seven safe scenarios were universally misclassified by all models. This "better-safe-than-sorry" bias stems from contextual overinterpretation (88-98% of errors). Both synthetic and real-world datasets confirm these systematic patterns, challenging VLM reliability in safety-critical applications. Addressing this requires enhanced contextual reasoning in ambiguous visual situations.
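To make the "high recall, low precision" pattern concrete, here is a minimal sketch of how such metrics could be computed over contrastive pairs. It is not the VERI evaluation code: the `PairResult` record, the `overreaction_metrics` function, and the toy data are all hypothetical, and the benchmark's two-stage protocol is reduced to a single binary "flagged as dangerous" decision per image.

```python
from dataclasses import dataclass


@dataclass
class PairResult:
    """Hypothetical record for one contrastive pair."""
    flagged_emergency: bool  # model called the emergency image dangerous
    flagged_safe: bool       # model called the safe counterpart dangerous


def overreaction_metrics(results: list[PairResult]) -> dict[str, float]:
    """Recall on emergency images, precision of 'dangerous' calls,
    and the share of safe images misclassified as dangerous."""
    n_pairs = len(results)
    true_pos = sum(r.flagged_emergency for r in results)
    false_pos = sum(r.flagged_safe for r in results)
    flagged = true_pos + false_pos
    return {
        "recall": true_pos / n_pairs,
        "precision": true_pos / flagged if flagged else 0.0,
        "safe_misclassified": false_pos / n_pairs,
    }


# Toy data (made up): 3 of 4 emergencies caught, 2 of 4 safe scenes over-flagged.
toy = [
    PairResult(True, True),
    PairResult(True, False),
    PairResult(True, True),
    PairResult(False, False),
]
metrics = overreaction_metrics(toy)
print(metrics)
```

Because every emergency image has exactly one safe counterpart, the two classes are balanced, so a model that flags nearly everything reaches high recall while its precision falls toward 50%.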

CL · Jul 6, 2025
No Language Data Left Behind: A Comparative Study of CJK Language Datasets in the Hugging Face Ecosystem

Dasol Choi, Woomyoung Park, Youngsook Song

Recent advances in Natural Language Processing (NLP) have underscored the crucial role of high-quality datasets in building large language models (LLMs). However, while extensive resources and analyses exist for English, the landscape for East Asian languages, particularly Chinese, Japanese, and Korean (CJK), remains fragmented and underexplored, despite these languages together serving over 1.6 billion speakers. To address this gap, we investigate the Hugging Face ecosystem from a cross-linguistic perspective, focusing on how cultural norms, research environments, and institutional practices shape dataset availability and quality. Drawing on more than 3,300 datasets, we employ quantitative and qualitative methods to examine how these factors drive distinct creation and curation patterns across Chinese, Japanese, and Korean NLP communities. Our findings highlight the large-scale and often institution-driven nature of Chinese datasets, grassroots community-led development in Korean NLP, and an entertainment- and subculture-focused emphasis in Japanese collections. By uncovering these patterns, we reveal practical strategies for enhancing dataset documentation, licensing clarity, and cross-lingual resource sharing, ultimately guiding more effective and culturally attuned LLM development in East Asia. We conclude by discussing best practices for future dataset curation and collaboration, aiming to strengthen resource development across all three languages.

CV · Jun 13, 2025
Stop learning it all to mitigate visual hallucination, Focus on the hallucination target

Dokyoon Yoon, Youngsook Song, Woomyong Park

Multimodal Large Language Models (MLLMs) frequently suffer from hallucination issues, generating information about objects that are not present in input images during vision-language tasks. These hallucinations particularly undermine model reliability in practical applications requiring accurate object identification. To address this challenge, we propose a preference learning approach that mitigates hallucinations by focusing on targeted areas where they occur. To implement this, we build a dataset containing hallucinated responses, correct responses, and target information (i.e., objects present in the images and the corresponding chunk positions in responses affected by hallucinations). By applying a preference learning method restricted to these specific targets, the model can filter out irrelevant signals and focus on correcting hallucinations. This allows the model to produce more factual responses by concentrating solely on relevant information. Experimental results demonstrate that the proposed approach effectively reduces hallucinations across multiple vision hallucination tasks, improving the reliability and performance of MLLMs without diminishing overall performance.

CL · Jun 13, 2025
KoGEC: Korean Grammatical Error Correction with Pre-trained Translation Models

Taeeun Kim, Semin Jeong, Youngsook Song

This research introduces KoGEC, a Korean Grammatical Error Correction system using pre-trained translation models. We fine-tuned NLLB (No Language Left Behind) models for Korean GEC, comparing their performance against large language models like GPT-4 and HCX-3. The study used two social media conversation datasets for training and testing. The NLLB models were fine-tuned using special language tokens to distinguish between original and corrected Korean sentences. Evaluation was done using BLEU scores and an "LLM as judge" method to classify error types. Results showed that the fine-tuned NLLB (KoGEC) models outperformed GPT-4o and HCX-3 in Korean GEC tasks. KoGEC demonstrated a more balanced error correction profile across various error types, whereas the larger LLMs tended to focus less on punctuation errors. We also developed a Chrome extension to make the KoGEC system accessible to users. Finally, we explored token vocabulary expansion to further improve the model but found it to decrease model performance. This research contributes to the field of NLP by providing an efficient, specialized Korean GEC system and a new evaluation method. It also highlights the potential of compact, task-specific models to compete with larger, general-purpose language models in specialized NLP tasks.

ML · Jul 6, 2017
Convergence Analysis of Optimization Algorithms

HyoungSeok Kim, JiHoon Kang, WooMyoung Park et al.

The regret bound of an optimization algorithm is one of the basic criteria for evaluating the performance of the given algorithm. By inspecting the differences between the regret bounds of traditional algorithms and adaptive ones, we provide a guide for choosing an optimizer with respect to the given data set and the loss function. For analysis, we assume that the loss function is convex and its gradient is Lipschitz continuous.
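For readers unfamiliar with the terms, the quantities named in this abstract are usually written as follows in the online convex optimization literature. The abstract itself does not fix any notation, so the symbols here ($f_t$ for the loss at step $t$, $x_t$ for the iterate, $\mathcal{X}$ for the feasible set, $L$ for the Lipschitz constant) are assumptions, and the paper may use a different but equivalent formulation.

```latex
% Regret after T steps: cumulative loss of the iterates minus
% the cumulative loss of the best fixed point in hindsight.
\[
R_T \;=\; \sum_{t=1}^{T} f_t(x_t) \;-\; \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x)
\]

% Lipschitz-continuous gradient (smoothness) with constant L.
\[
\|\nabla f(x) - \nabla f(y)\| \;\le\; L \,\|x - y\|
\quad \text{for all } x, y \in \mathcal{X}
\]
```

A regret bound is an upper bound on $R_T$ as a function of $T$; a bound that grows sublinearly in $T$ means the average per-step regret $R_T / T$ goes to zero.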