43.0CLMay 21
Hallucination as Commitment Failure: Larger LLMs Misfire Despite Knowing the AnswerJewon Yeom, Jaewon Sok, Heejun Kim et al.
Hallucination is often viewed as a direct consequence of missing knowledge: a model answers incorrectly when the correct answer is absent from its generation-time distribution, and correctly when it is present. We test this assumption by introducing a semantic notion of answer availability that aggregates token-level variants expressing the same answer concept, and asks whether the correct concept is already available at the moment the model commits to an answer. Across Qwen and Llama models from 0.8B to 72B in both Instruct and Base variants, 16-47% of Instruct hallucinations occur with substantial probability mass already on the correct concept, and the rate rises monotonically with scale. Comparing such failures against correct generations with matched semantic support, the distinguishing factor is not whether the correct concept is represented, but how its probability is distributed: correct generations concentrate mass on a single surface form, hallucinations disperse it across alternatives. The same sharpening asymmetry extends across multi-token generation and is detectable in pre-generation hidden states. Together, these results identify a single mechanism: instruction tuning sharpens answer commitment with scale, making helpfulness and confident hallucination two consequences of the same underlying disposition.
57.7AIMay 12
From Noise to Diversity: Random Embedding Injection in LLM ReasoningHeejun Kim, Seungpil Lee, Jewon Yeom et al.
Recent soft prompt research has tried to improve reasoning by inserting trained vectors into LLM inputs, yet whether the gain comes from the learned content or from the act of injection itself has not been carefully separated. We study Random Soft Prompts (RSPs), which drop the training step entirely and append a freshly drawn sequence of random embedding vectors to the input. Each RSP vector is sampled from an isotropic Gaussian fitted to the entrywise mean and variance of the pretrained embedding table; the sequence carries no learned content, and yet reaches accuracy comparable to optimized soft prompts on math reasoning benchmarks in several settings. The mechanism unfolds in two stages: because attention has to absorb a never-seen-before random position, the distribution over the first few generated tokens flattens and reasoning trajectories branch, and as generation continues this influence dilutes naturally so the response commits to a single completion. We show that during inference RSPs lift early-stage token diversity and, combined with temperature sampling, widen Pass@N, the probability that at least one out of N attempts is correct. Beyond inference, we carry the same effect into DAPO training and demonstrate practical gains. Our contributions are: (i) RSP isolates the simplest form of soft prompt -- training-free, freshly resampled -- providing a unified lens for the structural effect of injection that variants otherwise differing in training and form all share; (ii) a theoretical and empirical validation of the underlying mechanism; and (iii) an extension from inference to training.
LGFeb 2
Efficient Epistemic Uncertainty Estimation for Large Language Models via Knowledge DistillationSeonghyeon Park, Jewon Yeom, Jaewon Sok et al.
Quantifying uncertainty in Large Language Models (LLMs) is essential for mitigating hallucinations and enabling risk-aware deployment in safety-critical tasks. However, estimating Epistemic Uncertainty(EU) via Deep Ensembles is computationally prohibitive at the scale of modern models. We propose a framework that leverages the small draft models to efficiently estimate token-level EU, bypassing the need for full-scale ensembling. Theoretically grounded in a Bias-Variance Decomposition, our approach approximates EU via Jensen-Shannon divergence among drafts (variance proxy) and KL divergence between the draft mixture and the target (bias proxy). To further ensure accuracy without significant overhead, we introduce Online Stochastic Distillation (OSD) to efficiently approximate target aggregation and the Data-Diverse Drafts (DDD) strategy to enhance draft diversity for better target approximation. Extensive experiments on GSM8K demonstrate that our method reduces the estimation error (RMSE) by up to 37% compared to baselines. Crucially, our approach achieves Hallucination Detection performance competitive with heavy perturbation-based methods like TokUR while incurring negligible inference costs, offering a practical solution for uncertainty-aware LLM deployment.
AIOct 15, 2024
Diffusion-Based Offline RL for Improved Decision-Making in Augmented ARC TaskYunho Kim, Jaehyun Park, Heejun Kim et al.
Effective long-term strategies enable AI systems to navigate complex environments by making sequential decisions over extended horizons. Similarly, reinforcement learning (RL) agents optimize decisions across sequences to maximize rewards, even without immediate feedback. To verify that Latent Diffusion-Constrained Q-learning (LDCQ), a prominent diffusion-based offline RL method, demonstrates strong reasoning abilities in multi-step decision-making, we aimed to evaluate its performance on the Abstraction and Reasoning Corpus (ARC). However, applying offline RL methodologies to enhance strategic reasoning in AI for solving tasks in ARC is challenging due to the lack of sufficient experience data in the ARC training set. To address this limitation, we introduce an augmented offline RL dataset for ARC, called Synthesized Offline Learning Data for Abstraction and Reasoning (SOLAR), along with the SOLAR-Generator, which generates diverse trajectory data based on predefined rules. SOLAR enables the application of offline RL methods by offering sufficient experience data. We synthesized SOLAR for a simple task and used it to train an agent with the LDCQ method. Our experiments demonstrate the effectiveness of the offline RL approach on a simple ARC task, showing the agent's ability to make multi-step sequential decisions and correctly identify answer states. These results highlight the potential of the offline RL approach to enhance AI's strategic reasoning capabilities.
SEJul 19, 2018
Loud and Interactive Paper Prototyping in Requirements Elicitation: What is it Good for?Zahra Shakeri Hossein Abad, Sania Moazzam, Christina Lo et al.
Requirements Engineering is a multidisciplinary and a human-centered process, therefore, the artifacts produced from RE are always error-prone. The most significant of these errors are missing or misunderstanding requirements. Information loss in RE could result in omitted logic in the software, which will be onerous to correct at the later stages of development. In this paper, we demonstrate and investigate how interactive and Loud Paper Prototyping (LPP) can be integrated to collect stakeholders' needs and expectations than interactive prototyping or face-to-face meetings alone. To this end, we conducted a case study of (1) 31 mobile application (App) development teams who applied either of interactive or loud prototyping and (2) 19 mobile App development teams who applied only the face-to-face meetings. From this study, we found that while using Silent Paper Prototyping (SPP) rather than No Paper Prototyping (NPP) is a more efficient technique to capture Non-Functional Requirements (NFRs), User Interface (UI) requirements, and existing requirements, LPP is more applicable to manage NFRs, UI requirements, as well as adding new requirements and removing/modifying the existing requirements. We also found that among LPP and SPP, LPP is more efficient to capture and influence Functional Requirements (FRs).