AI · Jun 1, 2025

Do not Abstain! Identify and Solve the Uncertainty

arXiv:2506.00780v1 · 8 citations · h-index: 9 · ACL
Originality: Incremental advance
AI Analysis

This work addresses unreliable LLM responses in uncertain situations, but the advance is incremental because it builds on existing uncertainty-handling approaches.

The paper tackles LLM overconfidence in uncertain scenarios by introducing ConfuseBench, a benchmark covering three types of uncertainty. It finds that current LLMs struggle to identify the root cause of uncertainty and tend to blame query ambiguity. The authors propose a method that combines context-aware inquiries with InteractDPO training, and they report that it is effective in experiments.

Despite the widespread application of Large Language Models (LLMs) across various domains, they frequently exhibit overconfidence when encountering uncertain scenarios. Existing solutions primarily rely on evasive responses (e.g., "I don't know"), which overlook the opportunity to identify and address the uncertainty and thereby generate more satisfactory responses. To systematically investigate and improve LLMs' ability to recognize and address the source of uncertainty, we introduce ConfuseBench, a benchmark that mainly focuses on three types of uncertainty: document scarcity, limited capability, and query ambiguity. Experiments with ConfuseBench reveal that current LLMs struggle to accurately identify the root cause of uncertainty and resolve it. They tend to attribute uncertainty to query ambiguity while overlooking capability limitations, and this tendency is especially pronounced in weaker models. To tackle this challenge, we first generate context-aware inquiries that highlight the confusing aspect of the original query. We then judge the source of uncertainty based on the uniqueness of the inquiry's answer. Finally, we use an on-policy training method, InteractDPO, to generate better inquiries. Experimental results demonstrate the efficacy of our approach.
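
The step of judging the source of uncertainty from the uniqueness of the inquiry's answer could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say how uniqueness is measured or how it maps to the three uncertainty types. The sketch assumes that several candidate answers to the inquiry have already been sampled, that agreement on one answer suggests a factual gap (document scarcity), and that disagreement suggests the answer depends on the user's intent (query ambiguity). The function names, the string normalization, and the `threshold` value are all hypothetical, and limited capability is not handled here.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(answer.lower().split())


def answer_uniqueness(answers: list[str]) -> float:
    """Fraction of sampled answers that agree with the most common answer."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(normalize(a) for a in answers)
    return counts.most_common(1)[0][1] / len(answers)


def judge_source(answers: list[str], threshold: float = 0.8) -> str:
    """Hypothetical rule: a (nearly) unique answer points to a missing fact,
    while many competing answers point to an ambiguous query.
    Both the mapping and the threshold are assumptions for illustration."""
    if answer_uniqueness(answers) >= threshold:
        return "document_scarcity"
    return "query_ambiguity"


if __name__ == "__main__":
    # The inquiry has one agreed-upon answer, so a document could resolve it.
    print(judge_source(["Paris", "paris ", "Paris"]))
    # The inquiry's answer depends on what the user meant.
    print(judge_source(["the 2019 model", "the 2021 model", "the sedan", "the 2019 model"]))
```

In practice, exact string matching would likely be replaced by a semantic comparison of answers, and a separate check would be needed to detect limited capability.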

