Disentangling Ambiguity from Instability in Large Language Models: A Clinical Text-to-SQL Case Study
This work addresses the challenge of deploying large language models in clinical settings by enabling targeted interventions for different types of uncertainty. The contribution is incremental, since it builds on existing uncertainty quantification methods.
The paper tackles the problem of distinguishing input ambiguity from model instability in large language models for clinical Text-to-SQL. It proposes CLUES, which decomposes semantic uncertainty into a separate score for each cause. CLUES improves failure prediction over state-of-the-art methods, and the analysis finds that 51% of errors fall in a high-ambiguity/high-instability regime that covers only 25% of queries.
Deploying large language models for clinical Text-to-SQL requires distinguishing two qualitatively different causes of output diversity: (i) input ambiguity, which should trigger clarification, and (ii) model instability, which should trigger human review. We propose CLUES, a framework that models Text-to-SQL as a two-stage process (interpretations → answers) and decomposes semantic uncertainty into an ambiguity score and an instability score. The instability score is computed via the Schur complement of a bipartite semantic graph matrix. Across AmbigQA/SituatedQA (gold interpretations) and a clinical Text-to-SQL benchmark (known interpretations), CLUES improves failure prediction over the state-of-the-art Kernel Language Entropy. In deployment settings, it remains competitive while providing a diagnostic decomposition unavailable from a single score. The resulting uncertainty regimes map to targeted interventions: query refinement for ambiguity and model improvement for instability. The high-ambiguity/high-instability regime contains 51% of errors while covering 25% of queries, enabling efficient triage.
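For readers unfamiliar with the Schur complement mentioned in the abstract, the following is a minimal sketch of the generic operation on a block matrix whose nodes are split into interpretations and answers. It is not the CLUES implementation: the abstract does not specify how the semantic graph matrix is built or how the Schur complement is turned into an instability score, so the toy matrix `M`, the node ordering, the helper name `schur_complement`, and the use of the trace as a scalar summary are all assumptions made for illustration.

```python
import numpy as np


def schur_complement(M: np.ndarray, k: int) -> np.ndarray:
    """Return D - C A^{-1} B for M = [[A, B], [C, D]] with A of size k x k."""
    A, B = M[:k, :k], M[:k, k:]
    C, D = M[k:, :k], M[k:, k:]
    # Solve A X = B rather than forming the inverse of A explicitly.
    return D - C @ np.linalg.solve(A, B)


# Hypothetical toy data: 2 interpretation nodes followed by 3 answer nodes.
n_interp, n_ans = 2, 3
rng = np.random.default_rng(0)
features = rng.normal(size=(n_interp + n_ans, 8))

# Symmetric positive definite stand-in for a semantic graph matrix (assumption).
M = features @ features.T + 0.1 * np.eye(n_interp + n_ans)

# Eliminate the interpretation block; what remains lives on the answer nodes.
S = schur_complement(M, n_interp)

# Scalar summaries before and after eliminating the interpretation block.
# Using the trace here is an assumption of this sketch, not the paper's score.
total = float(np.trace(M[n_interp:, n_interp:]))
residual = float(np.trace(S))
print(f"answer block trace: {total:.3f}, after elimination: {residual:.3f}")
```

When `M` is symmetric positive definite, `S` is too, and it matches the conditional covariance of the second block given the first in a Gaussian model. Under that reading, `S` captures variation among answers that is not explained by the interpretations, which is one plausible intuition for why such a quantity could serve as an instability signal.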