CL | Mar 29, 2024

LUQ: Long-text Uncertainty Quantification for LLMs

Cambridge
arXiv:2403.20279v3 | 90 citations | h-index: 10 | EMNLP
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of mitigating nonfactual LLM outputs in real-world applications that require long responses. It represents a domain-specific incremental advance.

The paper tackles uncertainty quantification (UQ) for long-text generation in LLMs, a setting that existing methods handle poorly because they focus on short text. It introduces LUQ and its variations and reports that LUQ correlates with factuality scores more strongly than baseline methods (coefficient of -0.85 for Gemini Pro), and that LUQ-Ensemble improves response factuality over the best standalone model.

Large Language Models (LLMs) have demonstrated remarkable capability in a variety of NLP tasks. However, LLMs are also prone to generating nonfactual content. Uncertainty Quantification (UQ) is pivotal in enhancing our understanding of a model's confidence in its generation, thereby aiding in the mitigation of nonfactual outputs. Existing research on UQ predominantly targets short text generation, typically yielding brief, word-limited responses. However, real-world applications frequently necessitate much longer responses. Our study first highlights the limitations of current UQ methods in handling long text generation. We then introduce LUQ and its two variations, a series of novel sampling-based UQ approaches specifically designed for long text. Our findings reveal that LUQ outperforms existing baseline methods in correlating with the model's factuality scores (negative coefficient of -0.85 observed for Gemini Pro). To further improve the factuality of LLM responses, we propose LUQ-Ensemble, a method that ensembles responses from multiple models and selects the response with the lowest uncertainty. The ensembling method greatly improves response factuality over the best standalone LLM.
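
The selection step described in the abstract (sample several responses per model, score uncertainty, keep the response from the least uncertain model) can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify how LUQ scores consistency, so the token-overlap similarity below is a deliberately crude stand-in, and the function names (`jaccard`, `sample_uncertainty`, `select_lowest_uncertainty`) are hypothetical.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1].

    ASSUMPTION: a placeholder consistency measure, not the scorer used by LUQ.
    """
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def sample_uncertainty(samples: list[str]) -> float:
    """Uncertainty as 1 minus the mean pairwise similarity of sampled responses."""
    if len(samples) < 2:
        raise ValueError("need at least two sampled responses")
    pairs = list(combinations(samples, 2))
    mean_similarity = sum(jaccard(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_similarity


def select_lowest_uncertainty(
    samples_by_model: dict[str, list[str]],
) -> tuple[str, str, float]:
    """Return (model name, its first sample, uncertainty) for the least uncertain model."""
    scores = {name: sample_uncertainty(s) for name, s in samples_by_model.items()}
    best = min(scores, key=scores.get)
    return best, samples_by_model[best][0], scores[best]


if __name__ == "__main__":
    # Toy data: model_a answers consistently, model_b does not.
    toy = {
        "model_a": ["Paris is the capital of France"] * 3,
        "model_b": [
            "Lyon is the capital",
            "Paris is the capital of France",
            "Marseille was once the capital",
        ],
    }
    print(select_lowest_uncertainty(toy))
```

In this toy run the consistent model is selected because its samples agree exactly. A realistic long-text scorer would need something stronger than word overlap, which is the gap the paper targets.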

Code Implementations: 1 repo
