CL AI LG ML | Jun 20, 2025

UProp: Investigating the Uncertainty Propagation of LLMs in Multi-Step Agentic Decision-Making

Jinhao Duan, James Diffenderfer, Sandeep Madireddy, Tianlong Chen, Bhavya Kailkhura, Kaidi Xu

arXiv:2506.17419v1 | 17 citations | h-index: 15 | Has Code

Originality: Incremental advance

AI Analysis

The paper addresses the need for reliable uncertainty estimation in safety-critical LLM applications. The advance is incremental, since it builds on existing UQ methods and extends them to sequential tasks.

The paper tackles uncertainty quantification for multi-step decision-making by LLMs. It introduces UProp, a framework that decomposes uncertainty into internal and extrinsic components, and reports that UProp significantly outperforms existing single-turn methods on benchmarks such as AgentBench and HotpotQA with models such as GPT-4.1 and DeepSeek-V3.

As Large Language Models (LLMs) are integrated into safety-critical applications involving sequential decision-making in the real world, it is essential to know when to trust LLM decisions. Existing LLM Uncertainty Quantification (UQ) methods are primarily designed for single-turn question-answering formats, leaving multi-step decision-making scenarios, e.g., LLM agentic systems, underexplored. In this paper, we introduce a principled, information-theoretic framework that decomposes LLM sequential decision uncertainty into two parts: (i) internal uncertainty intrinsic to the current decision, which is the focus of existing UQ methods, and (ii) extrinsic uncertainty, a Mutual-Information (MI) quantity describing how much uncertainty should be inherited from preceding decisions. We then propose UProp, an efficient and effective extrinsic uncertainty estimator that converts the direct estimation of MI to the estimation of Pointwise Mutual Information (PMI) over multiple Trajectory-Dependent Decision Processes (TDPs). UProp is evaluated on extensive multi-step decision-making benchmarks, e.g., AgentBench and HotpotQA, with state-of-the-art LLMs, e.g., GPT-4.1 and DeepSeek-V3. Experimental results demonstrate that UProp significantly outperforms existing single-turn UQ baselines equipped with thoughtful aggregation strategies. Moreover, we provide a comprehensive analysis of UProp, including sampling efficiency, potential applications, and intermediate uncertainty propagation, to demonstrate its effectiveness. Code will be available at https://github.com/jinhaoduan/UProp.
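
The abstract describes a split into a term intrinsic to the current decision and an MI term inherited from earlier decisions, with MI obtained from PMI. A textbook identity with that shape is the entropy chain rule, H(Y) = H(Y|X) + I(X;Y), where I(X;Y) is the expected PMI. The sketch below checks that identity numerically on a toy joint distribution. It is an illustration only: the distribution, the action names, and the mapping of X to "previous decision" and Y to "current decision" are assumptions, and this is not the UProp estimator or the paper's exact formulation.

```python
import math

# Hypothetical toy joint distribution p(x, y), invented for illustration:
# x = previous decision, y = current decision.
JOINT = {
    ("search", "answer"): 0.40,
    ("search", "retry"): 0.10,
    ("guess", "answer"): 0.15,
    ("guess", "retry"): 0.35,
}


def marginal(joint, axis):
    """Marginal distribution over one coordinate of the (x, y) keys."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out


def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def conditional_entropy(joint):
    """H(Y | X) in bits: uncertainty left in y once x is known."""
    px = marginal(joint, 0)
    return -sum(p * math.log2(p / px[x]) for (x, _), p in joint.items() if p > 0)


def pmi(joint, x, y):
    """Pointwise mutual information log2 p(x, y) / (p(x) p(y))."""
    px, py = marginal(joint, 0), marginal(joint, 1)
    return math.log2(joint[(x, y)] / (px[x] * py[y]))


def mutual_information(joint):
    """I(X; Y) computed as the expectation of PMI under p(x, y)."""
    return sum(p * pmi(joint, x, y) for (x, y), p in joint.items() if p > 0)


h_y = entropy(marginal(JOINT, 1))          # total uncertainty of the current decision
h_y_given_x = conditional_entropy(JOINT)   # part intrinsic to the current decision
mi = mutual_information(JOINT)             # part shared with the previous decision

print(f"H(Y)            = {h_y:.4f} bits")
print(f"H(Y|X) + I(X;Y) = {h_y_given_x + mi:.4f} bits")
```

The two printed values agree up to floating-point error, which is the chain rule at work. With an LLM agent the joint distribution is not available in closed form, so any practical estimator has to work from sampled trajectories; how UProp does this is specified in the paper, not here.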

