ML LG Jul 15, 2025

LLMs are Bayesian, in Expectation, not in Realization

Leon Chlon, Sarah Rashidi, Zein Khamis, MarcAntonio M. Awada

arXiv:2507.11768v118.07 citations h-index: 2

Originality: Incremental advance

AI Analysis

The paper addresses uncertainty quantification for AI deployment in critical applications and provides practical methods for calibration and efficiency. It is an incremental advance because it builds on existing Bayesian frameworks.

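The paper tackles an apparent contradiction: in-context learning has been modeled as implicit Bayesian inference, yet transformers violate the martingale property that Bayesian updating on exchangeable data requires. It shows that transformers are still information-theoretically optimal in expectation over orderings, with excess risk O(n^{-1/2}), and that they reach 99% of theoretical entropy limits within 20 examples.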
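As a reference point for the martingale property mentioned above, the following is a minimal sketch of what the property looks like for an exact Bayesian learner. It assumes a Beta-Bernoulli model, which is a textbook example chosen here for illustration and is not taken from the paper; the function names and the uniform Beta(1, 1) prior are likewise assumptions. For such a learner, the expected next-step predictive probability equals the current one, so the gap below is exactly zero.

```python
from fractions import Fraction


def posterior_predictive(successes, n, a=1, b=1):
    """P(next observation = 1 | data) under a Beta(a, b) prior (assumed model)."""
    return Fraction(a + successes, a + b + n)


def expected_next_predictive(successes, n, a=1, b=1):
    """Expected predictive after one more observation, averaged over
    the learner's own current predictive distribution."""
    p = posterior_predictive(successes, n, a, b)
    if_one = posterior_predictive(successes + 1, n + 1, a, b)
    if_zero = posterior_predictive(successes, n + 1, a, b)
    return p * if_one + (1 - p) * if_zero


def martingale_gap(successes, n, a=1, b=1):
    """Zero for an exact Bayesian learner on exchangeable data."""
    return expected_next_predictive(successes, n, a, b) - posterior_predictive(
        successes, n, a, b
    )


if __name__ == "__main__":
    for successes, n in [(0, 0), (3, 10), (17, 20)]:
        print(successes, n, martingale_gap(successes, n))
```

A predictor whose output depends on the position of examples in the context, and not only on their counts, would generally produce a nonzero gap in this kind of check.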

Large language models demonstrate remarkable in-context learning capabilities, adapting to new tasks without parameter updates. While this phenomenon has been successfully modeled as implicit Bayesian inference, recent empirical findings reveal a fundamental contradiction: transformers systematically violate the martingale property, a cornerstone requirement of Bayesian updating on exchangeable data. This violation challenges the theoretical foundations underlying uncertainty quantification in critical applications. Our theoretical analysis establishes four key results: (1) positional encodings induce martingale violations of order $\Theta(\log n / n)$; (2) transformers achieve information-theoretic optimality with excess risk $O(n^{-1/2})$ in expectation over orderings; (3) the implicit posterior representation converges to the true Bayesian posterior in the space of sufficient statistics; and (4) we derive the optimal chain-of-thought length as $k^* = \Theta(\sqrt{n}\log(1/\varepsilon))$ with explicit constants, providing a principled approach to reduce inference costs while maintaining performance. Empirical validation on GPT-3 confirms predictions (1)-(3), with transformers reaching 99% of theoretical entropy limits within 20 examples. Our framework provides practical methods for extracting calibrated uncertainty estimates from position-aware architectures and optimizing computational efficiency in deployment.
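To make the scaling in result (4) concrete, here is a small sketch that evaluates $\sqrt{n}\log(1/\varepsilon)$ for a few context sizes. The abstract states that explicit constants are derived but does not give them, so the multiplier `c` below is a placeholder assumption, as is the function name `cot_length`; the numbers illustrate the growth rate only and are not the paper's recommended lengths.

```python
import math


def cot_length(n, eps, c=1.0):
    """Chain-of-thought length following the sqrt(n) * log(1/eps) scaling.

    n   : number of in-context examples
    eps : target error tolerance, 0 < eps < 1
    c   : placeholder constant (assumption; the paper's constants are not
          given in the abstract)
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 < eps < 1:
        raise ValueError("eps must lie in (0, 1)")
    return math.ceil(c * math.sqrt(n) * math.log(1.0 / eps))


if __name__ == "__main__":
    for n in (25, 100, 400):
        print(n, cot_length(n, eps=0.01))
```

Under this scaling, quadrupling the number of examples roughly doubles the chain-of-thought length, while tightening the tolerance adds only a logarithmic factor.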
