Yanlin Qu

h-index42
2papers

2 Papers

LGOct 8, 2025
A Broader View of Thompson Sampling

Yanlin Qu, Hongseok Namkoong, Assaf Zeevi

Thompson Sampling is one of the most widely used and studied bandit algorithms, known for its simple structure, low regret performance, and solid theoretical guarantees. Yet, in stark contrast to most other families of bandit algorithms, the exact mechanism through which posterior sampling (as introduced by Thompson) is able to "properly" balance exploration and exploitation, remains a mystery. In this paper we show that the core insight to address this question stems from recasting Thompson Sampling as an online optimization algorithm. To distill this, a key conceptual tool is introduced, which we refer to as "faithful" stationarization of the regret formulation. Essentially, the finite horizon dynamic optimization problem is converted into a stationary counterpart which "closely resembles" the original objective (in contrast, the classical infinite horizon discounted formulation, that leads to the Gittins index, alters the problem and objective in too significant a manner). The newly crafted time invariant objective can be studied using Bellman's principle which leads to a time invariant optimal policy. When viewed through this lens, Thompson Sampling admits a simple online optimization form that mimics the structure of the Bellman-optimal policy, and where greediness is regularized by a measure of residual uncertainty based on point-biserial correlation. This answers the question of how Thompson Sampling balances exploration-exploitation, and moreover, provides a principled framework to study and further improve Thompson's original idea.

LGAug 22, 2025
Deep Learning for Markov Chains: Lyapunov Functions, Poisson's Equation, and Stationary Distributions

Yanlin Qu, Jose Blanchet, Peter Glynn

Lyapunov functions are fundamental to establishing the stability of Markovian models, yet their construction typically demands substantial creativity and analytical effort. In this paper, we show that deep learning can automate this process by training neural networks to satisfy integral equations derived from first-transition analysis. Beyond stability analysis, our approach can be adapted to solve Poisson's equation and estimate stationary distributions. While neural networks are inherently function approximators on compact domains, it turns out that our approach remains effective when applied to Markov chains on non-compact state spaces. We demonstrate the effectiveness of this methodology through several examples from queueing theory and beyond.