GT · Mar 13

A Mathematical Programming Approach to Computing and Learning Berk-Nash Equilibria in Infinite-Horizon MDPs

arXiv:2603.13641 · 49.6 · h-index: 5
AI Analysis

This work addresses model misspecification in reinforcement learning. It is an incremental advance that contributes a novel algorithmic approach.

The paper tackles sequential decision-making under model misspecification in infinite-horizon Markov decision processes. It characterizes Berk-Nash equilibria via mathematical programming and introduces entropy regularization to smooth the otherwise non-smooth best-response correspondence. The resulting online learning scheme shows sublinear regret and convergence to the KL-minimizing model in numerical experiments.

We study sequential decision-making when the agent's internal model class is misspecified. Within the infinite-horizon Berk-Nash framework, stable behavior arises as a fixed point: the agent acts optimally relative to a subjective model, while that model is statistically consistent with the long-run data endogenously generated by the policy itself. We provide a rigorous characterization of this equilibrium via coupled linear programs and a bilevel optimization formulation. To address the intrinsic non-smoothness of standard best-response correspondences, we introduce entropy regularization, establishing the existence of a unique soft Bellman fixed point and a smooth objective. Exploiting this regularity, we develop an online learning scheme that casts model selection as an adversarial bandit problem using an EXP3-type update, augmented by a novel conjecture-set zooming mechanism that adaptively refines the parameter space. Numerical results demonstrate effective exploration-exploitation trade-offs, convergence to the KL-minimizing model, and sublinear regret.
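
The two algorithmic ingredients named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the tabular layout `P[s][a][s2]`, the names `tau` and `explore`, and the assumption that each candidate model is an arm with a reward scaled to [0, 1] are all hypothetical choices made here for illustration. The first function iterates a standard entropy-regularized (soft) Bellman operator, which is a contraction and therefore has a unique fixed point. The other two implement the textbook EXP3 update. The conjecture-set zooming mechanism is not shown.

```python
import math


def soft_value_iteration(P, r, gamma=0.9, tau=0.5, tol=1e-10, max_iter=10_000):
    """Iterate a soft Bellman operator until successive values differ by < tol.

    P[s][a][s2]: transition probabilities of a subjective model (assumed layout).
    r[s][a]:     one-step rewards.
    tau:         entropy temperature (hypothetical name); must be positive.
    """
    V = [0.0] * len(P)
    for _ in range(max_iter):
        V_new = []
        for s in range(len(P)):
            q = [
                r[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
                for a in range(len(P[s]))
            ]
            m = max(q)  # subtract the max for a numerically stable log-sum-exp
            V_new.append(m + tau * math.log(sum(math.exp((x - m) / tau) for x in q)))
        if max(abs(a - b) for a, b in zip(V, V_new)) < tol:
            return V_new
        V = V_new
    return V


def exp3_probs(weights, explore):
    """Mix the normalized weights with uniform exploration (textbook EXP3)."""
    total = sum(weights)
    k = len(weights)
    return [(1.0 - explore) * w / total + explore / k for w in weights]


def exp3_update(weights, arm, reward, explore):
    """Return new weights after observing `reward` in [0, 1] for the pulled arm.

    Here an arm stands for one candidate model (an assumption of this sketch).
    """
    probs = exp3_probs(weights, explore)
    estimate = reward / probs[arm]  # importance-weighted reward estimate
    new_weights = list(weights)
    new_weights[arm] *= math.exp(explore * estimate / len(weights))
    return new_weights


# One state, two zero-reward actions: the fixed point is tau * log(2) / (1 - gamma).
V = soft_value_iteration([[[1.0], [1.0]]], [[0.0, 0.0]], gamma=0.5, tau=1.0)
```

In a longer run the weights would need periodic renormalization to avoid overflow, and how the reward signal is derived from observed data (for example from a likelihood score) is left open here.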
