Coarse Q-learning: Indifference vs. Indeterminacy vs. Instability
For researchers in reinforcement learning and decision theory, this work offers a framework for understanding how coarse aggregation of alternatives can produce qualitatively different long-run behaviors, including instability and limit cycles.
Coarse Q-learning (CQL) models bandit problems with stochastically varying menus in which alternatives are partitioned into similarity classes and feedback is pooled within each class. Depending on the environment, the model exhibits multiple stable equilibria, a unique globally stable mixed equilibrium, or a stable limit cycle, phenomena that are absent from the standard alternative-level benchmark.
We introduce Coarse Q-learning (CQL), a reinforcement-learning model for bandit problems with stochastically varying menus. Alternatives are exogenously partitioned into similarity classes, and feedback from sampled alternatives is pooled within classes into class-level valuations. Choices follow multinomial logit over class valuations, and valuations update toward realized payoffs as in Q-learning. Using stochastic approximation, we derive the mean-field dynamics and characterize the steady states as smooth analogues of Valuation Equilibria. The model yields novel long-run phenomena in the high payoff-sensitivity limit: depending on the environment, CQL may exhibit multiple stable strict equilibria, a unique globally stable mixed equilibrium with indifference across classes, or no stable equilibrium at all, with valuations and choice probabilities converging instead to a stable limit cycle. These outcomes are driven by coarse aggregation and do not arise in the standard alternative-level benchmark.
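The learning rule described above can be illustrated with a minimal simulation sketch. It is not the authors' implementation. The function names (`logit_probs`, `cql_step`, `simulate`), the four-alternative environment, the menu distribution, the Gaussian payoff noise, the constant step size `alpha`, and the rule that an alternative is drawn uniformly within the chosen class are all hypothetical choices made for illustration. Only the overall structure (logit choice over class valuations, then a Q-learning update of the chosen class toward the realized payoff) follows the abstract.

```python
import numpy as np

def logit_probs(valuations, beta):
    """Multinomial logit over valuations with payoff sensitivity beta."""
    z = beta * (valuations - np.max(valuations))  # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()

def cql_step(v, menu, class_of, mean_payoff, beta, alpha, rng):
    """One period: pick a class by logit, sample an alternative, update the class."""
    classes = np.unique(class_of[menu])      # classes represented in this menu
    probs = logit_probs(v[classes], beta)    # choice is made at the class level
    c = rng.choice(classes, p=probs)
    # Assumption: within the chosen class, an available alternative
    # is drawn uniformly at random.
    candidates = menu[class_of[menu] == c]
    a = rng.choice(candidates)
    # Assumption: payoff is the alternative's mean plus Gaussian noise.
    payoff = mean_payoff[a] + rng.normal(scale=0.1)
    v = v.copy()
    v[c] += alpha * (payoff - v[c])          # pooled, class-level Q-learning update
    return v, c, a

def simulate(n_steps=2000, beta=5.0, alpha=0.05, seed=0):
    """Run CQL on a small hypothetical environment and return final valuations."""
    rng = np.random.default_rng(seed)
    class_of = np.array([0, 0, 1, 1])              # alternatives 0,1 -> class 0; 2,3 -> class 1
    mean_payoff = np.array([1.0, 0.0, 0.6, 0.4])   # hypothetical mean payoffs
    menus = [np.array([0, 2]), np.array([1, 3]), np.array([0, 1, 2, 3])]
    v = np.zeros(2)                                # one valuation per class
    for _ in range(n_steps):
        menu = menus[rng.integers(len(menus))]     # menu varies stochastically
        v, _, _ = cql_step(v, menu, class_of, mean_payoff, beta, alpha, rng)
    return v

if __name__ == "__main__":
    print(simulate())
```

Because feedback from alternatives 0 and 1 is pooled into one valuation, the value the agent attaches to class 0 depends on how often each of the two is sampled, which in turn depends on the menus and on the current choice probabilities. This feedback loop is the channel through which coarse aggregation can generate the long-run behaviors listed in the abstract; the sketch makes no claim about which regime this particular toy environment falls into.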