Adaptive Exploration for Latent-State Bandits
This work addresses sequential decision-making under uncertainty with unobserved confounders, with applications such as recommendation systems and adaptive control. The contribution appears incremental, since it builds on existing bandit frameworks, although the adaptations themselves are novel.
The paper tackles multi-armed bandits with hidden, time-varying states, which bias reward estimates and lead to suboptimal decisions. It introduces state-model-free algorithms that use lagged contexts and probing strategies to track latent states implicitly and disambiguate state-dependent reward patterns. The authors report superior performance over classical methods across diverse settings.
The multi-armed bandit problem is a core framework for sequential decision-making under uncertainty, but classical algorithms often fail in environments with hidden, time-varying states that confound reward estimation and optimal action selection. We address key challenges arising from unobserved confounders, such as biased reward estimates and limited state information, by introducing a family of state-model-free bandit algorithms that leverage lagged contextual features and coordinated probing strategies. These implicitly track latent states and disambiguate state-dependent reward patterns. Our methods and their adaptive variants can learn optimal policies without explicit state modeling, combining computational efficiency with robust adaptation to non-stationary rewards. Empirical results across diverse settings demonstrate superior performance over classical approaches, and we provide practical recommendations for algorithm selection in real-world applications.
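To make the idea of lagged contextual features concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a hypothetical two-state environment with a slowly flipping hidden state, uses the previous (arm, reward) pair as the lagged context, and substitutes plain epsilon-greedy exploration for the coordinated probing strategies, which the abstract does not specify. The names LaggedContextBandit, run, flip_prob, and the 0.9/0.1 reward probabilities are all illustrative assumptions.

```python
import random
from collections import defaultdict


class LaggedContextBandit:
    """Epsilon-greedy bandit whose value estimates are keyed on a lagged context.

    The context is the previous (arm, reward) pair, used as a proxy for the
    hidden state. Illustrative stand-in only; not the paper's method.
    """

    def __init__(self, n_arms, epsilon=0.1, seed=0):
        self.n_arms = n_arms
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        # One table of pull counts and running means per lagged context.
        self.counts = defaultdict(lambda: [0] * n_arms)
        self.means = defaultdict(lambda: [0.0] * n_arms)
        self.context = None  # no lag is available before the first pull

    def select_arm(self):
        # Explore uniformly with probability epsilon (assumed exploration rule).
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.n_arms)
        # Otherwise exploit the best estimate for the current lagged context.
        means = self.means[self.context]
        return max(range(self.n_arms), key=lambda a: means[a])

    def update(self, arm, reward):
        counts = self.counts[self.context]
        means = self.means[self.context]
        counts[arm] += 1
        # Incremental running-mean update for this (context, arm) pair.
        means[arm] += (reward - means[arm]) / counts[arm]
        # The pair just observed becomes the context for the next decision.
        self.context = (arm, reward)


def run(agent, n_steps=2000, flip_prob=0.05, seed=1):
    """Simulate a hypothetical two-arm environment with a hidden binary state.

    The arm matching the hidden state pays 1 with probability 0.9, the other
    arm with probability 0.1. The agent never observes the state itself.
    """
    rng = random.Random(seed)
    state = 0
    total = 0
    for _ in range(n_steps):
        if rng.random() < flip_prob:
            state = 1 - state  # hidden state flips occasionally
        arm = agent.select_arm()
        p = 0.9 if arm == state else 0.1
        reward = 1 if rng.random() < p else 0
        agent.update(arm, reward)
        total += reward
    return total / n_steps


if __name__ == "__main__":
    agent = LaggedContextBandit(n_arms=2)
    print("average reward:", run(agent))
```

Because the hidden state changes slowly in this toy setup, the last observed (arm, reward) pair carries information about the current state, so per-context estimates can separate reward patterns that a single context-free average would blend together. Longer lags or richer features would be natural extensions, but those choices are not described in the source.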