LG AI ML Jun 15, 2020

Non-Stationary Off-Policy Optimization

arXiv:2006.08236v3, 2 citations
AI Analysis

This paper addresses the challenge of adapting offline-learned policies to non-stationary real-world environments. It offers a practical solution with theoretical guarantees, though the contribution is incremental because it builds on existing off-policy learning frameworks.

The paper tackles off-policy optimization in non-stationary environments, specifically piecewise-stationary contextual bandits. It proposes a two-phase method: an offline phase that partitions logged data into latent states and learns a sub-policy for each, and an online phase that adaptively switches between those sub-policies. The method outperforms baselines on synthetic and real-world datasets.

Off-policy learning is a framework for evaluating and optimizing policies without deploying them, from data collected by another policy. Real-world environments are typically non-stationary and the offline learned policies should adapt to these changes. To address this challenge, we study the novel problem of off-policy optimization in piecewise-stationary contextual bandits. Our proposed solution has two phases. In the offline learning phase, we partition logged data into categorical latent states and learn a near-optimal sub-policy for each state. In the online deployment phase, we adaptively switch between the learned sub-policies based on their performance. This approach is practical and analyzable, and we provide guarantees on both the quality of off-policy optimization and the regret during online deployment. To show the effectiveness of our approach, we compare it to state-of-the-art baselines on both synthetic and real-world datasets. Our approach outperforms methods that act only on observed context.
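The two phases described in the abstract can be illustrated with a deliberately simplified sketch. It is not the authors' algorithm. It assumes the latent-state labels are already given (the paper infers them from the logged data), it ignores the observed context, it uses a plain inverse-propensity-scored (IPS) estimate to pick one action per latent state, and it swaps in an Exp3-style rule with a hypothetical forgetting factor `decay` for the online switching. The names `learn_sub_policies`, `switch_online`, `reward_fn`, `eta`, `gamma`, and `decay` are all illustrative.

```python
import numpy as np

def learn_sub_policies(states, actions, rewards, propensities, n_states, n_actions):
    """Offline phase (simplified): for each latent state, pick the action
    with the highest IPS reward estimate. State labels are assumed given."""
    policies = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        mask = states == s
        n = max(int(mask.sum()), 1)  # avoid division by zero for empty states
        est = np.zeros(n_actions)
        for a in range(n_actions):
            hit = mask & (actions == a)
            # IPS: reweight each logged reward by 1 / logging propensity
            est[a] = np.sum(rewards[hit] / propensities[hit]) / n
        policies[s] = int(np.argmax(est))
    return policies

def switch_online(sub_policies, reward_fn, horizon,
                  eta=0.5, gamma=0.1, decay=0.9, seed=0):
    """Online phase (simplified): Exp3-style weights over sub-policies.
    The decay on the log-weights is a heuristic assumption that lets the
    switcher forget stale evidence after the environment changes."""
    rng = np.random.default_rng(seed)
    k = len(sub_policies)
    log_w = np.zeros(k)
    chosen = []
    for t in range(horizon):
        p = np.exp(log_w - log_w.max())
        p = (1 - gamma) * p / p.sum() + gamma / k  # mix in uniform exploration
        i = int(rng.choice(k, p=p))
        r = reward_fn(t, sub_policies[i])          # reward in [0, 1]
        log_w *= decay                             # forget old evidence
        log_w[i] += eta * r / p[i]                 # importance-weighted update
        chosen.append(i)
    return chosen

# Toy logged data: action 0 pays off in state 0, action 1 in state 1.
states = np.array([0, 0, 1, 1])
actions = np.array([0, 1, 0, 1])
rewards = np.array([1.0, 0.0, 0.0, 1.0])
propensities = np.full(4, 0.5)
sub_policies = learn_sub_policies(states, actions, rewards, propensities, 2, 2)

# Toy environment in which only action 1 is rewarded.
chosen = switch_online(sub_policies, lambda t, a: float(a == 1), horizon=200)
```

In this toy setup the offline phase recovers one preferred action per latent state, and the online switcher concentrates its choices on the sub-policy that is currently being rewarded. A faithful implementation would learn context-dependent sub-policies and use the switching rule analyzed in the paper.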
