Exploring Offline Policy Evaluation for the Continuous-Armed Bandit Problem
This work addresses offline evaluation of sequential decision-making policies in continuous action settings, a problem relevant to both researchers and practitioners; the contribution is incremental, since it adapts an existing method.
The paper tackles offline policy evaluation for the continuous-armed bandit problem by extending an existing method to continuous action sets, and shows empirically that the extension yields a relatively consistent ranking of policies.
The (contextual) multi-armed bandit (MAB) problem provides a formalization of sequential decision-making that has many applications. However, validly evaluating MAB policies is challenging: we resort either to simulations, which inherently include debatable assumptions, or to expensive field trials. Recently, an offline evaluation method has been suggested that is based on empirical data, thus relaxing these assumptions, and that can be used to evaluate multiple competing policies in parallel. This method is, however, not directly suited to the continuous-armed bandit (CAB) problem, a frequently encountered version of the MAB problem in which the action set is continuous instead of discrete. We propose and evaluate an extension of the existing method such that it can be used to evaluate CAB policies. We empirically demonstrate that our method provides a relatively consistent ranking of policies. Furthermore, we detail how our method can be used to select policies in a real-life CAB problem.
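To make the idea of data-based offline evaluation concrete, the following is a minimal sketch of a replay-style evaluator for a one-dimensional continuous action. It is not taken from the text above, which does not spell out the extension. It rests on stated assumptions: the logged actions were drawn uniformly at random, and a logged event is accepted whenever the policy's proposed action lies within a tolerance `delta` of the logged action. The names `replay_evaluate`, `FixedActionPolicy`, `make_log`, and `delta` are hypothetical and chosen for illustration only.

```python
import random


class FixedActionPolicy:
    """Hypothetical toy policy that always proposes the same action."""

    def __init__(self, action):
        self.action = action

    def choose(self):
        return self.action

    def update(self, action, reward):
        pass  # a learning policy would update its estimates here


def replay_evaluate(logged_events, policy, delta):
    """Replay logged (action, reward) pairs against a policy.

    Assumption: an event counts only if the proposed action is within
    `delta` of the logged action; all other events are skipped.
    Returns (mean reward over accepted events, number of accepted events).
    """
    total_reward = 0.0
    accepted = 0
    for logged_action, reward in logged_events:
        proposed = policy.choose()
        if abs(proposed - logged_action) <= delta:
            # The reward was observed for the logged action, so feed that back.
            policy.update(logged_action, reward)
            total_reward += reward
            accepted += 1
    mean_reward = total_reward / accepted if accepted else float("nan")
    return mean_reward, accepted


def make_log(n, seed=0):
    """Synthetic log: uniform random actions in [0, 1], reward peaked at 0.7."""
    rng = random.Random(seed)
    log = []
    for _ in range(n):
        action = rng.uniform(0.0, 1.0)
        log.append((action, 1.0 - (action - 0.7) ** 2))
    return log


# The same log is reused for every candidate policy.
log = make_log(5000)
candidates = [
    ("near optimum", FixedActionPolicy(0.7)),
    ("far from optimum", FixedActionPolicy(0.2)),
]
for name, policy in candidates:
    mean_reward, n_accepted = replay_evaluate(log, policy, delta=0.05)
    print(f"{name}: mean reward {mean_reward:.3f} over {n_accepted} accepted events")
```

In this sketch the tolerance trades off two concerns: a larger `delta` accepts more logged events but matches the proposed and logged actions less precisely, while a smaller `delta` matches them more closely at the cost of discarding more data. Because every policy is replayed against the same log, several candidates can be compared without additional data collection.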