Infinite Action Contextual Bandits with Reusable Data Exhaust
This work removes a key obstacle to adopting smoothed regret methods in production. It benefits machine learning and data science practitioners who need to run offline analyses, such as model selection, on logged data.
The paper tackles the problem of generating well-defined importance weights in infinite action contextual bandits, which downstream data science processes such as offline model selection require. It does so with an online algorithm that keeps the smoothed regret guarantee while increasing computational cost only to order smoothness, which remains independent of the action set.
For infinite action contextual bandits, smoothed regret combined with reduction to regression yields state-of-the-art online performance with computational cost independent of the action set. Unfortunately, the resulting data exhaust does not have well-defined importance weights. This frustrates the execution of downstream data science processes such as offline model selection. In this paper we describe an online algorithm with an equivalent smoothed regret guarantee that also generates well-defined importance weights. In exchange, the online computational cost increases, but only to order smoothness (i.e., it is still independent of the action set). This removes a key obstacle to the adoption of smoothed regret in production scenarios.
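A minimal sketch of inverse propensity scoring (IPS) shows why offline model selection needs well-defined importance weights. It makes several assumptions. The action space is one-dimensional, [0, 1]. The logging process records the density p(a | x) of every action it chose. Each candidate policy is smoothed by spreading its mass uniformly over a window of width h, so its density never exceeds 1/h. The names `LoggedEvent`, `smoothed_density`, `ips_estimate`, and `uniform_log`, the uniform logging policy, and the toy reward are hypothetical illustrations. This is standard IPS evaluation, not the paper's online algorithm.

```python
# Hypothetical illustration: plain IPS over logged continuous actions,
# not the paper's algorithm. All names and the toy reward are assumptions.
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class LoggedEvent:
    context: float   # observed context x (a scalar in [0, 1] here)
    action: float    # continuous action a in [0, 1] chosen online
    reward: float    # observed reward r
    density: float   # logging density p(a | x); must be known and > 0


def smoothed_density(center: float, action: float, h: float) -> float:
    """Density at `action` of a uniform window of width h (0 < h <= 1)
    around `center`, shifted so the window stays inside [0, 1].
    The density never exceeds 1/h, which is what "smoothed" means here."""
    lo = min(max(center - h / 2, 0.0), 1.0 - h)
    return 1.0 / h if lo <= action <= lo + h else 0.0


def ips_estimate(log: List[LoggedEvent],
                 target_center: Callable[[float], float],
                 h: float) -> float:
    """Average of r * pi(a | x) / p(a | x) over the log.
    Computable only because every event carries its logging density."""
    total = 0.0
    for e in log:
        weight = smoothed_density(target_center(e.context), e.action, h) / e.density
        total += weight * e.reward
    return total / len(log)


def uniform_log(n: int, rng: random.Random) -> List[LoggedEvent]:
    """Hypothetical data exhaust: actions drawn uniformly on [0, 1]
    (density 1), with a toy reward 1 - |a - x| that peaks at a = x."""
    log = []
    for _ in range(n):
        x = rng.random()
        a = rng.random()
        log.append(LoggedEvent(x, a, 1.0 - abs(a - x), 1.0))
    return log


if __name__ == "__main__":
    data = uniform_log(100_000, random.Random(0))
    # Compare two smoothed candidate policies offline, without deploying them.
    for name, f in [("a = x", lambda x: x), ("a = 0.5", lambda x: 0.5)]:
        print(name, round(ips_estimate(data, f, h=0.2), 3))
```

Under the toy reward 1 - |a - x| with h = 0.2, the true values of the two candidates are about 0.95 and 0.75, so the estimates should rank `a = x` first. If the logged `density` were unavailable, as it is for data exhaust without well-defined importance weights, `ips_estimate` could not be computed at all.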