LG · AI · ML · Sep 17, 2020

Online Semi-Supervised Learning in Contextual Bandits with Episodic Reward

arXiv:2009.08457v2 · 14 citations
AI Analysis

This paper addresses a practical challenge for decision-making agents in real-world applications such as recommendation systems, though the contribution is incremental because it builds on existing contextual bandit methods.

The paper tackled the problem of online learning with episodically revealed rewards in nonstationary contexts, where rewards are not always available, by introducing BerlinUCB, which incorporates clustering for self-supervision. Experiments across six scenarios showed clear advantages over standard contextual bandits, with improvements in cumulative regret and sample efficiency.

We considered a novel practical problem of online learning with episodically revealed rewards, motivated by several real-world applications, where the contexts are nonstationary over different episodes and reward feedback is not always available to the decision-making agents. For this online semi-supervised learning setting, we introduced Background Episodic Reward LinUCB (BerlinUCB), a solution that easily incorporates clustering as a self-supervision module to provide useful side information when rewards are not observed. Our experiments on a variety of datasets, in both stationary and nonstationary environments across six different scenarios, demonstrated clear advantages of the proposed approach over the standard contextual bandit. Lastly, we introduced a relevant real-life example where this problem setting is especially useful.
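To make the setting concrete, the following is a minimal sketch of a LinUCB-style agent that falls back on a clustering module when the reward is not revealed. It is not the authors' BerlinUCB algorithm: the class name `EpisodicLinUCB`, the online k-means update, and the choice to impute a missing reward with the cluster's running mean reward are all assumptions made for illustration only.

```python
import numpy as np


class EpisodicLinUCB:
    """Illustrative LinUCB with cluster-based reward imputation (hypothetical)."""

    def __init__(self, n_arms, dim, n_clusters=3, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.alpha = alpha
        # Per-arm ridge-regression statistics used by standard LinUCB.
        self.A = np.stack([np.eye(dim) for _ in range(n_arms)])
        self.b = np.zeros((n_arms, dim))
        # Online k-means state (assumed self-supervision module).
        self.centroids = rng.normal(size=(n_clusters, dim))
        self.counts = np.zeros(n_clusters)
        # Running reward statistics per (cluster, arm) pair.
        self.reward_sum = np.zeros((n_clusters, n_arms))
        self.reward_n = np.zeros((n_clusters, n_arms))

    def select(self, x):
        """Pick the arm with the highest upper confidence bound."""
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def _assign(self, x):
        """Assign x to its nearest centroid and move that centroid toward x."""
        k = int(np.argmin(np.linalg.norm(self.centroids - x, axis=1)))
        self.counts[k] += 1
        self.centroids[k] += (x - self.centroids[k]) / self.counts[k]
        return k

    def update(self, x, arm, reward=None):
        """Update with an observed reward, or an imputed one if reward is None."""
        k = self._assign(x)
        if reward is not None:
            self.reward_sum[k, arm] += reward
            self.reward_n[k, arm] += 1
        elif self.reward_n[k, arm] > 0:
            # Assumption: use the cluster's mean observed reward for this arm.
            reward = self.reward_sum[k, arm] / self.reward_n[k, arm]
        else:
            return  # no reward and no cluster history: only the clustering learns
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```

In use, the agent would call `select(x)` on each context and then `update(x, arm, reward)` with `reward=None` on rounds where feedback is withheld. The clustering state is refreshed on every round, so the context model keeps tracking the input distribution even when no reward arrives.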

Code Implementations: 1 repo