LG MLMar 1, 2019

Combinatorial Bandits with Relative Feedback

arXiv:1903.00543v214.635 citations

Originality Incremental advance

AI Analysis

This work addresses a problem in online learning and decision-making for scenarios where only relative preferences are available, offering theoretical and empirical advances, though it appears incremental as it builds on existing bandit feedback models.

The paper tackles combinatorial online learning with subset choices using only relative feedback, specifically under the Multinomial logit choice model, and achieves instance-dependent, order-optimal regret algorithms with guarantees of O((n/m) ln T) and O((n/k) ln T) for two settings, showing the value of top-m rank-ordered feedback over single winner feedback.

We consider combinatorial online learning with subset choices when only relative feedback information from subsets is available, instead of bandit or semi-bandit feedback which is absolute. Specifically, we study two regret minimisation problems over subsets of a finite ground set $[n]$, with subset-wise relative preference information feedback according to the Multinomial logit choice model. In the first setting, the learner can play subsets of size bounded by a maximum size and receives top-$m$ rank-ordered feedback, while in the second setting the learner can play subsets of a fixed size $k$ with a full subset ranking observed as feedback. For both settings, we devise instance-dependent and order-optimal regret algorithms with regret $O(\frac{n}{m} \ln T)$ and $O(\frac{n}{k} \ln T)$, respectively. We derive fundamental limits on the regret performance of online learning with subset-wise preferences, proving the tightness of our regret guarantees. Our results also show the value of eliciting more general top-$m$ rank-ordered feedback over single winner feedback ($m=1$). Our theoretical results are corroborated with empirical evaluations.

View on arXiv PDF

Similar