LGApr 12, 2021

Pure Exploration with Structured Preference Feedback

Shubham Gupta, Aadirupa Saha, Sumeet Katariya

arXiv:2104.05294v13.11 citations

Originality Incremental advance

AI Analysis

This addresses online decision-making scenarios like retailing and advertising where human feedback is easier to obtain via preferences, but it is incremental as it extends existing preference feedback models to a structured setting.

The paper tackles the problem of pure exploration with subset-wise preference feedback, where the goal is to efficiently identify the best arm using as few queries as possible, and presents algorithms that guarantee detection in $ ilde{O} (rac{d^2}{K \Delta^2})$ samples with a matching lower bound, with experiments showing up to 12x fewer samples than non-adaptive methods.

We consider the problem of pure exploration with subset-wise preference feedback, which contains $N$ arms with features. The learner is allowed to query subsets of size $K$ and receives feedback in the form of a noisy winner. The goal of the learner is to identify the best arm efficiently using as few queries as possible. This setting is relevant in various online decision-making scenarios involving human feedback such as online retailing, streaming services, news feed, and online advertising; since it is easier and more reliable for people to choose a preferred item from a subset than to assign a likability score to an item in isolation. To the best of our knowledge, this is the first work that considers the subset-wise preference feedback model in a structured setting, which allows for potentially infinite set of arms. We present two algorithms that guarantee the detection of the best-arm in $\tilde{O} (\frac{d^2}{K Δ^2})$ samples with probability at least $1 - δ$, where $d$ is the dimension of the arm-features and $Δ$ is the appropriate notion of utility gap among the arms. We also derive an instance-dependent lower bound of $Ω(\frac{d}{Δ^2} \log \frac{1}δ)$ which matches our upper bound on a worst-case instance. Finally, we run extensive experiments to corroborate our theoretical findings, and observe that our adaptive algorithm stops and requires up to 12x fewer samples than a non-adaptive algorithm.

View on arXiv PDF

Similar