LG · AI · ML · Jan 24, 2019

The Assistive Multi-Armed Bandit

arXiv:1901.08654v1 · 42 citations
Originality: Highly original
AI Analysis

This work contributes toward a theory of human-robot interaction algorithms, addressing scenarios where humans are still learning their preferences rather than acting optimally with respect to them.

The paper studies a robot assisting a human in a multi-armed bandit task. The human learns the reward function from the rewards of its arm pulls, while the robot observes only which arms the human chooses, not the rewards. The authors establish necessary and sufficient conditions for successful assistance and show that better human performance in isolation does not guarantee better outcomes when the human is assisted.

Learning preferences implicit in the choices humans make is a well-studied problem in both economics and computer science. However, most work assumes that humans act (noisily) optimally with respect to their preferences. Such approaches can fail when people are themselves learning about what they want. In this work, we introduce the assistive multi-armed bandit, where a robot assists a human playing a bandit task to maximize cumulative reward. In this problem, the human does not know the reward function but can learn it through the rewards received from arm pulls; the robot only observes which arms the human pulls, not the reward associated with each pull. We offer sufficient and necessary conditions for successfully assisting the human in this framework. Surprisingly, better human performance in isolation does not necessarily lead to better performance when assisted by the robot: a human policy can do better by effectively communicating its observed rewards to the robot. We conduct proof-of-concept experiments that support these results. We see this work as contributing towards a theory behind algorithms for human-robot interaction.
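The abstract's point that a human policy can help by communicating its observed rewards can be illustrated with a toy model. This is a sketch under stated assumptions, not the paper's construction: it assumes Bernoulli arms and a human who plays win-stay/lose-shift with a fixed cyclic switch rule, and the names `simulate_wsls`, `robot_decode`, and `estimate_means` are hypothetical. Because the human's next choice depends deterministically on the last reward, a robot that sees only the arm sequence can recover every reward except the final one.

```python
import random


def simulate_wsls(means, steps, seed=0):
    """Hypothetical human: win-stay/lose-shift on Bernoulli arms (an assumption).

    Returns the arms pulled (visible to the robot) and the rewards
    (visible only to the human).
    """
    k = len(means)
    if k < 2:
        raise ValueError("need at least two arms for a shift to be observable")
    rng = random.Random(seed)
    arm = rng.randrange(k)
    arms, rewards = [], []
    for _ in range(steps):
        r = 1 if rng.random() < means[arm] else 0
        arms.append(arm)
        rewards.append(r)
        if r == 0:
            # Lose-shift: move to the next arm cyclically, so a loss
            # always changes the visible choice.
            arm = (arm + 1) % k
    return arms, rewards


def robot_decode(arms):
    """Robot sees only choices: staying implies reward 1, switching implies 0.

    The reward of the last pull cannot be inferred, so the output is one
    element shorter than the input.
    """
    return [1 if a == b else 0 for a, b in zip(arms, arms[1:])]


def estimate_means(arms, rewards, k):
    """Empirical mean reward per arm (None for arms never pulled)."""
    pulls = [0] * k
    wins = [0] * k
    for a, r in zip(arms, rewards):
        pulls[a] += 1
        wins[a] += r
    return [w / p if p else None for w, p in zip(wins, pulls)]


if __name__ == "__main__":
    arms, rewards = simulate_wsls([0.2, 0.8, 0.5], steps=1000)
    decoded = robot_decode(arms)
    # The robot recovers the hidden rewards exactly from choices alone.
    assert decoded == rewards[:-1]
    print(estimate_means(arms[:-1], decoded, 3))
```

Under a policy like this, the choices themselves carry the reward signal, so the robot can estimate arm means without ever seeing a reward. A human policy whose choices reveal less about the rewards would give the robot less to infer from, even if that policy performs well on its own; this is one simple way to read the trade-off the abstract describes.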

Code Implementations: 1 repo
