LG AI ML | Jan 24, 2019

The Assistive Multi-Armed Bandit

Lawrence Chan, Dylan Hadfield-Menell, Siddhartha Srinivasa, Anca Dragan

arXiv:1901.08654v1 13.143 citations Has Code

Originality: Highly original

AI Analysis

This work contributes to a theory for human-robot interaction algorithms, addressing scenarios where humans are learning preferences rather than acting optimally.

The paper studies a robot assisting a human in a multi-armed bandit task. The human learns the reward function from the rewards of arm pulls, while the robot observes only which arms are chosen, not the rewards. The paper establishes conditions for successful assistance and shows that better human performance in isolation does not guarantee better assisted outcomes.

Learning preferences implicit in the choices humans make is a well studied problem in both economics and computer science. However, most work makes the assumption that humans are acting (noisily) optimally with respect to their preferences. Such approaches can fail when people are themselves learning about what they want. In this work, we introduce the assistive multi-armed bandit, where a robot assists a human playing a bandit task to maximize cumulative reward. In this problem, the human does not know the reward function but can learn it through the rewards received from arm pulls; the robot only observes which arms the human pulls but not the reward associated with each pull. We offer sufficient and necessary conditions for successfully assisting the human in this framework. Surprisingly, better human performance in isolation does not necessarily lead to better performance when assisted by the robot: a human policy can do better by effectively communicating its observed rewards to the robot. We conduct proof-of-concept experiments that support these results. We see this work as contributing towards a theory behind algorithms for human-robot interaction.
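The information structure described in the abstract can be sketched as a small simulation. This is only an illustration under stated assumptions, not the paper's method: the Bernoulli rewards, the epsilon-greedy human, the "pull the most-suggested arm" robot, and the name `simulate_assistive_bandit` are all hypothetical choices made for this example. The one property taken from the abstract is that the human sees rewards while the robot sees only the human's arm choices.

```python
import random


def simulate_assistive_bandit(arm_means, horizon, epsilon=0.1, seed=0):
    """Run one episode; return (total_reward, suggestion_counts)."""
    rng = random.Random(seed)
    k = len(arm_means)
    pulls = [0] * k              # human's private pull counts
    reward_sums = [0.0] * k      # human's private reward totals
    suggestion_counts = [0] * k  # the only data the robot ever sees
    total_reward = 0

    for _ in range(horizon):
        # Human (assumed epsilon-greedy learner) chooses an arm.
        untried = [a for a in range(k) if pulls[a] == 0]
        if untried:
            suggested = untried[0]
        elif rng.random() < epsilon:
            suggested = rng.randrange(k)
        else:
            suggested = max(range(k), key=lambda a: reward_sums[a] / pulls[a])

        # Robot observes the human's choice, never the reward.
        suggestion_counts[suggested] += 1
        # Assumed robot policy: pull the arm chosen most often so far.
        pulled = max(range(k), key=lambda a: suggestion_counts[a])

        # Only the human observes the (assumed Bernoulli) reward.
        reward = 1 if rng.random() < arm_means[pulled] else 0
        pulls[pulled] += 1
        reward_sums[pulled] += reward
        total_reward += reward

    return total_reward, suggestion_counts


if __name__ == "__main__":
    total, counts = simulate_assistive_bandit([0.2, 0.8], horizon=200, seed=1)
    print("cumulative reward:", total)
    print("human choices seen by robot:", counts)
```

Swapping in a different human policy while keeping the robot fixed is one way to explore the abstract's claim that a human policy which does worse alone may still do better when assisted, because its choices reveal more about the rewards it has observed.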
