Latent Emission-Augmented Perspective-Taking (LEAPT) for Human-Robot Interaction
This work addresses the challenge of uncertainty in perspective-taking for human-robot interaction, though it appears incremental because it builds on existing probabilistic graphical models and deep learning methods.
The paper tackles the problem of enabling robots to perform perspective-taking in partially-observable human-robot interactions. It proposes a deep world model with a decomposed multi-modal latent state space, and reports that this model significantly outperforms existing baselines in predicting human observations and beliefs on three tasks.
Perspective-taking is the ability to perceive or understand a situation or concept from another individual's point of view, and is crucial in daily human interactions. Enabling robots to perform perspective-taking remains an unsolved problem; existing approaches that use deterministic or handcrafted methods are unable to accurately account for uncertainty in partially-observable settings. This work proposes to address this limitation via a deep world model that enables a robot to perform both perceptual and conceptual perspective-taking, i.e., the robot is able to infer what a human sees and believes. The key innovation is a decomposed multi-modal latent state space model able to generate and augment fictitious observations/emissions. Optimizing the ELBO that arises from this probabilistic graphical model enables the learning of uncertainty in latent space, which facilitates uncertainty estimation from high-dimensional observations. We tasked our model with predicting human observations and beliefs on three partially-observable HRI tasks. Experiments show that our method significantly outperforms existing baselines and is able to infer the visual observations available to other agents and their internal beliefs.
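
As a rough sketch of the kind of objective the abstract refers to, the ELBO for a generic sequential latent state-space model is commonly written in the form below. The notation is an assumption made for illustration and is not taken from the paper: $x_t$ is the observation, $z_t$ the latent state, $a_t$ the action, $p_\theta$ the generative model, and $q_\phi$ the approximate posterior. LEAPT's decomposed multi-modal objective may factorize differently.

```latex
% Generic ELBO for a sequential latent state-space model (illustrative notation,
% not the paper's exact objective).
\begin{equation}
\begin{aligned}
\log p_\theta(x_{1:T} \mid a_{1:T})
  \;\ge\; \sum_{t=1}^{T} \Big(
    & \mathbb{E}_{q_\phi(z_t \mid x_{\le t},\, a_{<t})}
        \big[ \log p_\theta(x_t \mid z_t) \big] \\
    & - \mathbb{E}_{q_\phi(z_{t-1} \mid x_{\le t-1},\, a_{<t-1})}
        \big[ \mathrm{KL}\big( q_\phi(z_t \mid z_{t-1}, a_{t-1}, x_t)
          \,\big\|\, p_\theta(z_t \mid z_{t-1}, a_{t-1}) \big) \big]
  \Big)
\end{aligned}
\end{equation}
```

In this form, the first term rewards accurate reconstruction of observations from the latent state, while the KL term keeps the posterior close to the learned transition prior. Because both are distributions rather than point estimates, the latent state can carry uncertainty about what is not observed.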