Learning to Summarize and Answer Questions about a Virtual Robot's Past Actions
This work addresses transparency in human-robot interaction for users who need to monitor robot activities. The contribution is incremental: it builds on existing language model capabilities and is set in a virtual environment.
The paper tackles the problem of enabling users to query and understand a robot's past actions through natural language. It trains a single system built around a large language model to both summarize action sequences and answer questions about them from ego-centric video frames. In the resulting system, object representations learned through question answering transfer zero-shot to action summarization and improve it.
When robots perform long action sequences, users will want to easily and reliably find out what they have done. We therefore demonstrate the task of learning to summarize and answer questions about a robot agent's past actions using natural language alone. A single system with a large language model at its core is trained to both summarize and answer questions about action sequences given ego-centric video frames of a virtual robot and a question prompt. To enable training of question answering, we develop a method to automatically generate English-language questions and answers about objects, actions, and the temporal order in which actions occurred during episodes of robot action in the virtual environment. Training one model to both summarize and answer questions enables zero-shot transfer of representations of objects learned through question answering to improved action summarization.
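The abstract does not spell out how questions and answers are generated, so the following is only an illustrative sketch of one plausible approach: filling fixed English templates from a symbolic log of an episode. The `Step` record, the `generate_qa` function, the template wording, and the example episode are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One entry of a hypothetical episode log (not the paper's format)."""
    action: str  # verb phrase, e.g. "pick up"
    obj: str     # object name, e.g. "mug"


def describe(step):
    """Render a step as a short English phrase."""
    return f"{step.action} the {step.obj}"


def generate_qa(episode, distractor_objects=()):
    """Build (question, answer) pairs about objects, actions, and order.

    Assumption: templates are fixed strings. Repeated identical steps would
    make the temporal questions ambiguous; this sketch does not handle that.
    """
    qa = []
    used = {step.obj for step in episode}

    # Object questions: "yes" for objects in the episode, "no" for distractors.
    for obj in sorted(used):
        qa.append((f"Did the robot interact with the {obj}?", "yes"))
    for obj in distractor_objects:
        if obj not in used:
            qa.append((f"Did the robot interact with the {obj}?", "no"))

    # Action questions: ask for the action at each position in the episode.
    for i, step in enumerate(episode, start=1):
        qa.append((f"What was action number {i}?", describe(step)))

    # Temporal-order questions: ask which action directly followed another.
    for prev, nxt in zip(episode, episode[1:]):
        question = f"What did the robot do right after '{describe(prev)}'?"
        qa.append((question, describe(nxt)))

    return qa


# Hypothetical three-step episode, plus one object that never appears in it.
episode = [Step("pick up", "mug"), Step("put down", "mug"), Step("open", "fridge")]
qa = generate_qa(episode, distractor_objects=["knife"])

for question, answer in qa:
    print(f"Q: {question}  A: {answer}")
```

Because every pair is derived from the log, the answers are correct by construction, which is the property that makes automatically generated data usable as training supervision.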