Audio-Visual Scene-Aware Dialog
This paper addresses multimodal dialog understanding for AI systems, though its contribution is incremental because it builds on existing datasets and tasks.
The paper introduces the Audio-Visual Scene-Aware Dialog (AVSD) task: generating natural responses to questions about a scene, which requires grounding concepts in video and audio while using the dialog history. It benchmarks the task with a dataset covering more than 11,000 videos and shows that models perform best when they use all available inputs.
We introduce the task of scene-aware dialog. Our goal is to generate a complete and natural response to a question about a scene, given video and audio of the scene and the history of previous turns in the dialog. To answer successfully, agents must ground concepts from the question in the video while leveraging contextual cues from the dialog history. To benchmark this task, we introduce the Audio Visual Scene-Aware Dialog (AVSD) Dataset. For each of more than 11,000 videos of human actions from the Charades dataset, our dataset contains a dialog about the video, plus a final summary of the video by one of the dialog participants. We train several baseline systems for this task and evaluate the performance of the trained models using both qualitative and quantitative metrics. Our results indicate that models must utilize all the available inputs (video, audio, question, and dialog history) to perform best on this dataset.
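The task setup described above can be pictured as a simple data record. The following is only a minimal sketch under stated assumptions: the class name `SceneDialog`, the function `build_example`, and all field names are hypothetical and are not taken from the dataset's actual format or the authors' code. It shows one way a dialog could be turned into an input/target pair, where the history contains only the turns before the current question.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SceneDialog:
    """Hypothetical record: one video plus the dialog and summary about it."""
    video_id: str  # assumed identifier standing in for the video and audio
    turns: List[Tuple[str, str]] = field(default_factory=list)  # (question, answer) pairs
    summary: str = ""  # final summary written by a dialog participant


def build_example(dialog: SceneDialog, turn_index: int) -> dict:
    """Assemble inputs and target for the question at `turn_index`.

    The history holds only the turns that precede the current question.
    """
    if not 0 <= turn_index < len(dialog.turns):
        raise IndexError("turn_index out of range")
    question, answer = dialog.turns[turn_index]
    return {
        "video_id": dialog.video_id,
        "history": dialog.turns[:turn_index],
        "question": question,
        "target": answer,
    }


if __name__ == "__main__":
    # The dialog text below is invented purely for illustration.
    dialog = SceneDialog(
        video_id="example_video",
        turns=[
            ("Is anyone in the room?", "Yes, one person."),
            ("What is the person holding?", "A cup."),
        ],
    )
    print(build_example(dialog, 1))
```

Dropping a key such as `history` from the returned dictionary would be one simple way to set up the kind of input comparison the abstract mentions, although the paper's own experimental procedure is not specified here.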