MeetUp! A Corpus of Joint Activity Dialogues in a Visual Environment
This work addresses the limited dialogue complexity of existing vision-language tasks, a concern for researchers in AI and computational linguistics. Its contribution is incremental, since it builds on prior datasets.
The authors tackle the oversimplification of dialogue in existing vision-language datasets by introducing MeetUp!, a two-player coordination game in which players must find each other in a visual environment, which requires both visual and conversational grounding. They describe a data collection and show that the resulting dialogues exhibit the targeted dialogue phenomena while also challenging the integration of language and vision.
Building computer systems that can converse about their visual environment is one of the oldest concerns of research in Artificial Intelligence and Computational Linguistics (see, for example, Winograd's 1972 SHRDLU system). Only recently, however, have methods from computer vision and natural language processing become powerful enough to make this vision seem more attainable. Pushed especially by developments in computer vision, many data sets and collection environments have recently been published that bring together verbal interaction and visual processing. Here, we argue that these datasets tend to oversimplify the dialogue part, and we propose a task, MeetUp!, that requires both visual and conversational grounding, and that makes stronger demands on representations of the discourse. MeetUp! is a two-player coordination game where players move in a visual environment, with the objective of finding each other. To do so, they must talk about what they see, and achieve mutual understanding. We describe a data collection and show that the resulting dialogues indeed exhibit the dialogue phenomena of interest, while also challenging the language & vision aspect.