CL, CV
Apr 17, 2021

Mobile App Tasks with Iterative Feedback (MoTIF): Addressing Task Feasibility in Interactive Visual Environments

arXiv:2104.08560v1
32 citations
Originality: Synthesis-oriented
AI Analysis

This paper addresses the need for more complex reasoning in vision-language tasks. Its contribution is incremental in that it focuses on dataset creation.

The authors tackle task feasibility in interactive visual environments by introducing the MoTIF dataset, which includes natural language commands across diverse digital domains as well as unsatisfiable requests. Their initial feasibility classification experiments reach an F1 score of 37.3.

In recent years, vision-language research has shifted to study tasks which require more complex reasoning, such as interactive question answering, visual common sense reasoning, and question-answer plausibility prediction. However, the datasets used for these problems fail to capture the complexity of real inputs and multimodal environments, such as ambiguous natural language requests and diverse digital domains. We introduce Mobile app Tasks with Iterative Feedback (MoTIF), a dataset with natural language commands for the greatest number of interactive environments to date. MoTIF is the first to contain natural language requests for interactive environments that are not satisfiable, and we obtain follow-up questions on this subset to enable research on task uncertainty resolution. We perform initial feasibility classification experiments and only reach an F1 score of 37.3, verifying the need for richer vision-language representations and improved architectures to reason about task feasibility.
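
For readers unfamiliar with the metric, the following is a minimal sketch of how an F1 score for a binary feasibility classifier can be computed. It is not the authors' evaluation code: the function name `f1_score`, the label encoding (1 = infeasible, 0 = feasible), and the toy labels are all assumptions made for illustration. The abstract does not state which class is treated as positive or how scores are averaged, and the reported 37.3 is presumably on a 0 to 100 scale, whereas this sketch returns a value between 0 and 1.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    if tp == 0:
        # No correct positive predictions: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels: 1 = command is infeasible in the app, 0 = feasible.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
score = f1_score(y_true, y_pred)
print(f"F1 = {score:.3f}")
```

Because F1 ignores true negatives, it is a common choice when one class (here, the hypothetical infeasible class) is of primary interest.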

Code Implementations: 1 repo
