CL · Oct 12, 2021

Are you doing what I say? On modalities alignment in ALFRED

arXiv:2110.05665v1
Originality: Incremental advance
AI Analysis

This paper addresses a key challenge in multimodal AI for task completion in simulated environments, but the advance is incremental because it builds on existing benchmarks and methods.

The paper tackles the problem of aligning text and visual modalities in the ALFRED benchmark, showing that existing models fail at this alignment, and introduces approaches that improve alignment and end-task performance.

ALFRED is a recently proposed benchmark that requires a model to complete tasks in simulated house environments specified by instructions in natural language. We hypothesize that the key to success is accurately aligning the text modality with visual inputs. Motivated by this, we inspect how well existing models can align these modalities using our proposed intrinsic metric, boundary adherence score (BAS). The results show that previous models are indeed failing to perform proper alignment. To address this issue, we introduce approaches aimed at improving model alignment and demonstrate how improved alignment improves end-task performance.

