CL · AI · CV · LG · MM · Jan 5, 2023

SPRING: Situated Conversation Agent Pretrained with Multimodal Questions from Incremental Layout Graph

arXiv:2301.01949v1 · 12 citations · h-index: 27
Originality: Incremental advance
AI Analysis

The paper addresses a response-quality bottleneck in situated conversation agents. The advance appears incremental: it builds on existing multimodal QA tasks and contributes new pretraining methods.

The paper targets multimodal conversation agents that struggle with complex relative positions and information alignment in crowded scenarios. It proposes SPRING, which the authors report significantly outperforms state-of-the-art approaches on the SIMMC 1.0 and SIMMC 2.0 datasets.

Existing multimodal conversation agents have shown impressive abilities to locate absolute positions or retrieve attributes in simple scenarios, but they fail to perform well when complex relative positions and information alignments are involved, which poses a bottleneck in response quality. In this paper, we propose a Situated Conversation Agent Pretrained with Multimodal Questions from INcremental Layout Graph (SPRING), capable of reasoning over multi-hop spatial relations and connecting them with visual attributes in crowded situated scenarios. Specifically, we design two types of Multimodal Question Answering (MQA) tasks to pretrain the agent. All QA pairs used during pretraining are generated from novel Incremental Layout Graphs (ILG). QA pair difficulty labels, automatically annotated by the ILG, are used to support MQA-based Curriculum Learning. Experimental results verify SPRING's effectiveness, showing that it significantly outperforms state-of-the-art approaches on both the SIMMC 1.0 and SIMMC 2.0 datasets.
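To make the abstract's pipeline concrete, here is a minimal toy sketch of generating multi-hop spatial QA pairs from a layout graph and ordering them by difficulty for curriculum learning. It is not the authors' implementation. The edge format, the `QAPair` class, the `generate_qa` and `curriculum` functions, the question template, and the use of hop count as the difficulty label are all assumptions made for illustration; the abstract does not specify how the ILG is built or how difficulty is computed.

```python
from dataclasses import dataclass

# Hypothetical toy layout graph: each edge (subject, relation, object) reads
# "subject is <relation> object". This is NOT the paper's ILG format.
EDGES = [
    ("red jacket", "left of", "blue jeans"),
    ("blue jeans", "left of", "grey hoodie"),
    ("grey hoodie", "above", "black boots"),
]


@dataclass
class QAPair:
    question: str
    answer: str
    difficulty: int  # assumption: hop count stands in for the difficulty label


def generate_qa(edges, max_hops=3):
    """Emit one QA pair for every relation chain of 1..max_hops edges."""
    outgoing = {}
    for subj, rel, obj in edges:
        outgoing.setdefault(subj, []).append((rel, obj))

    pairs = []

    def walk(answer, node, rels):
        for rel, nxt in outgoing.get(node, []):
            chain = rels + [rel]
            # Describe the intermediate items from the far end of the chain inward.
            desc = f"the {nxt}"
            for r in reversed(chain[1:]):
                desc = f"the item {r} {desc}"
            pairs.append(QAPair(f"What is {chain[0]} {desc}?", answer, len(chain)))
            if len(chain) < max_hops:
                walk(answer, nxt, chain)

    for subj in outgoing:
        walk(subj, subj, [])
    return pairs


def curriculum(pairs):
    """Order QA pairs from easy to hard (stable sort on the difficulty label)."""
    return sorted(pairs, key=lambda p: p.difficulty)


if __name__ == "__main__":
    for pair in curriculum(generate_qa(EDGES)):
        print(pair.difficulty, pair.question, "->", pair.answer)
```

In this sketch a one-hop question asks about a direct neighbour, while a three-hop question nests two intermediate items that must be resolved first. Feeding the sorted list to a trainer in order is one simple way to realise an easy-to-hard curriculum.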

Code Implementations: 1 repo
