CVFeb 20, 2024

OLViT: Multi-Modal State Tracking via Attention-Based Embeddings for Video-Grounded Dialog

Adnen Abdessaied, Manuel von Hochmeister, Andreas Bulling

arXiv:2402.13146v131.981 citationsh-index: 7LREC

Originality Highly original

AI Analysis

This work solves video dialog tasks for AI systems, with incremental improvements in multi-modal state tracking.

The paper tackles the problem of video-grounded dialog by addressing challenges in spatial-temporal localization, long-term reasoning, and object tracking, achieving new state-of-the-art performance on DVD and SIMMC 2.1 datasets.

We present the Object Language Video Transformer (OLViT) - a novel model for video dialog operating over a multi-modal attention-based dialog state tracker. Existing video dialog models struggle with questions requiring both spatial and temporal localization within videos, long-term temporal reasoning, and accurate object tracking across multiple dialog turns. OLViT addresses these challenges by maintaining a global dialog state based on the output of an Object State Tracker (OST) and a Language State Tracker (LST): while the OST attends to the most important objects within the video, the LST keeps track of the most important linguistic co-references to previous dialog turns. In stark contrast to previous works, our approach is generic by nature and is therefore capable of learning continuous multi-modal dialog state representations of the most relevant objects and rounds. As a result, they can be seamlessly integrated into Large Language Models (LLMs) and offer high flexibility in dealing with different datasets and tasks. Evaluations on the challenging DVD (response classification) and SIMMC 2.1 (response generation) datasets show that OLViT achieves new state-of-the-art performance across both datasets.

View on arXiv PDF

Similar