CV · Feb 20, 2024

OLViT: Multi-Modal State Tracking via Attention-Based Embeddings for Video-Grounded Dialog

arXiv:2402.13146v181 citations · h-index: 7 · LREC
Originality: Highly original
AI Analysis

This work addresses video-grounded dialog tasks for AI systems, with incremental improvements in multi-modal state tracking.

The paper tackles video-grounded dialog by addressing challenges in spatio-temporal localization, long-term reasoning, and object tracking, and reports new state-of-the-art performance on the DVD and SIMMC 2.1 datasets.

We present the Object Language Video Transformer (OLViT), a novel model for video dialog that operates over a multi-modal attention-based dialog state tracker. Existing video dialog models struggle with questions requiring both spatial and temporal localization within videos, long-term temporal reasoning, and accurate object tracking across multiple dialog turns. OLViT addresses these challenges by maintaining a global dialog state based on the output of an Object State Tracker (OST) and a Language State Tracker (LST): while the OST attends to the most important objects within the video, the LST keeps track of the most important linguistic co-references to previous dialog turns. In stark contrast to previous works, our approach is generic by nature and is therefore capable of learning continuous multi-modal dialog state representations of the most relevant objects and rounds. As a result, these representations can be seamlessly integrated into Large Language Models (LLMs) and offer high flexibility in dealing with different datasets and tasks. Evaluations on the challenging DVD (response classification) and SIMMC 2.1 (response generation) datasets show that OLViT achieves new state-of-the-art performance on both datasets.
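To make the idea of an attention-based dialog state concrete, here is a minimal sketch in plain Python. It is not the OLViT implementation: the function names (`attend`, `dialog_state`), the use of single-query scaled dot-product attention, the concatenation of the two pooled vectors, and the toy embeddings are all assumptions made for illustration. It only shows how a question embedding could attend separately over object embeddings (the role the abstract gives the OST) and over previous-turn embeddings (the role it gives the LST), and how the two results might be combined into one continuous state vector.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, items):
    """Scaled dot-product attention of one query vector over item vectors.

    Returns the attention weights and the weighted sum of the items.
    """
    dim = len(query)
    scores = [
        sum(q * x for q, x in zip(query, item)) / math.sqrt(dim)
        for item in items
    ]
    weights = softmax(scores)
    pooled = [
        sum(w * item[i] for w, item in zip(weights, items))
        for i in range(dim)
    ]
    return weights, pooled


def dialog_state(question, object_embs, turn_embs):
    """Hypothetical global state: object summary concatenated with turn summary."""
    obj_weights, obj_state = attend(question, object_embs)    # object-tracker role (assumed)
    turn_weights, lang_state = attend(question, turn_embs)    # language-tracker role (assumed)
    return {
        "object_weights": obj_weights,
        "turn_weights": turn_weights,
        "state": obj_state + lang_state,  # list concatenation, not addition
    }


# Toy 4-dimensional embeddings, invented for this example.
question = [1.0, 0.0, 0.0, 0.0]
objects = [[4.0, 0.0, 0.0, 0.0], [0.0, 4.0, 0.0, 0.0], [0.0, 0.0, 4.0, 0.0]]
turns = [[0.0, 0.0, 0.0, 1.0], [2.0, 0.0, 0.0, 0.0]]

result = dialog_state(question, objects, turns)
```

In this toy setup the first object and the second dialog turn are the ones aligned with the question, so they receive the largest attention weights. A learned model would replace the fixed embeddings and the raw dot product with trained projections, and would update the state across dialog turns rather than computing it once.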

