CV · Jun 18, 2024

DrVideo: Document Retrieval Based Long Video Understanding

arXiv:2406.12846v2 · 60 citations
Originality: Incremental advance
AI Analysis

The paper addresses long video understanding for AI systems and reports improved performance on the EgoSchema, MovieChat-1K, and Video-MME benchmarks. Its contribution is incremental in that it applies document retrieval techniques to a new domain.

Existing methods struggle with long videos because key information is hard to locate and long-range reasoning is difficult. The authors propose DrVideo, a document-retrieval-based system that converts a video into a text document and uses an iterative agent loop to search for missing information. It significantly outperforms existing LLM-based state-of-the-art methods on EgoSchema (3 minutes), MovieChat-1K (10 minutes), and the long split of Video-MME (average of 44 minutes).

Most existing methods for video understanding focus on videos lasting only tens of seconds, with limited exploration of techniques for handling long videos. The increased number of frames in long videos poses two main challenges: difficulty in locating key information and difficulty in performing long-range reasoning. Thus, we propose DrVideo, a document-retrieval-based system designed for long video understanding. Our key idea is to convert the long-video understanding problem into a long-document understanding task so as to effectively leverage the power of large language models. Specifically, DrVideo first transforms a long video into a coarse text-based long document to initially retrieve key frames and then updates the document with the augmented key frame information. It then employs an agent-based iterative loop to continuously search for missing information and augment the document until sufficient question-related information is gathered for making the final predictions in a chain-of-thought manner. Extensive experiments on long video benchmarks confirm the effectiveness of our method. DrVideo significantly outperforms existing LLM-based state-of-the-art methods on the EgoSchema benchmark (3 minutes), the MovieChat-1K benchmark (10 minutes), and the long split of the Video-MME benchmark (average of 44 minutes).
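The loop described in the abstract can be illustrated with a minimal sketch. It is based only on the abstract's wording and is not the authors' implementation. Every name here (`retrieve_top_k`, `answer_long_video`, and the `augment`, `plan`, and `answer` callables) is hypothetical. The word-overlap retriever is a toy stand-in for whatever retrieval DrVideo actually uses, and the callables stand in for the captioning model, the planning agent, and the answering LLM.

```python
from typing import Callable, List, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Lowercase whitespace tokens with surrounding punctuation stripped."""
    return {w.strip(".,?!").lower() for w in text.split() if w.strip(".,?!")}


def retrieve_top_k(document: List[str], question: str, k: int) -> List[int]:
    """Toy retriever (an assumption): rank frame descriptions by word overlap
    with the question; ties go to the earlier frame."""
    q = tokenize(question)
    scores = [(len(q & tokenize(desc)), -i) for i, desc in enumerate(document)]
    ranked = sorted(range(len(document)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def answer_long_video(
    coarse_captions: List[str],
    question: str,
    augment: Callable[[int, str], str],       # hypothetical: detailed description of frame i
    plan: Callable[[List[str], str], List[int]],  # hypothetical: frames still lacking detail
    answer: Callable[[List[str], str], str],  # hypothetical: final reasoning step
    k: int = 2,
    max_steps: int = 3,
) -> Tuple[str, List[str]]:
    """Turn the video into a text document, augment key frames, then loop
    until the planner reports nothing missing or the step budget runs out."""
    document = list(coarse_captions)  # one coarse description per frame
    augmented: Set[int] = set()

    # Initial retrieval: enrich the frames most related to the question.
    for i in retrieve_top_k(document, question, k):
        document[i] += " " + augment(i, question)
        augmented.add(i)

    # Iterative loop: ask the planner which frames still need detail.
    for _ in range(max_steps):
        missing = [
            i for i in plan(document, question)
            if 0 <= i < len(document) and i not in augmented
        ]
        if not missing:
            break  # planner considers the document sufficient
        for i in missing:
            document[i] += " " + augment(i, question)
            augmented.add(i)

    return answer(document, question), document
```

In a real system the three callables would wrap model calls. Here they can be plain functions, which keeps the control flow testable without any model. The `max_steps` budget is also an assumption, added so that the loop always terminates.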

