CV · AI · Sep 26, 2024

EAGLE: Egocentric AGgregated Language-video Engine

arXiv:2409.17523v1 · 18 citations · h-index: 13
Originality: Incremental advance
AI Analysis

This paper addresses inconsistent annotations and isolated model development, which hamper researchers and practitioners in egocentric video understanding. The advance is incremental, since it builds on existing multimodal large language models.

The authors tackled the fragmentation in egocentric video analysis by introducing EAGLE, a unified model, and EAGLE-400K, a large-scale dataset, achieving superior performance over existing models in tasks like action recognition and procedure learning.

The rapid evolution of egocentric video analysis brings new insights into understanding human activities and intentions from a first-person perspective. Despite this progress, the fragmentation in tasks like action recognition, procedure learning, and moment retrieval, etc., coupled with inconsistent annotations and isolated model development, hinders a holistic interpretation of video content. In response, we introduce the EAGLE (Egocentric AGgregated Language-video Engine) model and the EAGLE-400K dataset to provide a unified framework that integrates various egocentric video understanding tasks. EAGLE-400K, the first large-scale instruction-tuning dataset tailored for egocentric video, features 400K diverse samples to enhance a broad spectrum of tasks from activity recognition to procedure knowledge learning. Moreover, EAGLE, a strong video multimodal large language model (MLLM), is designed to effectively capture both spatial and temporal information. In addition, we propose a set of evaluation metrics designed to facilitate a thorough assessment of MLLM for egocentric video understanding. Our extensive experiments demonstrate EAGLE's superior performance over existing models, highlighting its ability to balance task-specific understanding with holistic video interpretation. With EAGLE, we aim to pave the way for research opportunities and practical applications in real-world scenarios.

