Advancing Egocentric Video Question Answering with Multimodal Large Language Models
This work addresses challenges in egocentric video understanding for AI systems, but it is incremental: it focuses on benchmarking and improving existing methods on a new dataset.
This paper tackles Egocentric Video Question Answering by evaluating Multimodal Large Language Models on a refined dataset. Fine-tuned models achieve new state-of-the-art performance, with gains of up to +2.6% ROUGE/METEOR for OpenQA and +13% accuracy for CloseQA.
Egocentric Video Question Answering (QA) requires models to handle long-horizon temporal reasoning, first-person perspectives, and specialized challenges like frequent camera movement. This paper systematically evaluates both proprietary and open-source Multimodal Large Language Models (MLLMs) on QaEgo4Dv2, a refined dataset of egocentric videos derived from QaEgo4D. Four popular MLLMs (GPT-4o, Gemini-1.5-Pro, Video-LLaVa-7B and Qwen2-VL-7B-Instruct) are assessed using zero-shot and fine-tuned approaches in both OpenQA and CloseQA settings. We introduce QaEgo4Dv2 to mitigate annotation noise in QaEgo4D, enabling more reliable comparison. Our results show that fine-tuned Video-LLaVa-7B and Qwen2-VL-7B-Instruct achieve new state-of-the-art performance, surpassing previous benchmarks by up to +2.6% ROUGE/METEOR (for OpenQA) and +13% accuracy (for CloseQA). We also present a thorough error analysis, which points to the models' difficulty with spatial reasoning and fine-grained object recognition, key areas for future improvement.
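To make the two reported metric families concrete, here is a minimal sketch of how CloseQA accuracy and a ROUGE-style OpenQA score could be computed. It is an illustration under stated assumptions, not the paper's evaluation code: the function names are hypothetical, the ROUGE variant is assumed to be a simplified ROUGE-L F1 with lowercase whitespace tokenization, and METEOR is omitted because it needs stemming and synonym resources.

```python
from typing import List, Sequence


def closeqa_accuracy(predicted: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of multiple-choice questions whose predicted option matches the gold option."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def _lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-L F1 (assumption: lowercase whitespace tokens, no stemming)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    lcs = _lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy inputs for illustration only; they are not taken from QaEgo4Dv2.
    print(closeqa_accuracy([0, 1, 2, 3], [0, 1, 0, 3]))
    print(rouge_l_f1("on the kitchen table", "on the table"))
```

In the toy example, three of four multiple-choice answers match, and the free-form answer shares the subsequence "on the table" with the reference. Published numbers would normally come from an established metric implementation, so scores from this sketch are not directly comparable to those reported above.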