Optimizing Multimodal LLMs for Egocentric Video Understanding: A Solution for the HD-EPIC VQA Challenge

Sicheng Yang, Yukai Huang, Shitong Sun, Weitong Cai, Jiankang Deng, Jifei Song, Zhensong Zhang

arXiv:2601.10228v14.03 citations | h-index: 4 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses video understanding challenges for AI systems in egocentric scenarios, representing an incremental improvement through pipeline optimization.

The paper addresses the difficulty that Multimodal Large Language Models (MLLMs) have with complex video QA benchmarks such as HD-EPIC VQA, which it attributes to ambiguous queries, poor temporal reasoning, and non-standardized outputs. The proposed system reaches 41.6% accuracy on HD-EPIC VQA.

Multimodal Large Language Models (MLLMs) struggle with complex video QA benchmarks like HD-EPIC VQA due to ambiguous queries/options, poor long-range temporal reasoning, and non-standardized outputs. We propose a framework integrating query/choice pre-processing, domain-specific Qwen2.5-VL fine-tuning, a novel Temporal Chain-of-Thought (T-CoT) prompting scheme for multi-step reasoning, and robust post-processing. This system achieves 41.6% accuracy on HD-EPIC VQA, highlighting the need for holistic pipeline optimization in demanding video understanding. Our code and fine-tuned models are available at https://github.com/YoungSeng/Egocentric-Co-Pilot.
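The abstract does not say how the post-processing stage standardizes model outputs. As an illustration only, the sketch below shows one common way to map a free-form multiple-choice reply onto an option index. The function name `extract_choice`, the letter-based option labels, and the default of five options are assumptions made for this example and are not taken from the paper or its repository.

```python
import re
from typing import Optional


def extract_choice(raw_output: str, num_choices: int = 5) -> Optional[int]:
    """Map a free-form model reply to a 0-based option index, or None.

    Hypothetical helper: assumes options are labelled A, B, C, ...
    """
    letters = "ABCDEFGHIJ"[:num_choices]
    text = raw_output.strip()

    # 1) Explicit phrases such as "Answer: B" or "the answer is (C)".
    #    Only the words are case-insensitive; the option letter must be
    #    uppercase and must not be the first letter of a longer word.
    explicit = re.search(
        rf"(?i:answer)\s*(?i:is)?\s*:?\s*\(?([{letters}])\)?(?![A-Za-z])",
        text,
    )
    if explicit:
        return letters.index(explicit.group(1))

    # 2) A reply that starts with a bare option label, e.g. "B." or "(D) ...".
    leading = re.match(rf"\(?([{letters}])\)?(?![A-Za-z])", text)
    if leading:
        return letters.index(leading.group(1))

    # 3) Nothing recognisable: let the caller decide how to handle it.
    return None
```

Returning `None` instead of guessing keeps unparseable replies visible, so they can be counted or retried separately rather than silently scored as a random option.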
