Optimizing Multimodal LLMs for Egocentric Video Understanding: A Solution for the HD-EPIC VQA Challenge
This work addresses egocentric video question answering with multimodal LLMs. Its contribution is an incremental improvement obtained through optimization of the full pipeline.
The paper targets three weaknesses of Multimodal Large Language Models (MLLMs) on complex video QA benchmarks such as HD-EPIC VQA: ambiguous queries, poor temporal reasoning, and non-standardized outputs. The resulting system achieves 41.6% accuracy on HD-EPIC VQA.
Multimodal Large Language Models (MLLMs) struggle with complex video QA benchmarks like HD-EPIC VQA due to ambiguous queries and options, poor long-range temporal reasoning, and non-standardized outputs. We propose a framework that integrates query and choice pre-processing, domain-specific fine-tuning of Qwen2.5-VL, a novel Temporal Chain-of-Thought (T-CoT) prompting strategy for multi-step reasoning, and robust post-processing. This system achieves 41.6% accuracy on HD-EPIC VQA, highlighting the need for holistic pipeline optimization in demanding video understanding. Our code and fine-tuned models are available at https://github.com/YoungSeng/Egocentric-Co-Pilot.
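To make the "non-standardized outputs" problem concrete, the following is a minimal sketch of what a post-processing step for multiple-choice answers could look like. It is not the authors' implementation: the function name `extract_choice`, the matching rules, and the default of five options are all assumptions made for illustration. The idea is simply to map a free-form model response onto a single option letter, and to return nothing when no rule applies so that the caller can decide on a fallback.

```python
import re
from typing import Optional


def extract_choice(raw_output: str, num_choices: int = 5) -> Optional[str]:
    """Map a free-form response to one option letter, or None if unclear.

    Hypothetical helper: the rules below are illustrative assumptions,
    not the post-processing used in the paper.
    """
    letters = "ABCDEFGHIJ"[:num_choices]
    text = raw_output.strip()

    # Rule 1: explicit statements such as "Answer: B" or "the answer is (C)".
    # The keyword is case-insensitive; the option letter must be uppercase.
    m = re.search(
        rf"(?i:answer)\s*(?i:is)?\s*[:\-]?\s*\(?([{letters}])\)?(?![A-Za-z])",
        text,
    )
    if m:
        return m.group(1)

    # Rule 2: the response starts with an option label, e.g. "B", "B.", "(D) ...".
    # A bare letter followed by a space ("A knife ...") is deliberately rejected.
    m = re.match(rf"\(?([{letters}])(?:[).:]|\s*$)", text)
    if m:
        return m.group(1)

    # Rule 3: phrases such as "option A" or "choice (B)".
    m = re.search(
        rf"(?i:option|choice)\s*\(?([{letters}])\)?(?![A-Za-z])",
        text,
    )
    if m:
        return m.group(1)

    # No rule matched: leave the decision to the caller.
    return None
```

Returning `None` instead of guessing keeps unparseable responses visible, so they can be counted or re-queried rather than silently scored as a wrong letter.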