CV MM Feb 11, 2025

EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering

Sheng Zhou, Junbin Xiao, Qingyun Li, Yicong Li, Xun Yang, Dan Guo, Meng Wang, Tat-Seng Chua, Angela Yao

arXiv:2502.07411v232.654 citations h-index: 18 Has Code CVPR

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for better QA assistance in real-world egocentric scenarios such as driving and housekeeping, though its contribution is incremental, since it primarily provides a new benchmark.

The authors tackle egocentric scene-text aware video question answering by introducing the EgoTextVQA benchmark, which includes 1.5K videos and 7K questions. They find that current models struggle significantly on this task, with the best achieving only about 33% accuracy.

We introduce EgoTextVQA, a novel and rigorously constructed benchmark for egocentric QA assistance involving scene text. EgoTextVQA contains 1.5K ego-view videos and 7K scene-text aware questions that reflect real user needs in outdoor driving and indoor house-keeping activities. The questions are designed to elicit identification and reasoning on scene text in an egocentric and dynamic environment. With EgoTextVQA, we comprehensively evaluate 10 prominent multimodal large language models. Currently, all models struggle, and the best results (Gemini 1.5 Pro) are around 33% accuracy, highlighting the severe deficiency of these techniques in egocentric QA assistance. Our further investigations suggest that precise temporal grounding and multi-frame reasoning, along with high resolution and auxiliary scene-text inputs, are key for better performance. With thorough analyses and heuristic suggestions, we hope EgoTextVQA can serve as a solid testbed for research in egocentric scene-text QA assistance. Our dataset is released at: https://github.com/zhousheng97/EgoTextVQA.
