CV | CL | Nov 27, 2023

EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models

Tsinghua
arXiv:2311.15596v2 | 72 citations | h-index: 35
AI Analysis

EgoThink is a new benchmark for evaluating first-person perspective capabilities in vision-language models. The contribution is incremental but important for advancing embodied AI and robotics.

The authors address the lack of evaluation of first-person perspective thinking in vision-language models, a key capability for autonomous agents. They introduce EgoThink, a benchmark built from egocentric video clips with manually annotated questions. GPT-4V performed best, but all evaluated models leave significant room for improvement, and model size was the most impactful factor.

Vision-language models (VLMs) have recently shown promising results in traditional downstream tasks. Evaluation studies have emerged to assess their abilities, with the majority focusing on the third-person perspective, and only a few addressing specific tasks from the first-person perspective. However, the capability of VLMs to "think" from a first-person perspective, a crucial attribute for advancing autonomous agents and robotics, remains largely unexplored. To bridge this research gap, we introduce EgoThink, a novel visual question-answering benchmark that encompasses six core capabilities with twelve detailed dimensions. The benchmark is constructed using selected clips from egocentric videos, with manually annotated question-answer pairs containing first-person information. To comprehensively assess VLMs, we evaluate eighteen popular VLMs on EgoThink. Moreover, given the open-ended format of the answers, we use GPT-4 as the automatic judge to compute single-answer grading. Experimental results indicate that although GPT-4V leads in numerous dimensions, all evaluated VLMs still possess considerable potential for improvement in first-person perspective tasks. Meanwhile, enlarging the number of trainable parameters has the most significant impact on model performance on EgoThink. In conclusion, EgoThink serves as a valuable addition to existing evaluation benchmarks for VLMs, providing an indispensable resource for future research in the realm of embodied artificial intelligence and robotics.
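The single-answer grading step described in the abstract can be illustrated with a minimal sketch. It assumes a judge that receives one prompt per question-answer pair and replies with a numeric score, which is then averaged per dimension. The prompt wording, the 0 / 0.5 / 1 scale, the sample field names, and the `judge` callable are all assumptions made for illustration; they are not taken from the EgoThink code, and in practice `judge` would wrap a call to GPT-4.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical judge prompt; the benchmark's actual prompt may differ.
JUDGE_TEMPLATE = (
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Model answer: {prediction}\n"
    "Rate the model answer as 0 (wrong), 0.5 (partially correct), "
    "or 1 (correct). Reply with the number only."
)


def parse_score(reply: str) -> float:
    """Take the first number in the judge's reply and clamp it to [0, 1]."""
    match = re.search(r"\d*\.?\d+", reply)
    if match is None:
        return 0.0  # assumption: an unparseable reply counts as wrong
    return min(1.0, max(0.0, float(match.group())))


def grade(
    samples: List[Dict[str, str]],
    judge: Callable[[str], str],
) -> Dict[str, float]:
    """Grade each answer on its own, then average the scores per dimension."""
    per_dimension = defaultdict(list)
    for sample in samples:
        prompt = JUDGE_TEMPLATE.format(
            question=sample["question"],
            reference=sample["reference"],
            prediction=sample["prediction"],
        )
        per_dimension[sample["dimension"]].append(parse_score(judge(prompt)))
    return {dim: sum(s) / len(s) for dim, s in per_dimension.items()}
```

Passing the judge in as a callable keeps the sketch runnable offline: a stub function can stand in for the model while the aggregation logic is checked.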

Code Implementations: 1 repo