CV · Mar 9

Listening with the Eyes: Benchmarking Egocentric Co-Speech Grounding across Space and Time

Weijie Zhou, Xuantang Xiong, Zhenlin Hu, Xiaomeng Zhu, Chaoyang Zhao, Honghui Dong, Zhengyou Zhang, Ming Tang, Jinqiao Wang

arXiv:2603.07966v1 · 21.0 · h-index: 4

Predicted impact: top 7% in CV · last 90 days · Originality: Highly original

AI Analysis

This work provides a strict benchmark for evaluating multimodal language models on their ability to perform event-level speech-gesture binding, which is crucial for situated collaboration and understanding deictic interactions.

This paper introduces Egocentric Co-Speech Grounding (EcoG), the task of grounding underspecified deictic commands by aligning speech with co-speech pointing strokes. It presents EcoG-Bench, a bilingual diagnostic benchmark of 811 egocentric clips, and reports that human subjects reach 96.9% strict Eco-Accuracy while the best native video-audio MLLM setting (Gemini-3-Pro) reaches only 17.0%, which the authors describe as a severe executability gap.

In situated collaboration, speakers often use intentionally underspecified deictic commands (e.g., "pass me that"), whose referent becomes identifiable only by aligning speech with a brief co-speech pointing stroke. However, many embodied benchmarks admit language-only shortcuts, allowing MLLMs to perform well without learning the audio-visual alignment required by deictic interaction. To bridge this gap, we introduce Egocentric Co-Speech Grounding (EcoG), where grounding is executable only if an agent jointly predicts What, Where, and When. To operationalize this, we present EcoG-Bench, an evaluation-only bilingual (EN/ZH) diagnostic benchmark of 811 egocentric clips with dense spatial annotations and millisecond-level stroke supervision. It is organized under a Progressive Cognitive Evaluation protocol. Benchmarking state-of-the-art MLLMs reveals a severe executability gap: while human subjects achieve near-ceiling performance on EcoG-Bench (96.9% strict Eco-Accuracy), the best native video-audio setting remains low (Gemini-3-Pro: 17.0%). Moreover, in a diagnostic ablation, replacing the native video-audio interface with timestamped frame samples and externally verified ASR (with word-level timing) substantially improves the same model (17.0% → 42.9%). Overall, EcoG-Bench provides a strict, executable testbed for event-level speech-gesture binding, and suggests that multimodal interfaces may bottleneck the observability of temporal alignment cues, independently of model reasoning.
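To make the "jointly predicts What, Where, and When" requirement concrete, here is a minimal sketch of a strict joint-accuracy score in Python. It is not the paper's metric definition, which the abstract does not spell out. The `Prediction` and `GroundTruth` types, the point-in-box spatial test, the stroke-interval temporal test, and the `time_tol` parameter are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    what: str                    # predicted referent label (hypothetical field)
    where: tuple[float, float]   # predicted point (x, y) in pixels (hypothetical)
    when: float                  # predicted stroke time in seconds (hypothetical)


@dataclass
class GroundTruth:
    what: str
    box: tuple[float, float, float, float]  # x_min, y_min, x_max, y_max
    stroke: tuple[float, float]             # stroke start and end, in seconds


def is_executable(pred: Prediction, gt: GroundTruth, time_tol: float = 0.0) -> bool:
    """Return True only if What, Where, and When are all correct.

    Assumed criteria (not taken from the paper): case-insensitive label
    match, predicted point inside the annotated box, predicted time inside
    the annotated stroke interval widened by time_tol seconds.
    """
    what_ok = pred.what.strip().lower() == gt.what.strip().lower()
    x, y = pred.where
    x0, y0, x1, y1 = gt.box
    where_ok = x0 <= x <= x1 and y0 <= y <= y1
    t0, t1 = gt.stroke
    when_ok = (t0 - time_tol) <= pred.when <= (t1 + time_tol)
    return what_ok and where_ok and when_ok


def strict_accuracy(preds: list[Prediction], gts: list[GroundTruth],
                    time_tol: float = 0.0) -> float:
    """Fraction of clips where all three components are jointly correct."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(is_executable(p, g, time_tol) for p, g in zip(preds, gts))
    return hits / len(preds)
```

Under this kind of conjunctive scoring, a prediction that names the right object and points at the right location still counts as a failure if its timing misses the stroke, which is one way a strict metric can sit far below per-component accuracies.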
