CV · Mar 9

Listening with the Eyes: Benchmarking Egocentric Co-Speech Grounding across Space and Time

arXiv:2603.07966v1
Predicted impact: top 7% in CV · last 90 days
Originality: Highly original
AI Analysis

This work provides a strict benchmark for evaluating multimodal language models on their ability to perform event-level speech-gesture binding, which is crucial for situated collaboration and understanding deictic interactions.

This paper introduces Egocentric Co-Speech Grounding (EcoG), a task in which underspecified deictic commands are grounded by aligning speech with co-speech pointing strokes. It presents EcoG-Bench, a bilingual (EN/ZH) diagnostic benchmark of 811 egocentric clips, and shows that human subjects achieve 96.9% strict Eco-Accuracy while the best native video-audio MLLM setting (Gemini-3-Pro) reaches only 17.0%, revealing a severe executability gap.

In situated collaboration, speakers often use intentionally underspecified deictic commands (e.g., "pass me that"), whose referent becomes identifiable only by aligning speech with a brief co-speech pointing stroke. However, many embodied benchmarks admit language-only shortcuts, allowing MLLMs to perform well without learning the audio-visual alignment required by deictic interaction. To bridge this gap, we introduce Egocentric Co-Speech Grounding (EcoG), where grounding is executable only if an agent jointly predicts What, Where, and When. To operationalize this, we present EcoG-Bench, an evaluation-only bilingual (EN/ZH) diagnostic benchmark of 811 egocentric clips with dense spatial annotations and millisecond-level stroke supervision. It is organized under a Progressive Cognitive Evaluation protocol. Benchmarking state-of-the-art MLLMs reveals a severe executability gap: while human subjects achieve near-ceiling performance on EcoG-Bench (96.9% strict Eco-Accuracy), the best native video-audio setting remains low (Gemini-3-Pro: 17.0%). Moreover, in a diagnostic ablation, replacing the native video-audio interface with timestamped frame samples and externally verified ASR (with word-level timing) substantially improves the same model (17.0% → 42.9%). Overall, EcoG-Bench provides a strict, executable testbed for event-level speech-gesture binding, and suggests that multimodal interfaces may bottleneck the observability of temporal alignment cues, independently of model reasoning.
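
To make the "jointly predicts What, Where, and When" requirement concrete, here is a minimal sketch of a strict, all-or-nothing accuracy score. It is an illustration only and not the benchmark's actual scoring code: the abstract does not specify the matching criteria, so the class names, the point-in-box test for Where, the stroke-window test for When, and the `tolerance_ms` parameter are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    label: str       # What: predicted referent name
    point: tuple     # Where: (x, y) pixel location
    time_ms: float   # When: predicted moment of the pointing stroke


@dataclass
class GroundTruth:
    label: str
    box: tuple        # (x_min, y_min, x_max, y_max) around the referent
    stroke_ms: tuple  # (start, end) of the annotated stroke


def is_strictly_correct(pred, gt, tolerance_ms=0.0):
    """Return True only if What, Where, and When are all correct.

    The three criteria below are assumed for illustration.
    """
    what = pred.label.strip().lower() == gt.label.strip().lower()
    x, y = pred.point
    x_min, y_min, x_max, y_max = gt.box
    where = x_min <= x <= x_max and y_min <= y <= y_max
    start, end = gt.stroke_ms
    when = start - tolerance_ms <= pred.time_ms <= end + tolerance_ms
    return what and where and when


def strict_accuracy(preds, gts, tolerance_ms=0.0):
    """Fraction of clips on which all three components are correct."""
    if not gts:
        return 0.0
    hits = sum(is_strictly_correct(p, g, tolerance_ms) for p, g in zip(preds, gts))
    return hits / len(gts)
```

Under a joint criterion of this kind, a prediction that names the right object and location but places the stroke at the wrong time counts as a miss, so shortcuts that skip temporal alignment earn no partial credit.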

