CV, CL, RO · Oct 26, 2025

Look and Tell: A Dataset for Multimodal Grounding Across Egocentric and Exocentric Views

arXiv:2510.22672v2 · 1 citation · h-index: 4
Originality: Synthesis-oriented
AI Analysis

This dataset provides a benchmark for developing embodied agents that can understand situated dialogue, specifically for evaluating how different spatial representations (2D vs. 3D; ego vs. exo) affect multimodal grounding.

The authors address referential communication across spatial perspectives by introducing the Look and Tell dataset: 3.67 hours of synchronized gaze, speech, and video recordings with 2,707 annotated referential expressions, collected from 25 participants using Meta Project Aria smart glasses and stationary cameras in a kitchen setting.

We introduce Look and Tell, a multimodal dataset for studying referential communication across egocentric and exocentric perspectives. Using Meta Project Aria smart glasses and stationary cameras, we recorded synchronized gaze, speech, and video as 25 participants instructed a partner to identify ingredients in a kitchen. Combined with 3D scene reconstructions, this setup provides a benchmark for evaluating how different spatial representations (2D vs. 3D; ego vs. exo) affect multimodal grounding. The dataset contains 3.67 hours of recordings, including 2,707 richly annotated referential expressions, and is designed to advance the development of embodied agents that can understand and engage in situated dialogue.

