Multimodal Datasets and Benchmarks for Reasoning about Dynamic Spatio-Temporality in Everyday Environments
This work addresses the need for standardized benchmarks in Embodied AI to assess reasoning in dynamic spatio-temporal contexts, though it appears incremental as it builds on existing simulation and annotation methods.
The researchers address the problem of measuring how well AI understands human behavior and the environment in home settings by creating a multimodal dataset with a 3D simulator. Preliminary experiments indicate that the dataset is useful for evaluating AI comprehension of daily life.
We used a 3D simulator to create artificial video data with standardized annotations, with the aim of supporting the development of Embodied AI. Our question-answering (QA) dataset measures the extent to which a robot can understand human behavior and the environment in a home setting. Preliminary experiments suggest that our dataset is useful for measuring AI's comprehension of daily life. \end{abstract}