Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding
For researchers in egocentric and embodied AI, this benchmark evaluates intermediate reasoning steps rather than only final answers, and shows that the gap between current models and humans can be partially closed by spatiotemporal hints.
Minerva-Ego is a benchmark for egocentric video reasoning with multi-step multimodal questions and spatiotemporal reasoning traces. It reveals a large gap between state-of-the-art models and human performance, and shows that hints about 'where' and 'when' to look substantially improve model accuracy.
Video reasoning models are a core component of egocentric and embodied agents. However, standard benchmarks for assessing these models evaluate only the final output (e.g. the answer to a question), not the intermediate reasoning steps, and most provide answers only in the text domain. We introduce Minerva-Ego, a benchmark for evaluating complex egocentric visual reasoning. We extend recent high-quality video data sources recorded in egocentric and embodied settings with a set of challenging, multi-step multimodal questions and spatiotemporally dense, human-annotated reasoning traces. Benchmarking experiments show that state-of-the-art models still fall well short of human performance. To investigate this gap in detail, we annotate each reasoning trace in the dataset with the objects of interest required to solve the question, in the form of spatiotemporal mask annotations. Through extensive evaluations, we find that prompting frontier models with hints of 'where' and 'when' to look yields substantial improvements in performance. Minerva-Ego can be downloaded at https://github.com/google-deepmind/neptune.
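To make the idea of 'where' and 'when' hints concrete, the following is a minimal sketch of how such hints might be folded into a text prompt. It is not the benchmark's actual prompting format, which the abstract does not specify. The `SpatioTemporalHint` class, its fields, the `build_prompt` function, and the choice to summarize each mask as a normalized bounding box are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SpatioTemporalHint:
    """Hypothetical hint: an object of interest, when it appears, and where."""
    label: str
    start_s: float  # 'when': start of the relevant time window, in seconds
    end_s: float    # 'when': end of the relevant time window, in seconds
    # 'where': assumed normalized box (x_min, y_min, x_max, y_max),
    # e.g. the extent of a mask annotation
    box: Tuple[float, float, float, float]


def format_hint(hint: SpatioTemporalHint) -> str:
    """Render one hint as a single line of prompt text."""
    x0, y0, x1, y1 = hint.box
    return (
        f"- {hint.label}: look between {hint.start_s:.1f}s and {hint.end_s:.1f}s, "
        f"in region x=[{x0:.2f}, {x1:.2f}], y=[{y0:.2f}, {y1:.2f}]"
    )


def build_prompt(question: str, hints: List[SpatioTemporalHint]) -> str:
    """Combine a question with optional hints, ordered by start time."""
    lines = [f"Question: {question}"]
    if hints:
        lines.append("Hints on when and where to look:")
        ordered = sorted(hints, key=lambda h: h.start_s)
        lines.extend(format_hint(h) for h in ordered)
    lines.append("Answer step by step, then state the final answer.")
    return "\n".join(lines)
```

Calling `build_prompt` with an empty hint list gives the unhinted baseline prompt, so the same function could be used to compare model accuracy with and without hints.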