Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains
This work addresses the challenge of generalizing video understanding to novel domains for embodied agents. The contribution is incremental in that it builds on existing benchmarks and methods.
The authors tackle multimodal, long-form procedural video understanding by introducing Spacewalk-18, a benchmark with step recognition and video question answering tasks built on International Space Station spacewalk recordings. They also report an adaptation-via-summarization technique that significantly improves performance without model fine-tuning.
Learning from procedural videos has increasingly served as a pathway for embodied agents to acquire skills from human demonstrations. To do this, video understanding models must be able to build structured representations, such as the temporal segmentation of a demonstration into sequences of actions and skills, and to generalize them to novel environments, tasks, and problem domains. In pursuit of this goal, we introduce Spacewalk-18, a benchmark containing two tasks: (1) step recognition and (2) video question answering, over a dataset of temporally segmented and labeled tasks in International Space Station spacewalk recordings. In tandem, the two tasks quantify a model's ability to (1) generalize to novel domains and (2) utilize long temporal context and multimodal (e.g., visual and speech) information. Our extensive experimental analysis highlights the challenges of Spacewalk-18, but also suggests best practices for domain generalization and long-form understanding. Notably, we discover a promising adaptation-via-summarization technique that leads to significant performance improvement without model fine-tuning. The Spacewalk-18 benchmark is released at https://brown-palm.github.io/Spacewalk-18/.
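To make the notion of a temporally segmented, labeled recording concrete, the following is a minimal sketch of how step recognition could be scored against such annotations. It is an illustration only and not the benchmark's actual data format or evaluation code: the `Segment` class, the helper functions, the step names, and the timestamps are all hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Segment:
    """One annotated step in a recording (hypothetical schema)."""
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds (exclusive)
    step: str     # step label


def step_at(segments: List[Segment], t: float) -> Optional[str]:
    """Return the label of the segment covering time t, or None if unlabeled."""
    for seg in segments:
        if seg.start <= t < seg.end:
            return seg.step
    return None


def step_recognition_accuracy(
    segments: List[Segment],
    query_times: List[float],
    predict: Callable[[float], str],
) -> float:
    """Fraction of labeled query timestamps where predict(t) matches the annotation.

    Query times that fall outside every annotated segment are skipped.
    """
    labeled = [(t, step_at(segments, t)) for t in query_times]
    labeled = [(t, gold) for t, gold in labeled if gold is not None]
    if not labeled:
        return 0.0
    correct = sum(1 for t, gold in labeled if predict(t) == gold)
    return correct / len(labeled)


# Placeholder annotations; step names and times are invented for illustration.
segments = [
    Segment(0.0, 60.0, "step A"),
    Segment(60.0, 180.0, "step B"),
    Segment(180.0, 300.0, "step C"),
]


def always_b(t: float) -> str:
    """Trivial baseline that predicts the same step for every timestamp."""
    return "step B"


accuracy = step_recognition_accuracy(segments, [30.0, 90.0, 150.0, 240.0, 400.0], always_b)
```

In this sketch a real model would replace `always_b`, taking the video and speech context around the query timestamp as input rather than the timestamp alone.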