CL CVMay 23, 2023

Let's Think Frame by Frame with VIP: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought

Vaishnavi Himakunthala, Andy Ouyang, Daniel Rose, Ryan He, Alex Mei, Yujie Lu, Chinmay Sonar, Michael Saxon, William Yang Wang

arXiv:2305.13903v322.5140 citationsHas Code

Originality Incremental advance

AI Analysis

This addresses the challenge of video reasoning for AI systems, offering a new dataset and tasks, but it is incremental as it builds on existing vision-language methods.

The paper tackles the problem of video reasoning by framing it as sequential understanding of keyframes, introducing the VIP dataset for evaluating video chain-of-thought, and benchmarks models like GPT-4, showing a performance gap in tasks such as Video Infilling and Prediction.

Despite exciting recent results showing vision-language systems' capacity to reason about images using natural language, their capacity for video reasoning remains under-explored. We motivate framing video reasoning as the sequential understanding of a small number of keyframes, thereby leveraging the power and robustness of vision-language while alleviating the computational complexities of processing videos. To evaluate this novel application, we introduce VIP, an inference-time challenge dataset designed to explore models' reasoning capabilities through video chain-of-thought. Inspired by visually descriptive scene plays, we propose two formats for keyframe description: unstructured dense captions and structured scene descriptions that identify the focus, action, mood, objects, and setting (FAMOuS) of the keyframe. To evaluate video reasoning, we propose two tasks: Video Infilling and Video Prediction, which test abilities to generate multiple intermediate keyframes and predict future keyframes, respectively. We benchmark GPT-4, GPT-3, and VICUNA on VIP, demonstrate the performance gap in these complex video reasoning tasks, and encourage future work to prioritize language models for efficient and generalized video reasoning.

View on arXiv PDF Code

Similar