CinePile: A Long Video Question Answering Dataset and Benchmark
CinePile provides a benchmark for long-form video understanding, addressing a gap in current datasets for researchers in computer vision and AI.
The authors address the lack of genuine long-form video comprehension challenges by introducing CinePile, a dataset of 305,000 multiple-choice questions. They find that fine-tuning open-source Video-LLMs significantly improves performance, though models still underperform compared to humans.
Current datasets for long-form video understanding often fall short of providing genuine long-form comprehension challenges, as many tasks derived from these datasets can be successfully tackled by analyzing just one or a few random frames from a video. To address this issue, we present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding. This paper details our innovative approach for creating a question-answer dataset, utilizing advanced LLMs with human-in-the-loop and building upon human-generated raw data. Our comprehensive dataset comprises 305,000 multiple-choice questions (MCQs), covering various visual and multimodal aspects, including temporal comprehension, understanding human-object interactions, and reasoning about events or actions within a scene. Additionally, we fine-tuned open-source Video-LLMs on the training split and evaluated both open-source and proprietary video-centric LLMs on the test split of our dataset. The findings indicate that although current models underperform compared to humans, fine-tuning these models can lead to significant improvements in their performance.
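Because the benchmark is built from multiple-choice questions, a model's score on the test split comes down to the fraction of questions for which it selects the correct option. The following is a minimal sketch of that scoring step, not the authors' evaluation code. The `MCQ` record, its field names, and the `accuracy` helper are hypothetical, and the sketch assumes each model response has already been mapped to a choice index.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class MCQ:
    """Hypothetical record for one multiple-choice question (field names are assumptions)."""
    question: str
    choices: Sequence[str]
    answer_index: int  # position of the correct option within `choices`


def accuracy(questions: Sequence[MCQ], predictions: Sequence[int]) -> float:
    """Return the fraction of questions whose predicted index matches the answer key."""
    if len(questions) != len(predictions):
        raise ValueError("questions and predictions must have the same length")
    if not questions:
        return 0.0
    correct = sum(
        1 for q, p in zip(questions, predictions) if p == q.answer_index
    )
    return correct / len(questions)


# Illustrative, made-up questions; these are not taken from the dataset.
demo_questions = [
    MCQ("What does the character pick up first?", ["A key", "A phone", "A book"], 0),
    MCQ("Which event happens after the door opens?", ["A fight", "A phone call"], 1),
]
demo_predictions = [0, 0]  # indices chosen by a hypothetical model
demo_score = accuracy(demo_questions, demo_predictions)
```

Comparing this score before and after fine-tuning, or against a human baseline computed the same way, gives the kind of comparison the abstract describes. Parsing free-form model output into a choice index is a separate step that this sketch leaves out.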