CVCLApr 1, 2025

WikiVideo: Article Generation from Multiple Videos

arXiv:2504.00939v29 citationsh-index: 16
Originality Highly original
AI Analysis

This addresses the gap in video-based summarization for high-level event semantics, enabling more accurate and grounded content creation from multimodal sources.

The paper tackles the problem of generating Wikipedia-style articles from multiple videos about real-world events, ensuring all information is video-grounded, and introduces a benchmark and a method that outperforms existing VideoLLMs in retrieval-augmented generation settings.

We introduce the task of grounded article generation with the goal of creating a Wikipedia-style article from multiple diverse videos about real-world events -- from natural disasters to political elections -- where all the information in the article is supported by video evidence. Videos are intuitive sources for retrieval-augmented generation (RAG), but most contemporary RAG workflows focus heavily on text while existing methods for video-based summarization focus on low-level scene understanding rather than high-level event semantics. To close this gap, we introduce WikiVideo, a benchmark consisting of expert-written articles and densely annotated videos that provide evidence for articles' claims, facilitating the integration of video into RAG pipelines and enabling the creation of in-depth content that is grounded in multimodal sources. We further propose Collaborative Article Generation (CAG), a novel interactive method for article creation from multiple videos. CAG leverages an iterative interaction between an r1-style reasoning model and a VideoLLM to draw higher-level inferences about the target event than is possible with VideoLLMs alone, which fixate on low-level visual features. We benchmark state-of-the-art VideoLLMs and CAG in both oracle retrieval and RAG settings and find that CAG consistently outperforms alternative methods, while suggesting intriguing avenues for future work.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes