CL AI CV
Feb 12, 2025

What Is That Talk About? A Video-to-Text Summarization Dataset for Scientific Presentations

Dongqi Liu, Chenxi Whitehouse, Xi Yu, Louis Mahon, Rohit Saxena, Zheng Zhao, Yifu Qiu, Mirella Lapata, Vera Demberg

Cambridge

arXiv:2502.08279v4
14.715 citations
h-index: 86
Has Code
ACL

Originality: Synthesis-oriented

AI Analysis

This work addresses the challenge of video-to-text summarization for scientific presentations, providing a dataset and benchmarks to support future research in this domain-specific area.

The paper tackles the problem of summarizing scientific presentation videos into text by introducing VISTA, a dataset of 18,599 AI conference videos paired with paper abstracts. It shows that a plan-based framework improves summary quality and factual consistency, though a considerable gap persists between model and human performance.

Transforming recorded videos into concise and accurate textual summaries is a growing challenge in multimodal learning. This paper introduces VISTA, a dataset specifically designed for video-to-text summarization in scientific domains. VISTA contains 18,599 recorded AI conference presentations paired with their corresponding paper abstracts. We benchmark the performance of state-of-the-art large models and apply a plan-based framework to better capture the structured nature of abstracts. Both human and automated evaluations confirm that explicit planning enhances summary quality and factual consistency. However, a considerable gap remains between models and human performance, highlighting the challenges of our dataset. This study aims to pave the way for future research on scientific video-to-text summarization.
