MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos
MMSum gives multimodal AI researchers a new dataset that addresses limitations in existing MSMO datasets, though the contribution is incremental because it focuses on data curation rather than novel methods.
The authors tackle the lack of comprehensive datasets for multimodal summarization with multimodal output (MSMO) by curating the MMSum dataset, which includes human-validated summaries for both video and textual content, a categorization spanning 17 principal categories and 170 subcategories, and benchmark tests for video, text, and multimodal summarization.
Multimodal summarization with multimodal output (MSMO) has emerged as a promising research direction. However, existing public MSMO datasets suffer from several limitations, including insufficient maintenance, data inaccessibility, limited size, and the absence of proper categorization. To address these limitations and provide a comprehensive dataset for this new direction, we have curated the \textbf{MMSum} dataset. Our new dataset features (1) human-validated summaries for both video and textual content, providing superior human instruction and labels for multimodal learning; (2) a carefully arranged categorization spanning 17 principal categories and 170 subcategories, covering a diverse array of real-world scenarios; and (3) benchmark tests performed on the proposed dataset to assess various tasks and methods, including \textit{video summarization}, \textit{text summarization}, and \textit{multimodal summarization}. To promote accessibility and collaboration, we will release the \textbf{MMSum} dataset and the data collection tool as fully open-source resources, fostering transparency and accelerating future developments. Our project website can be found at~\url{https://mmsum-dataset.github.io/}.