PosterSum: A Multimodal Benchmark for Scientific Poster Summarization
This work addresses the problem of summarizing scientific posters for researchers and developers, but its contribution is incremental: it builds on existing multimodal methods, adding a new dataset and a modest performance improvement.
The authors tackle the challenge of generating textual summaries from visually complex scientific posters by introducing PosterSum, a multimodal benchmark of 16,305 poster-abstract pairs, and by proposing Segment & Summarize, a hierarchical method that achieves a 3.14% gain in ROUGE-L over state-of-the-art models.
Generating accurate and concise textual summaries from multimodal documents is challenging, especially when dealing with visually complex content like scientific posters. We introduce PosterSum, a novel benchmark to advance the development of vision-language models that can understand and summarize scientific posters into research paper abstracts. Our dataset contains 16,305 conference posters paired with their corresponding abstracts as summaries. Each poster is provided in image format and presents diverse visual understanding challenges, such as complex layouts, dense text regions, tables, and figures. We benchmark state-of-the-art Multimodal Large Language Models (MLLMs) on PosterSum and demonstrate that they struggle to accurately interpret and summarize scientific posters. We propose Segment & Summarize, a hierarchical method that outperforms current MLLMs on automated metrics, achieving a 3.14% gain in ROUGE-L. This will serve as a starting point for future research on poster summarization.
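The headline result above is reported in ROUGE-L, which scores a generated summary by the longest common subsequence (LCS) of tokens it shares with the reference abstract. The following is a minimal sketch of that metric, not the authors' evaluation code. It assumes lowercased whitespace tokenization, no stemming, and a balanced F1. The function names `lcs_length` and `rouge_l` are illustrative, and the preprocessing used in the paper may differ.

```python
from typing import Dict, List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    # Dynamic programming over the two sequences, keeping only the previous row.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: str, reference: str) -> Dict[str, float]:
    """ROUGE-L precision, recall, and F1 for one candidate/reference pair.

    Assumption: lowercased whitespace tokens, no stemming, balanced F1.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    lcs = lcs_length(cand, ref)
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    f1 = 0.0 if lcs == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy strings for illustration only; they are not taken from the dataset.
    generated = "we introduce a benchmark for poster summarization"
    abstract = "we introduce PosterSum a novel benchmark for summarizing posters"
    print(rouge_l(generated, abstract))
```

Because the LCS only requires tokens to appear in the same order, not contiguously, ROUGE-L rewards summaries that preserve the reference's sentence-level structure even when wording is inserted or dropped in between.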