CL AI CV Jan 31, 2025

Do Large Multimodal Models Solve Caption Generation for Scientific Figures? Lessons Learned from SciCap Challenge 2023

Ting-Yao E. Hsu, Yi-Li Hsu, Shaurya Rohatgi, Chieh-Yang Huang, Ho Yin Sam Ng, Ryan Rossi, Sungchul Kim, Tong Yu, Lun-Wei Ku, C. Lee Giles, Ting-Hao K. Huang

arXiv:2501.19353v36.73 citations h-index: 27

Originality Synthesis-oriented

AI Analysis

This work assesses whether advanced LMMs have solved the problem of generating high-quality captions for scientific figures, which is crucial for improving accessibility and understanding in scholarly articles.

The paper evaluated large multimodal models (LMMs) on caption generation for scientific figures using the SciCap Challenge 2023 data, finding that GPT-4V-generated captions were overwhelmingly preferred by professional editors over other models and even original author-written captions.

Since the SciCap dataset's launch in 2021, the research community has made significant progress in generating captions for scientific figures in scholarly articles. In 2023, the first SciCap Challenge took place, inviting global teams to use an expanded SciCap dataset to develop models for captioning diverse figure types across various academic fields. At the same time, text generation models advanced quickly, with many powerful pre-trained large multimodal models (LMMs) emerging that showed impressive capabilities in various vision-and-language tasks. This paper presents an overview of the first SciCap Challenge and details the performance of various models on its data, capturing a snapshot of the field's state. We found that professional editors overwhelmingly preferred figure captions generated by GPT-4V over those from all other models and even the original captions written by authors. Following this key finding, we conducted detailed analyses to answer this question: Have advanced LMMs solved the task of generating captions for scientific figures?
