CV Mar 11, 2025

ComicsPAP: understanding comic strips by picking the correct panel

Emanuele Vivoli, Artemis Llabrés, Mohamed Ali Souibgui, Marco Bertini, Ernest Valveny Llobet, Dimosthenis Karatzas

arXiv:2503.08561v38.42 citations, h-index: 7, ICDAR

Originality Synthesis-oriented

AI Analysis

This work addresses the limited multimodal comprehension of comics, a problem relevant to AI researchers. The contribution is incremental: it focuses on benchmarking and model adaptation rather than on a new paradigm.

The authors tackle comic strip understanding by creating ComicsPAP, a large-scale benchmark with over 100k samples. They find that current state-of-the-art large multimodal models perform near chance on its tasks, while their adapted models outperform models 10x larger.

Large multimodal models (LMMs) have made impressive strides in image captioning, VQA, and video comprehension, yet they still struggle with the intricate temporal and spatial cues found in comics. To address this gap, we introduce ComicsPAP, a large-scale benchmark designed for comic strip understanding. Comprising over 100k samples and organized into 5 subtasks under a Pick-a-Panel framework, ComicsPAP requires models to identify the missing panel in a sequence. Our evaluations, conducted under both multi-image and single-image protocols, reveal that current state-of-the-art LMMs perform near chance on these tasks, underscoring significant limitations in capturing sequential and contextual dependencies. To close the gap, we adapted LMMs for comic strip understanding, obtaining better results on ComicsPAP than 10x bigger models, demonstrating that ComicsPAP offers a robust resource to drive future research in multimodal comic comprehension.
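To make "near chance" concrete, the following is a minimal sketch of how a Pick-a-Panel style evaluation could be scored. It is not the benchmark's actual schema or evaluation code: the `PickAPanelSample` fields, the `<MISSING>` placeholder, the toy panel names, and the choice of five options per sample are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class PickAPanelSample:
    """Hypothetical sample layout (an assumption, not the ComicsPAP format)."""
    context: List[str]   # panel identifiers in reading order, one slot masked
    options: List[str]   # candidate panels offered to the model
    answer: int          # index into `options` of the correct panel


def chance_accuracy(samples: Sequence[PickAPanelSample]) -> float:
    """Expected accuracy of a uniform random guesser: mean of 1 / #options."""
    return sum(1.0 / len(s.options) for s in samples) / len(samples)


def accuracy(
    samples: Sequence[PickAPanelSample],
    predict: Callable[[PickAPanelSample], int],
) -> float:
    """Fraction of samples where the predicted option index is correct."""
    return sum(predict(s) == s.answer for s in samples) / len(samples)


def make_toy_samples(n: int, n_options: int, seed: int = 0) -> List[PickAPanelSample]:
    """Build toy samples with made-up panel names; purely illustrative data."""
    rng = random.Random(seed)
    samples = []
    for i in range(n):
        strip = [f"strip{i}_panel{j}" for j in range(4)]
        missing = rng.randrange(len(strip))
        correct = strip[missing]
        distractors = [f"strip{i}_distractor{k}" for k in range(n_options - 1)]
        options = distractors + [correct]
        rng.shuffle(options)
        context = strip.copy()
        context[missing] = "<MISSING>"  # assumed placeholder for the masked slot
        samples.append(PickAPanelSample(context, options, options.index(correct)))
    return samples


toy = make_toy_samples(n=1000, n_options=5)
rng = random.Random(1)
random_acc = accuracy(toy, lambda s: rng.randrange(len(s.options)))
oracle_acc = accuracy(toy, lambda s: s.answer)
print(f"chance level: {chance_accuracy(toy):.3f}")
print(f"random guesser: {random_acc:.3f}")
print(f"oracle: {oracle_acc:.3f}")
```

Under these assumptions, a model that scores close to `chance_accuracy` is doing no better than picking an option at random, which is the sense in which the abstract describes current LMMs.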
