CV, Jul 14, 2025

CoSMo: A Multimodal Transformer for Page Stream Segmentation in Comic Books

arXiv:2507.10053v1
h-index: 7
2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)
Originality: Incremental advance
AI Analysis

The paper addresses a critical task for automated comic book content understanding, enabling downstream applications such as character analysis and story indexing. It is rated incremental because it applies existing Transformer methods to a specific domain.

The paper tackles Page Stream Segmentation in comic books by introducing CoSMo, a multimodal Transformer that outperforms traditional baselines and larger general-purpose models on F1-Macro, Panoptic Quality, and stream-level metrics, evaluated on a new 20,800-page annotated dataset.

This paper introduces CoSMo, a novel multimodal Transformer for Page Stream Segmentation (PSS) in comic books, a critical task for automated content understanding, as it is a necessary first stage for many downstream tasks like character analysis, story indexing, or metadata enrichment. We formalize PSS for this unique medium and curate a new 20,800-page annotated dataset. CoSMo, developed in vision-only and multimodal variants, consistently outperforms traditional baselines and significantly larger general-purpose vision-language models across F1-Macro, Panoptic Quality, and stream-level metrics. Our findings highlight the dominance of visual features for comic PSS macro-structure, yet demonstrate multimodal benefits in resolving challenging ambiguities. CoSMo establishes a new state-of-the-art, paving the way for scalable comic book analysis.
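
To make the Panoptic Quality metric mentioned above more concrete, here is a minimal Python sketch of how such a score can be computed over a page stream. It assumes the commonly used definition of Panoptic Quality (sum of IoU over matched segments, divided by TP + 0.5 FP + 0.5 FN, with a match requiring the same label and IoU above 0.5), applied to contiguous runs of pages. The paper's exact formulation may differ, and the function names and page labels (`cover`, `story`, `ad`) are hypothetical, chosen only for illustration.

```python
from itertools import groupby


def runs(labels):
    """Collapse per-page labels into (label, start, end) runs; end is exclusive."""
    segments, start = [], 0
    for label, group in groupby(labels):
        length = len(list(group))
        segments.append((label, start, start + length))
        start += length
    return segments


def iou(a, b):
    """Intersection over union of two page ranges given as (label, start, end)."""
    inter = max(0, min(a[2], b[2]) - max(a[1], b[1]))
    union = (a[2] - a[1]) + (b[2] - b[1]) - inter
    return inter / union


def panoptic_quality(true_labels, pred_labels):
    """Assumed PQ: sum of matched IoUs / (TP + 0.5 * FP + 0.5 * FN)."""
    true_segs, pred_segs = runs(true_labels), runs(pred_labels)
    # With an IoU threshold above 0.5, each segment can match at most one other.
    matched = [
        iou(t, p)
        for t in true_segs
        for p in pred_segs
        if t[0] == p[0] and iou(t, p) > 0.5
    ]
    tp = len(matched)
    fp = len(pred_segs) - tp  # predicted segments with no match
    fn = len(true_segs) - tp  # ground-truth segments with no match
    denom = tp + 0.5 * fp + 0.5 * fn
    return sum(matched) / denom if denom else 0.0


# Hypothetical 7-page stream: the prediction ends the first story one page early.
truth = ["cover", "story", "story", "story", "ad", "story", "story"]
pred = ["cover", "story", "story", "ad", "ad", "story", "story"]
score = panoptic_quality(truth, pred)
```

In this sketch a single misplaced boundary lowers the score twice: the shortened story segment still matches but with a reduced IoU, while the stretched ad segment falls to the 0.5 threshold and counts as both a false positive and a false negative.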
