CV · Mar 6, 2024

Multimodal Transformer for Comics Text-Cloze

arXiv:2403.03719v1 · 6 citations · h-index: 7 · ICDAR
Originality: Incremental advance
AI Analysis

This work addresses a specific problem in comics analysis for researchers and developers, offering incremental improvements in multimodal understanding.

The paper tackles the Text-cloze task in comics: selecting the correct text for a comic panel given its neighboring panels. It introduces a novel Multimodal Large Language Model architecture that achieves a 10% improvement over state-of-the-art models on both the easy and hard variants, with a further 1% gain from newly released OCR annotations.

This work explores a closure task in comics, a medium where visual and textual elements are intricately intertwined. Specifically, Text-cloze refers to the task of selecting the correct text to use in a comic panel, given its neighboring panels. Traditional methods based on recurrent neural networks have struggled with this task due to limited OCR accuracy and inherent model limitations. We introduce a novel Multimodal Large Language Model (Multimodal-LLM) architecture, specifically designed for Text-cloze, achieving a 10% improvement over existing state-of-the-art models in both its easy and hard variants. Central to our approach is a Domain-Adapted ResNet-50 based visual encoder, fine-tuned to the comics domain in a self-supervised manner using SimCLR. This encoder delivers comparable results to more complex models with just one-fifth of the parameters. Additionally, we release new OCR annotations for this dataset, enhancing model input quality and resulting in another 1% improvement. Finally, we extend the task to a generative format, establishing new baselines and expanding the research possibilities in the field of comics analysis.
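The abstract mentions that the visual encoder is adapted to comics with SimCLR. As a rough illustration of what that self-supervised objective computes, the sketch below implements SimCLR's contrastive loss (NT-Xent) in plain Python. It is not the authors' training code: the function names, the temperature of 0.5, and the toy 2-D embeddings are all assumptions made for illustration, and a real setup would compute embeddings with the ResNet-50 encoder over augmented comic panels.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent_loss(views_a, views_b, temperature=0.5):
    """NT-Xent loss over paired embeddings.

    views_a[i] and views_b[i] are embeddings of two augmentations of the
    same image i (hypothetical inputs; the temperature is an assumed value).
    """
    z = list(views_a) + list(views_b)
    n = len(views_a)
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the other view of the same image
        # Similarities to every other sample in the batch (self excluded).
        logits = [cosine(z[i], z[k]) / temperature
                  for k in range(2 * n) if k != i]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - cosine(z[i], z[pos]) / temperature
    return total / (2 * n)


# Toy check: matching views should give a lower loss than swapped views.
anchors = [[1.0, 0.0], [0.0, 1.0]]
matched = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]
loss_matched = nt_xent_loss(anchors, matched)
loss_swapped = nt_xent_loss(anchors, swapped)
print(loss_matched < loss_swapped)
```

Minimizing this loss pulls the two views of the same panel together and pushes views of different panels apart, which is how an encoder can be adapted to a new visual domain without labels.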
