CV, MM | Jul 26, 2024

MangaUB: A Manga Understanding Benchmark for Large Multimodal Models

Hikaru Ikuta, Leslie Wöhler, Kiyoharu Aizawa

arXiv:2407.19034v1 | 8.76 citations | h-index: 4

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for better evaluation of LMMs in manga understanding, which is important for researchers and developers in multimodal AI, though it is incremental as it focuses on benchmarking rather than novel methods.

The authors tackled the problem of evaluating large multimodal models (LMMs) for manga understanding by creating MangaUB, a benchmark that tests recognition and cross-panel comprehension; results showed strong performance on image content recognition but challenges in understanding emotion and information across panels.

Manga is a popular medium that combines stylized drawings and text to convey stories. As manga panels differ from natural images, computational systems traditionally had to be designed specifically for manga. Recently, the adaptive nature of modern large multimodal models (LMMs) has opened up possibilities for more general approaches. To analyze the current capability of LMMs on manga understanding tasks and to identify areas for improvement, we design and evaluate MangaUB, a novel manga understanding benchmark for LMMs. MangaUB is designed to assess the recognition and understanding of content shown in a single panel as well as content conveyed across multiple panels, allowing for a fine-grained analysis of the various capabilities a model needs for manga understanding. Our results show strong performance on the recognition of image content, while understanding the emotion and information conveyed across multiple panels remains challenging, highlighting directions for future work towards LMMs for manga understanding.
