LG · CL · CV · Mar 2

Semantic Similarity is a Spurious Measure of Comic Understanding: Lessons Learned from Hallucinations in a Benchmarking Experiment

arXiv:2603.01950v1 · h-index: 4
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for comic interpretation systems that support blind or visually impaired users. Its contribution is incremental: it benchmarks existing models and identifies their issues rather than proposing a new solution.

The researchers evaluate generative vision-language models for comic understanding by benchmarking them on interpretation tasks. They identify and categorize the hallucinations that occur, and conclude with guidance for future research.

A system that enables blind or visually impaired users to access comics/manga would introduce a new medium of storytelling to this community. However, no such system currently exists. Generative vision-language models (VLMs) have shown promise in describing images and understanding comics, but most research on comic understanding is limited to panel-level analysis. To fully support blind and visually impaired users, greater attention must be paid to page-level understanding and interpretation. In this work, we present a preliminary benchmark of VLM performance on comic interpretation tasks. We identify and categorize hallucinations that emerge during this process, organizing them into generalized object-hallucination taxonomies. We conclude with guidance on future research, emphasizing hallucination mitigation and improved data curation for comic interpretation.
