CV | AI | CL | Sep 12, 2025

Humor in Pixels: Benchmarking Large Multimodal Models Understanding of Online Comics

arXiv:2509.12248v2 | 4 citations | h-index: 9 | EMNLP
Originality: Synthesis-oriented
AI Analysis

This paper addresses the problem of improving social intelligence in AI for more natural interactions. Its contribution is incremental, since it focuses on benchmarking rather than proposing a new method.

The paper tackles the challenge of humor understanding in Large Multimodal Models (LMMs) by introducing PixelHumor, a benchmark dataset of 2,800 annotated comics, and finds that top models achieve only 61% accuracy in panel sequencing, highlighting significant gaps compared to human performance.

Understanding humor is a core aspect of social intelligence, yet it remains a significant challenge for Large Multimodal Models (LMMs). We introduce PixelHumor, a benchmark dataset of 2,800 annotated multi-panel comics designed to evaluate LMMs' ability to interpret multimodal humor and recognize narrative sequences. Experiments with state-of-the-art LMMs reveal substantial gaps: for instance, top models achieve only 61% accuracy in panel sequencing, far below human performance. This underscores critical limitations in current models' integration of visual and textual cues for coherent narrative and humor understanding. By providing a rigorous framework for evaluating multimodal contextual and narrative reasoning, PixelHumor aims to drive the development of LMMs that better engage in natural, socially aware interactions.

