CL · AI · CV · LG · May 29, 2025

Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint

arXiv:2505.23759v2 · 3 citations · h-index: 10 · EMNLP
Originality: Synthesis-oriented
AI Analysis

This work addresses a specific problem in multi-modal AI for puzzle-solving, but it is incremental as it primarily benchmarks existing models without introducing new methods.

The paper tackles the challenge that rebus puzzles pose for vision-language models (VLMs). It constructs a benchmark and finds that VLMs succeed mainly on simple visual clues while struggling with abstract reasoning and visual metaphors.

Rebus puzzles, visual riddles that encode language through imagery, spatial arrangement, and symbolic substitution, pose a unique challenge to current vision-language models (VLMs). Unlike traditional image captioning or question answering tasks, rebus solving requires multi-modal abstraction, symbolic reasoning, and a grasp of cultural, phonetic and linguistic puns. In this paper, we investigate the capacity of contemporary VLMs to interpret and solve rebus puzzles by constructing a hand-generated and annotated benchmark of diverse English-language rebus puzzles, ranging from simple pictographic substitutions to spatially-dependent cues ("head" over "heels"). We analyze how different VLMs perform, and our findings reveal that while VLMs exhibit some surprising capabilities in decoding simple visual clues, they struggle significantly with tasks requiring abstract reasoning, lateral thinking, and understanding visual metaphors.

Code Implementations: 1 repo
