CL AI CV LG · May 29, 2025

Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint

Heekyung Lee, Jiaxin Ge, Tsung-Han Wu, Minwoo Kang, Trevor Darrell, David M. Chan

arXiv:2505.23759v29.63 citations · h-index: 10 · Has Code · EMNLP

Originality: Synthesis-oriented

AI Analysis

This work addresses a specific problem in multi-modal AI for puzzle-solving, but it is incremental as it primarily benchmarks existing models without introducing new methods.

The paper tackles the challenge of rebus puzzles for vision-language models (VLMs). It constructs a benchmark and finds that VLMs struggle with abstract reasoning and visual metaphors, succeeding only to a limited degree on simple clues.

Rebus puzzles, visual riddles that encode language through imagery, spatial arrangement, and symbolic substitution, pose a unique challenge to current vision-language models (VLMs). Unlike traditional image captioning or question answering tasks, rebus solving requires multi-modal abstraction, symbolic reasoning, and a grasp of cultural, phonetic and linguistic puns. In this paper, we investigate the capacity of contemporary VLMs to interpret and solve rebus puzzles by constructing a hand-generated and annotated benchmark of diverse English-language rebus puzzles, ranging from simple pictographic substitutions to spatially-dependent cues ("head" over "heels"). We analyze how different VLMs perform, and our findings reveal that while VLMs exhibit some surprising capabilities in decoding simple visual clues, they struggle significantly with tasks requiring abstract reasoning, lateral thinking, and understanding visual metaphors.
