Non Verbis, Sed Rebus: Large Language Models are Weak Solvers of Italian Rebuses
This work addresses the challenge of assessing large language models' linguistic proficiency and sequential instruction-following skills. Its contribution is incremental: it pairs existing evaluation methods with a new dataset.
The authors evaluate large language models on Italian rebuses, a constrained multi-step reasoning task. General-purpose models such as LLaMA-3 and GPT-4o perform poorly, and although ad-hoc fine-tuning improves performance, the gains stem largely from memorization.
Rebuses are puzzles requiring constrained multi-step reasoning to identify a hidden phrase from a set of images and letters. In this work, we introduce a large collection of verbalized rebuses for the Italian language and use it to assess the rebus-solving capabilities of state-of-the-art large language models. While general-purpose systems such as LLaMA-3 and GPT-4o perform poorly on this task, ad-hoc fine-tuning seems to improve models' performance. However, we find that performance gains from training are largely driven by memorization. Our results suggest that rebus solving remains a challenging test bed for evaluating large language models' linguistic proficiency and sequential instruction-following skills.
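
To make the multi-step nature of the task concrete, the following is a minimal sketch of how a verbalized rebus might be solved mechanically once every clue has been answered. It rests on assumptions that are not taken from the source: that a verbalized rebus is a sequence of literal letters and bracketed textual clues, and that a solution key lists the word lengths of the hidden phrase. The function name, the input format, and the English toy puzzle are all hypothetical and are not drawn from the dataset.

```python
from typing import Dict, List, Sequence, Tuple


def solve_verbalized_rebus(
    tokens: Sequence[str],
    clue_answers: Dict[str, str],
    key: Sequence[int],
) -> Tuple[str, str]:
    """Hypothetical two-step solver for a toy verbalized rebus.

    tokens       -- literal letters, or clues written as "[clue text]" (assumed format)
    clue_answers -- answer for each clue; a model would have to produce these itself
    key          -- word lengths of the hidden phrase (assumed solution key)
    """
    # Step 1: resolve every bracketed clue, keeping literal letters as they are.
    first_pass: List[str] = []
    for tok in tokens:
        if tok.startswith("[") and tok.endswith("]"):
            first_pass.append(clue_answers[tok[1:-1]])
        else:
            first_pass.append(tok)

    # Step 2: concatenate all pieces and check them against the key.
    letters = "".join(first_pass).lower()
    if len(letters) != sum(key):
        raise ValueError("first pass does not match the solution key")

    # Step 3: re-segment the letter sequence according to the key.
    words: List[str] = []
    pos = 0
    for length in key:
        words.append(letters[pos:pos + length])
        pos += length

    return " ".join(first_pass), " ".join(words)


# Invented English toy puzzle, for illustration only.
toy_tokens = ["B", "[cereal grain]", "S", "[jump on one foot]"]
toy_answers = {"cereal grain": "oat", "jump on one foot": "hop"}
first_pass, solution = solve_verbalized_rebus(toy_tokens, toy_answers, (4, 4))
print(first_pass)
print(solution)
```

In this sketch the clue answers are supplied by hand, so only the bookkeeping is automated. A language model has to answer each clue correctly, keep the pieces in order, and respect the key, which is why an early mistake carries through to the final phrase.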