From Rosetta to Match-Up: A Paired Corpus of Linguistic Puzzles with Human and LLM Benchmarks
This work provides a new benchmark for evaluating linguistic reasoning in humans and machines. The contribution is incremental, however, because it converts existing puzzles rather than introducing a fundamentally new task.
The authors built a paired dataset by converting existing Rosetta Stone puzzles into Match-Up counterparts, and found that both expert human solvers and LLMs exhibit an all-or-nothing solving pattern on Match-Up puzzles.
In this paper, we examine linguistic puzzles used in high school linguistics competitions, focusing on two common formats: Rosetta Stone and Match-Up. We propose a systematic procedure for converting existing Rosetta Stone puzzles into corresponding Match-Up counterparts. Because linguistic puzzle creation is complex and time-consuming, our method provides an efficient way to accelerate the generation of new puzzles. We evaluate the resulting Rosetta Stone-Match-Up pairs with both human participants and large language models (LLMs). Our results show that both expert human solvers and LLMs display an all-or-nothing pattern on Match-Up puzzles, either solving them completely or failing entirely. This work contributes a new dataset of paired puzzles and provides a detailed evaluation of puzzle difficulty across formats, offering insights into both human and machine linguistic reasoning.
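One way to make the all-or-nothing pattern concrete is to score each Match-Up attempt as the fraction of items matched correctly and then count how many attempts land at either extreme. The sketch below is illustrative only and is not the authors' evaluation code: the function names, the dictionary representation of a puzzle, and the toy labels (`w1`, `A`, and so on) are all assumptions made for the example.

```python
def match_up_score(gold, predicted):
    """Fraction of items whose predicted match equals the gold match.

    `gold` and `predicted` map each item to its chosen counterpart.
    This dictionary representation is an assumption for illustration.
    """
    if not gold:
        raise ValueError("gold mapping is empty")
    correct = sum(1 for item, answer in gold.items()
                  if predicted.get(item) == answer)
    return correct / len(gold)


def all_or_nothing_rate(scores):
    """Share of attempts that are either fully correct or fully wrong."""
    if not scores:
        raise ValueError("no scores given")
    extreme = sum(1 for s in scores if s == 0.0 or s == 1.0)
    return extreme / len(scores)


# Toy puzzle with placeholder labels (hypothetical, not from the dataset).
gold = {"w1": "A", "w2": "B", "w3": "C"}
attempts = [
    {"w1": "A", "w2": "B", "w3": "C"},  # every item matched correctly
    {"w1": "B", "w2": "C", "w3": "A"},  # every item matched incorrectly
    {"w1": "A", "w2": "C", "w3": "B"},  # one of three correct
]
scores = [match_up_score(gold, attempt) for attempt in attempts]
rate = all_or_nothing_rate(scores)
```

Under this scoring, a rate close to 1 across many attempts would correspond to the pattern the abstract describes, while a spread of intermediate scores would indicate partial solutions.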