Solving and Generating NPR Sunday Puzzles with Large Language Models
This work addresses puzzle solving and puzzle generation with large language models, a topic of interest to AI researchers. Its contribution is incremental, since it builds on existing models and datasets.
The study evaluates large language models on solving and generating NPR Sunday Puzzles. GPT-3.5 achieved 50.2% loose accuracy when solving puzzles, but it failed to generate valid puzzles: the answers it produced did not conform to the rules it generated.
We explore the ability of large language models to solve and generate puzzles from the NPR Sunday Puzzle game show using PUZZLEQA, a dataset comprising 15 years of on-air puzzles. We evaluate four large language models on PUZZLEQA in both multiple-choice and free-response formats, and we explore two prompt engineering techniques to improve free-response performance: chain-of-thought reasoning and prompt summarization. We find that state-of-the-art large language models can solve many PUZZLEQA puzzles: the best model, GPT-3.5, achieves 50.2% loose accuracy. However, in our few-shot puzzle generation experiment, we find no evidence that models can generate puzzles: GPT-3.5 generates puzzles with answers that do not conform to the generated rules. Puzzle generation remains a challenging task for future work.
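The abstract reports "loose accuracy" for free-response answers but does not define it here. As a minimal sketch, assuming that a loose match means the normalized gold answer appears as a whole-word span anywhere in the model's response (and that a strict match requires the normalized strings to be equal), the scoring could look like the following. The function names `normalize`, `strict_match`, `loose_match`, and `accuracy`, as well as the example pairs, are hypothetical and are not taken from the paper.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def strict_match(response: str, gold: str) -> bool:
    """Assumed strict criterion: normalized strings are identical."""
    return normalize(response) == normalize(gold)


def loose_match(response: str, gold: str) -> bool:
    """Assumed loose criterion: gold answer occurs as a whole-word span."""
    gold_norm = normalize(gold)
    if not gold_norm:
        return False  # an empty gold answer cannot be matched
    pattern = r"\b" + re.escape(gold_norm) + r"\b"
    return re.search(pattern, normalize(response)) is not None


def accuracy(pairs, match) -> float:
    """Fraction of (response, gold) pairs accepted by `match`."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(match(r, g) for r, g in pairs) / len(pairs)


# Illustrative (made-up) model responses paired with gold answers.
EXAMPLE_PAIRS = [
    ("The answer is Paris.", "Paris"),  # loose hit, strict miss
    ("paris", "Paris"),                 # hit under both criteria
    ("London", "Paris"),                # miss under both criteria
    ("Comparison", "Paris"),            # substring only, not a whole word
]

if __name__ == "__main__":
    print("loose :", accuracy(EXAMPLE_PAIRS, loose_match))
    print("strict:", accuracy(EXAMPLE_PAIRS, strict_match))
```

Under this assumed definition, the loose criterion tolerates extra wording around the answer, which matters for chain-of-thought responses where the final answer is embedded in a longer explanation.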