AICLDec 18, 2025

A Solver-in-the-Loop Framework for Improving LLMs on Answer Set Programming for Logic Puzzle Solving

arXiv:2512.17093v14 citationsh-index: 10
Originality Incremental advance
AI Analysis

This work addresses the problem of improving LLMs for domain-specific language code generation, particularly for logic puzzle solving, but it is incremental as it builds on existing methods with specific enhancements.

The paper tackles the challenge of generating code for Answer Set Programming (ASP) using large language models (LLMs), which struggle due to limited pre-training examples, by introducing a solver-in-the-loop framework that uses solver feedback to guide instruction-tuning, resulting in consistent improvements across two datasets in two prompting settings.

The rise of large language models (LLMs) has sparked interest in coding assistants. While general-purpose programming languages are well supported, generating code for domain-specific languages remains a challenging problem for LLMs. In this paper, we focus on the LLM-based generation of code for Answer Set Programming (ASP), a particularly effective approach for finding solutions to combinatorial search problems. The effectiveness of LLMs in ASP code generation is currently hindered by the limited number of examples seen during their initial pre-training phase. In this paper, we introduce a novel ASP-solver-in-the-loop approach for solver-guided instruction-tuning of LLMs to addressing the highly complex semantic parsing task inherent in ASP code generation. Our method only requires problem specifications in natural language and their solutions. Specifically, we sample ASP statements for program continuations from LLMs for unriddling logic puzzles. Leveraging the special property of declarative ASP programming that partial encodings increasingly narrow down the solution space, we categorize them into chosen and rejected instances based on solver feedback. We then apply supervised fine-tuning to train LLMs on the curated data and further improve robustness using a solver-guided search that includes best-of-N sampling. Our experiments demonstrate consistent improvements in two distinct prompting settings on two datasets.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes