Online Prompt Selection for Program Synthesis
This addresses the challenge for non-expert users in program synthesis to optimize performance and cost by automating solver selection, though it is incremental as it applies existing bandit methods to a new application area.
The paper tackles the problem of selecting the best solver, LLM, and prompt combination for program synthesis tasks by framing it as an online learning problem and using a multi-armed bandit algorithm to maximize reward functions like solving time or number of tasks solved. The result is CYANEA, which solves 37.2% more queries than the best single solver and achieves results within 4% of the virtual best solver.
Large Language Models (LLMs) demonstrate impressive capabilities in the domain of program synthesis. This level of performance is not, however, universal across all tasks, all LLMs and all prompting styles. There are many areas where one LLM dominates, one prompting style dominates, or where calling a symbolic solver is a better choice than an LLM. A key challenge for the user then, is to identify not only when an LLM is the right choice of solver, and the appropriate LLM to call for a given synthesis task, but also the right way to call it. A non-expert user who makes the wrong choice, incurs a cost both in terms of results (number of tasks solved, and the time it takes to solve them) and financial cost, if using a closed-source language model via a commercial API. We frame this choice as an online learning problem. We use a multi-armed bandit algorithm to select which symbolic solver, or LLM and prompt combination to deploy in order to maximize a given reward function (which may prioritize solving time, number of synthesis tasks solved, or financial cost of solving). We implement an instance of this approach, called CYANEA, and evaluate it on synthesis queries from the literature in ranking function synthesis, from the syntax-guided synthesis competition, and fresh, unseen queries generated from SMT problems. CYANEA solves 37.2% more queries than the best single solver and achieves results within 4% of the virtual best solver.