Dynamic Policy Induction for Adaptive Prompt Optimization: Bridging the Efficiency-Accuracy Gap via Lightweight Reinforcement Learning
This work addresses the problem of computational waste in LLM prompting for researchers and practitioners, offering an incremental improvement through adaptive strategy selection.
The paper tackles the efficiency-accuracy trade-off in prompting strategies for Large Language Models by introducing the Prompt Policy Network, a lightweight reinforcement learning framework that adaptively selects strategies, achieving up to 61.5% token cost reduction compared to Self-Consistency while maintaining competitive accuracy on arithmetic reasoning benchmarks.
The performance of Large Language Models (LLMs) depends heavily on the chosen prompting strategy, yet static approaches such as Zero-Shot, Few-Shot, or Chain-of-Thought (CoT) impose a rigid efficiency-accuracy trade-off. Highly accurate strategies like Self-Consistency (SC) incur substantial computational waste on simple tasks, while lightweight methods often fail on complex inputs. This paper introduces the Prompt Policy Network (PPN), a lightweight reinforcement learning framework that formalizes adaptive strategy selection as a single-step Markov Decision Process (MDP). The PPN, trained with Proximal Policy Optimization (PPO) and guided by a resource-explicit reward function, learns to allocate costly reasoning strategies only when necessary. Experiments on arithmetic reasoning benchmarks demonstrate that PPN achieves superior performance on the efficiency-accuracy Pareto front, delivering up to 61.5% token cost reduction compared to Self-Consistency while maintaining competitive accuracy. This work contributes a systematic, adaptive framework for cost-efficient LLM deployment, advancing the design of lightweight optimization techniques for scalable and sustainable language model applications.