PMPO: Probabilistic Metric Prompt Optimization for Small and Large Language Models
PMPO addresses scalability issues in prompt optimization for both small and large language models, offering a more efficient alternative to fine-tuning.
The paper tackles the problem of inefficient prompt optimization by introducing PMPO, a framework that uses token-level cross-entropy to evaluate and select prompts without sampling full outputs, achieving the highest average accuracy on BBH and increasing AlpacaEval 2.0 win rates by over 19 points.
Prompt optimization is a practical and widely applicable alternative to fine-tuning for improving large language model performance. Yet many existing methods evaluate candidate prompts by sampling full outputs, often coupled with self-critique or human-annotated preferences, which limits scalability, especially for smaller models or models that are not instruction-tuned. We present PMPO (Probabilistic Metric Prompt Optimization), a unified framework that uses token-level cross-entropy as a direct, lightweight evaluation signal. PMPO locates low-quality prompt segments via a masking-based analysis and iteratively rewrites them to propose improved variants. Crucially, during evaluation, PMPO selects among variants by minimizing loss in a single forward pass, eliminating output sampling and human or judge-based scoring for selection while still using standard generation only to propose rewrites. This unified, loss-based strategy supports both supervised and preference-based tasks. Across model sizes and datasets, PMPO outperforms prior prompt optimizers: it achieves the highest average accuracy on BBH, performs strongly on GSM8K and AQUA-RAT, and raises AlpacaEval 2.0 win rates by over 19 points. These results demonstrate PMPO's effectiveness, efficiency, and broad applicability.
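To make the loss-based selection and masking-based analysis described above more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical scoring interface `logprob_fn(prompt, input, target_tokens)` that returns the log-probability of each reference token from one teacher-forced forward pass; `toy_logprob_fn` is an invented stand-in so the example runs without a real model, and the whitespace tokenization, the names `dataset_loss`, `select_prompt`, and `segment_impact`, and the "remove one segment and compare loss" reading of masking are all illustrative assumptions.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface (an assumption, not from the paper): returns
# log p(token | prompt, input, earlier target tokens) for every target token.
LogProbFn = Callable[[str, str, Sequence[str]], Sequence[float]]
Example = Tuple[str, str]  # (input text, reference output)


def mean_cross_entropy(logprobs: Sequence[float]) -> float:
    """Token-level cross-entropy: mean negative log-likelihood of the reference."""
    if not logprobs:
        raise ValueError("reference output has no tokens")
    return -sum(logprobs) / len(logprobs)


def dataset_loss(prompt: str, examples: Sequence[Example], logprob_fn: LogProbFn) -> float:
    """Average per-example cross-entropy of the references under a prompt."""
    losses = [
        mean_cross_entropy(logprob_fn(prompt, x, y.split()))  # whitespace tokens: an assumption
        for x, y in examples
    ]
    return sum(losses) / len(losses)


def select_prompt(candidates: Sequence[str], examples: Sequence[Example],
                  logprob_fn: LogProbFn) -> str:
    """Pick the candidate with the lowest loss; no outputs are sampled."""
    return min(candidates, key=lambda p: dataset_loss(p, examples, logprob_fn))


def segment_impact(segments: Sequence[str], examples: Sequence[Example],
                   logprob_fn: LogProbFn) -> List[float]:
    """Loss increase when each segment is removed from the prompt.

    A value near zero or below suggests the segment contributes little and is
    a candidate for rewriting (one possible reading of masking-based analysis).
    """
    full = dataset_loss(" ".join(segments), examples, logprob_fn)
    impacts = []
    for i in range(len(segments)):
        masked = " ".join(s for j, s in enumerate(segments) if j != i)
        impacts.append(dataset_loss(masked, examples, logprob_fn) - full)
    return impacts


def toy_logprob_fn(prompt: str, x: str, target_tokens: Sequence[str]) -> List[float]:
    """Toy stand-in for a language model: a target token is 'likely' if it
    already appears in the prompt or the input. For illustration only."""
    context = set((prompt + " " + x).lower().split())
    return [math.log(0.9 if t.lower() in context else 0.1) for t in target_tokens]


examples = [("2 + 2", "answer : 4"), ("3 + 5", "answer : 8")]
candidates = [
    "Solve the problem.",
    "Solve the problem. Reply as answer : <number>",
]
best = select_prompt(candidates, examples, toy_logprob_fn)
print("selected prompt:", best)
print("segment impacts:", segment_impact(
    ["Solve the problem.", "Reply as answer : <number>"], examples, toy_logprob_fn))
```

With a real model, `logprob_fn` would be replaced by a teacher-forced forward pass over the prompt, input, and reference output, reading off the log-probabilities of the reference tokens; the selection logic itself would stay the same.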