CL · Mar 13

PrefPO: Pairwise Preference Prompt Optimization

Rahul Singhal, Pradyumna Tambwekar, Karime Maamari

arXiv:2603.19311 · 55.0 · h-index: 10

AI Analysis

This paper addresses automated prompt optimization for AI practitioners, offering a method that works in both labeled and unlabeled settings, though it builds incrementally on RLHF-inspired approaches.

The paper tackles labor-intensive prompt engineering by introducing PrefPO, a pairwise preference-based prompt optimization method that reduces the need for labeled data and hyperparameter tuning. It matches or exceeds SOTA methods on 6/9 BIG-Bench Hard tasks, performs comparably to TextGrad on IFEval-Hard (82.4% vs 84.5%), and improves prompt hygiene, reducing length and repetition issues by 3-5x.

Prompt engineering is effective but labor-intensive, motivating automated optimization methods. Existing methods typically require labeled datasets, which are often unavailable, and produce verbose, repetitive prompts. We introduce PrefPO, a minimal prompt optimization approach inspired by reinforcement learning from human feedback (RLHF). Its preference-based approach reduces the need for labeled data and hyperparameter tuning: only a starting prompt and natural language criteria are needed. PrefPO uses an LLM discriminator to express pairwise preferences over model outputs and provide feedback to an LLM optimizer, iteratively improving performance. We evaluate PrefPO on 9 BIG-Bench Hard (BBH) tasks and IFEval-Hard, a newly curated, challenging subset of IFEval. PrefPO matches or exceeds SOTA methods, including GEPA, MIPRO, and TextGrad, on 6/9 tasks and performs comparably to TextGrad on IFEval-Hard (82.4% vs 84.5%). Unlike other methods, PrefPO can optimize in both labeled and unlabeled settings. Without labels, PrefPO closely matches its labeled performance on 6/9 tasks, proving effective without ground truth. PrefPO also improves prompt hygiene: we find existing methods produce prompts 14.7x their original length or with 34% repetitive content; PrefPO reduces these issues by 3-5x. Furthermore, both LLM and human judges rate PrefPO's prompts higher than TextGrad's. Finally, we identify prompt hacking in prompt optimizers, where methods game evaluation criteria, and find PrefPO is susceptible at half the rate of TextGrad (37% vs 86%), generating fewer brittle, misaligned prompts.
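
The discriminator/optimizer loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the two-candidate scheme (a current best prompt and one challenger), the iteration count, and the toy stand-ins for the LLM calls are all assumptions made for illustration. In practice each callable would wrap a real model call.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
Generate = Callable[[str, str], str]                  # (prompt, task_input) -> model output
Discriminate = Callable[[str, List[str], List[str]], Tuple[int, str]]
#   (criteria, outputs_a, outputs_b) -> (index of preferred set: 0 or 1, textual feedback)
Optimize = Callable[[str, str], str]                  # (prompt, feedback) -> revised prompt


def preference_loop(
    start_prompt: str,
    criteria: str,
    inputs: List[str],
    generate: Generate,
    discriminate: Discriminate,
    optimize: Optimize,
    n_iters: int = 3,
) -> str:
    """Iteratively refine a prompt using pairwise preferences over outputs.

    No labels are used: the only supervision is the natural language criteria
    passed to the discriminator.
    """
    best = start_prompt
    # Assumed bootstrap step: ask the optimizer for a first challenger.
    challenger = optimize(best, f"Propose a variant that better satisfies: {criteria}")
    for _ in range(n_iters):
        outs_best = [generate(best, x) for x in inputs]
        outs_chal = [generate(challenger, x) for x in inputs]
        # The discriminator compares the two output sets and explains its choice.
        winner, feedback = discriminate(criteria, outs_best, outs_chal)
        if winner == 1:
            best = challenger
        # The optimizer rewrites the preferred prompt using the feedback.
        challenger = optimize(best, feedback)
    return best


# Toy, deterministic stand-ins for the three LLM roles (illustration only).
def toy_generate(prompt: str, task_input: str) -> str:
    return f"{prompt} :: {task_input}"


def toy_discriminate(criteria: str, outs_a: List[str], outs_b: List[str]) -> Tuple[int, str]:
    # Prefer the output set that mentions the criteria keyword more often.
    score_a = sum(criteria in o for o in outs_a)
    score_b = sum(criteria in o for o in outs_b)
    winner = 1 if score_b > score_a else 0
    return winner, f"Mention '{criteria}' explicitly."


def toy_optimize(prompt: str, feedback: str) -> str:
    # Append a fixed instruction once; a real optimizer would act on the feedback text.
    return prompt if "concise" in prompt else prompt + " Be concise."
```

Calling `preference_loop("Answer the question.", "concise", ["2+2?", "Capital of France?"], toy_generate, toy_discriminate, toy_optimize)` exercises the loop end to end with the toy stand-ins. Keeping the three roles as plain callables means a labeled variant could be sketched by swapping in a discriminator that also sees ground-truth answers.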
