Preference Robustness for DPO with Applications to Public Health
The work addresses alignment challenges in public health applications where objectives are ambiguous and data are limited, though the contribution is incremental because it builds on existing DPO and DRO methods.
The paper tackles the problem of fine-tuning LLMs to design reward functions for sequential resource allocation in public health, guided by human preferences. It proposes DPO-PRO, a robust DPO-based algorithm that improves robustness to noisy preferences and performs comparably to a prior self-reflection-based baseline at lower inference-time cost.
We study an LLM fine-tuning task for designing reward functions for sequential resource allocation problems in public health, guided by human preferences expressed in natural language. This setting presents a challenging testbed for alignment due to complex and ambiguous objectives and limited data availability. We propose DPO-PRO, a robust fine-tuning algorithm based on Direct Preference Optimization (DPO), which accounts for uncertainty in the preference distribution using a lightweight Distributionally Robust Optimization (DRO) formulation. Unlike prior DRO-based DPO methods, DPO-PRO is significantly less conservative. We evaluate DPO-PRO on a real-world maternal mobile health program operated by the non-profit organization ARMMAN, as well as on standard alignment benchmarks. Experimental results demonstrate that our method consistently improves robustness to noisy preference signals compared to existing DPO variants. Moreover, DPO-PRO achieves performance comparable to a prior self-reflection-based baseline for reward function design, while requiring significantly lower inference-time cost.
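To make the idea of robustness over the preference distribution concrete, the following is a minimal sketch, not the paper's formulation. The abstract does not specify the uncertainty set, so this example assumes a simple interval of half-width `eps` around a soft preference label `q`; the function names, `beta`, and `eps` are all hypothetical. Because the soft-label DPO loss is linear in `q`, the worst case over an interval is attained at one of its endpoints.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_margin(logp_a, logp_b, ref_logp_a, ref_logp_b, beta=0.1):
    """Standard DPO margin: beta times the difference of policy/reference log-ratios."""
    return beta * ((logp_a - ref_logp_a) - (logp_b - ref_logp_b))


def soft_dpo_loss(margin: float, q: float) -> float:
    """DPO loss with a soft label q = P(response a is preferred over b)."""
    return -(q * log_sigmoid(margin) + (1.0 - q) * log_sigmoid(-margin))


def robust_dpo_loss(margin: float, q: float, eps: float) -> float:
    """Worst-case soft DPO loss over p in [q - eps, q + eps], clipped to [0, 1].

    ASSUMPTION: the interval uncertainty set is chosen for illustration only;
    it is not taken from the paper. The loss is linear in p, so checking the
    two endpoints is sufficient.
    """
    lo = max(0.0, q - eps)
    hi = min(1.0, q + eps)
    return max(soft_dpo_loss(margin, lo), soft_dpo_loss(margin, hi))
```

With `eps = 0` the robust loss reduces to the soft-label DPO loss, and larger `eps` yields a more conservative (never smaller) objective. Limiting how much conservatism the uncertainty set introduces is the design concern the abstract raises about prior DRO-based DPO methods.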