LG, CL | Sep 16, 2025

Rethinking the Evaluation of Alignment Methods: Insights into Diversity, Generalisation, and Safety

arXiv:2509.12936v1
h-index: 2
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for holistic evaluation of alignment methods to guide the development of more balanced and reliable LLMs. Its contribution is incremental: it compares existing methods rather than introducing new ones.

The paper tackles the problem of evaluating alignment methods for large language models by proposing a unified framework that assesses trade-offs across five axes: factuality, safety, conciseness, proactivity, and diversity. It finds that DPO and KTO excel in factual accuracy, PPO and DPO lead in safety, and PPO best balances conciseness with proactivity.

Large language models (LLMs) require careful alignment to balance competing objectives: factuality, safety, conciseness, proactivity, and diversity. Existing studies focus on individual techniques or specific dimensions, lacking a holistic assessment of the inherent trade-offs. We propose a unified evaluation framework that compares LLM alignment methods (PPO, DPO, ORPO, KTO) across these five axes, using both in-distribution and out-of-distribution datasets. Leveraging a specialized LLM-as-Judge prompt, validated through human studies, we reveal that DPO and KTO excel in factual accuracy, PPO and DPO lead in safety, and PPO best balances conciseness with proactivity. Our findings provide insights into trade-offs of common alignment methods, guiding the development of more balanced and reliable LLMs.
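
To make the shape of such a multi-axis comparison concrete, here is a minimal Python sketch of how per-response judge scores might be averaged per method and axis, and how the top method on each axis could then be read off. It is not the paper's implementation: the function names `aggregate` and `best_per_axis`, the record layout, the "higher is better" convention, and the labels `method_a` and `method_b` are all assumptions made for illustration, and the scores in the usage note are placeholders, not results.

```python
from collections import defaultdict
from statistics import mean

# The five axes named in the abstract.
AXES = ("factuality", "safety", "conciseness", "proactivity", "diversity")


def aggregate(records):
    """Average judge scores per (method, axis).

    records: iterable of (method, axis, score) tuples (hypothetical layout).
    Returns {method: {axis: mean score}}; axes with no records are omitted.
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for method, axis, score in records:
        if axis not in AXES:
            raise ValueError(f"unknown axis: {axis}")
        buckets[method][axis].append(score)
    return {
        method: {axis: mean(scores) for axis, scores in per_axis.items()}
        for method, per_axis in buckets.items()
    }


def best_per_axis(table):
    """Return {axis: method with the highest mean score}.

    Assumes higher scores are better on every axis.
    """
    best = {}
    for axis in AXES:
        scored = {m: v[axis] for m, v in table.items() if axis in v}
        if scored:
            best[axis] = max(scored, key=scored.get)
    return best
```

With placeholder records such as `[("method_a", "safety", 1.0), ("method_a", "safety", 3.0), ("method_b", "safety", 4.0)]`, `aggregate` yields a mean safety score of 2.0 for `method_a`, and `best_per_axis` picks `method_b` on that axis. Running the same aggregation separately on in-distribution and out-of-distribution records would give one table per split.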

