CL CY Nov 9, 2025

SPA: Achieving Consensus in LLM Alignment via Self-Priority Optimization

Yue Huang, Xiangqi Wang, Xiangliang Zhang

arXiv:2511.06222v18.32 citations h-index: 8

Originality Highly original

AI Analysis

This paper addresses the problem of aligning LLMs for critical applications such as self-harm, legal, or medical queries. It offers a scalable and interpretable strategy, though it appears incremental: it builds on existing alignment methods and adds a novel ordering approach.

The paper tackles the conflict between trustworthiness and helpfulness in LLMs for high-stakes scenarios. It proposes priority alignment, a paradigm that enforces a 'trustworthy-before-helpful' ordering, and introduces Self-Priority Alignment (SPA), an unsupervised framework. In the reported experiments, SPA improves helpfulness without compromising safety and outperforms strong baselines.

In high-stakes scenarios, such as self-harm, legal, or medical queries, LLMs must be both trustworthy and helpful. However, these goals often conflict. We propose priority alignment, a new alignment paradigm that enforces a strict "trustworthy-before-helpful" ordering: optimization of helpfulness is conditioned on first meeting trustworthiness thresholds (e.g., harmlessness or honesty). To realize this, we introduce Self-Priority Alignment (SPA), a fully unsupervised framework in which the model generates diverse responses, self-evaluates and refines them, and applies dual-criterion denoising to remove inconsistency and control variance. From this, SPA constructs lexicographically ordered preference pairs and fine-tunes the model using an uncertainty-weighted alignment loss that emphasizes high-confidence, high-gap decisions. Experiments across multiple benchmarks show that SPA improves helpfulness without compromising safety, outperforming strong baselines while preserving general capabilities. Our results demonstrate that SPA provides a scalable and interpretable alignment strategy for critical LLM applications.
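
The "trustworthy-before-helpful" ordering described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Scored` record, the `priority_key` and `build_pairs` helpers, the score ranges, the threshold `tau`, and the rule for ranking responses that miss the threshold are all assumptions made for illustration. The sketch only shows how a lexicographic comparison might turn self-evaluated scores into (chosen, rejected) preference pairs.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Scored:
    """Hypothetical record: a response with two self-evaluated scores."""
    text: str
    trust: float    # assumed trustworthiness score in [0, 1]
    helpful: float  # assumed helpfulness score in [0, 1]


def priority_key(r: Scored, tau: float) -> tuple:
    """Sort key where a larger tuple means a more preferred response."""
    if r.trust >= tau:
        # Threshold met: compare on helpfulness only.
        return (1, r.helpful)
    # Threshold missed: rank below every passing response, then by trust
    # (an assumption; the abstract does not specify this case).
    return (0, r.trust)


def build_pairs(responses, tau=0.8):
    """Return (chosen, rejected) text pairs under the lexicographic order."""
    pairs = []
    for a, b in combinations(responses, 2):
        ka, kb = priority_key(a, tau), priority_key(b, tau)
        if ka == kb:
            continue  # no preference signal when the keys tie
        chosen, rejected = (a, b) if ka > kb else (b, a)
        pairs.append((chosen.text, rejected.text))
    return pairs


responses = [
    Scored("safe-helpful", trust=0.9, helpful=0.9),
    Scored("safe-terse", trust=0.95, helpful=0.4),
    Scored("unsafe-helpful", trust=0.3, helpful=1.0),
]
pairs = build_pairs(responses)
```

In this toy example, the response with the highest helpfulness score is still rejected against both others because it misses the assumed trust threshold, which is the behavior a strict ordering is meant to enforce. The denoising step and the uncertainty-weighted loss mentioned in the abstract are not modeled here.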
