AI | LG | Dec 29, 2025

InSPO: Unlocking Intrinsic Self-Reflection for LLM Preference Optimization

arXiv:2512.23126v3 | 6 citations | h-index: 3 | Has Code
Originality: Highly original
AI Analysis

This work addresses fundamental issues in LLM preference optimization for researchers and practitioners, offering a plug-and-play enhancement that improves alignment without architectural changes.

The paper addresses limitations of Direct Preference Optimization (DPO) in aligning Large Language Models. It proposes InSPO, which uses intrinsic self-reflection to condition the policy on both the context and alternative responses, and reports consistent improvements in win rates and length-controlled metrics.

Direct Preference Optimization (DPO) and its variants have become standard for aligning Large Language Models due to their simplicity and offline stability. However, we identify two fundamental limitations. First, the optimal policy depends on arbitrary modeling choices (scalarization function, reference policy), yielding behavior that reflects parameterization artifacts rather than true preferences. Second, treating response generation in isolation fails to leverage the comparative information in pairwise data, leaving the model's capacity for intrinsic self-reflection untapped. To address these limitations, we propose Intrinsic Self-reflective Preference Optimization (InSPO), which derives a globally optimal policy conditioned on both the context and alternative responses. We prove that this formulation is superior to DPO/RLHF while guaranteeing invariance to scalarization and reference choices. InSPO serves as a plug-and-play enhancement without architectural changes or inference overhead. Experiments demonstrate consistent improvements in win rates and length-controlled metrics, validating that unlocking self-reflection yields more robust, human-aligned LLMs. Our code is available at https://github.com/Skylanding/InSPO.
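
For readers less familiar with the baseline, the sketch below shows the standard per-pair DPO loss computed from sequence log-probabilities, followed by one possible way to expose an alternative response as extra context. It is an illustration only, not the authors' implementation: the abstract does not specify the InSPO objective or prompt format, so the helper `with_alternative`, its template text, and the default `beta=0.1` are assumptions made here.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single preference pair.

    Inputs are summed token log-probabilities of the chosen and rejected
    responses under the trained policy and the frozen reference policy.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def with_alternative(context, alternative):
    """Hypothetical template (an assumption, not taken from the paper):
    place an alternative response in the prompt so the policy can be
    scored while conditioned on both the context and that alternative."""
    return (f"{context}\n\n[Alternative response]\n{alternative}"
            f"\n\n[Improved response]\n")


if __name__ == "__main__":
    # With identical policy and reference log-probs the margin is zero.
    print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
    # A policy that favors the chosen response more than the reference does.
    print(dpo_loss(-5.0, -10.0, -8.0, -8.0))
    print(with_alternative("Explain DPO briefly.", "DPO is a loss."))
```

Under this assumed setup, the same `dpo_loss` could be evaluated with log-probabilities computed on the augmented prompt from `with_alternative` instead of the bare context, which matches the "no architectural changes" framing in the abstract; whether InSPO does exactly this is not stated in the text above.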

