Semiparametric Preference Optimization: Your Language Model is Secretly a Single-Index Model
This work addresses link function misspecification in LLM alignment. For AI researchers and practitioners, it offers a more robust approach that builds incrementally on existing preference optimization methods.
The paper tackles the problem of aligning large language models (LLMs) to preference data without assuming a known link function, since a misspecified link can bias inferred rewards and misalign learned policies. It introduces a semiparametric single-index model, develops preference optimization algorithms that are robust to the unknown link, proves convergence guarantees for them, and demonstrates the approach empirically on LLM alignment.
Aligning large language models (LLMs) to preference data typically assumes a known link function between observed preferences and latent rewards (e.g., a logistic Bradley-Terry link). Misspecification of this link can bias inferred rewards and misalign learned policies. We study preference alignment under an unknown and unrestricted link function. We show that realizability of $f$-divergence-constrained reward maximization in a policy class induces a semiparametric single-index binary choice model, where a scalar policy-dependent index captures all dependence on demonstrations and the remaining preference distribution is unrestricted. Rather than assuming this model has identifiable finite-dimensional structural parameters and estimating them, as in econometrics, we focus on policy learning with the reward function implicit, analyzing error to the optimal policy and allowing for unidentifiable nonparametric indices. We develop preference optimization algorithms robust to the unknown link and prove convergence guarantees in terms of generic function complexity measures. We demonstrate this empirically on LLM alignment. Code is available at https://github.com/causalml/spo/
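To make the single-index structure concrete, the worked equations below sketch the familiar KL-regularized special case of the $f$-divergence constraint. The symbols $\pi_{\mathrm{ref}}$, $\beta$, $Z(x)$, $r$, and $\Psi$ are notation assumed here for illustration and are not taken from the paper. The sketch also assumes that preferences depend on rewards only through the reward difference.

```latex
% Assumed special case: KL-regularized reward maximization with reference policy \pi_ref.
\pi^\star \in \arg\max_{\pi}\;
  \mathbb{E}_{x}\Big[\, \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  - \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big) \Big]

% Closed-form maximizer, with normalizer Z(x).
\pi^\star(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big( r(x, y) / \beta \big)

% Inverting for the reward: it equals a policy log-ratio up to a term depending only on x.
r(x, y) = \beta \log \frac{\pi^\star(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)

% Preferences through an unknown link \Psi: the \beta \log Z(x) terms cancel,
% leaving a scalar, policy-dependent index inside \Psi.
\Pr( y_1 \succ y_2 \mid x )
  = \Psi\big( r(x, y_1) - r(x, y_2) \big)
  = \Psi\!\left( \beta \left[ \log \frac{\pi^\star(y_1 \mid x)}{\pi_{\mathrm{ref}}(y_1 \mid x)}
      - \log \frac{\pi^\star(y_2 \mid x)}{\pi_{\mathrm{ref}}(y_2 \mid x)} \right] \right)
```

Under these assumptions, choosing a logistic $\Psi$ gives the Bradley-Terry form mentioned in the abstract. Leaving $\Psi$ unspecified yields a single-index binary choice model in which the bracketed log-ratio difference is the only channel through which the policy enters.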