LG · Feb 22

Back to Blackwell: Closing the Loop on Intransitivity in Multi-Objective Preference Fine-Tuning

Peking U
arXiv:2602.19041v1
h-index: 9
Originality: Highly original
AI Analysis

This work addresses a core challenge in fine-tuning large language models from multi-objective feedback: it offers an approach that handles intransitive preferences without scalarization.

The paper tackles intransitive preferences in multi-objective preference fine-tuning by proposing a game-theoretic solution concept, the Maximum Entropy Blackwell Winner, and an efficient algorithm, PROSPER. PROSPER outperforms the baselines considered on instruction-following and general chat benchmarks, and the authors release model checkpoints at the 7B and 3B parameter scales.

A recurring challenge in preference fine-tuning (PFT) is handling $\textit{intransitive}$ (i.e., cyclic) preferences. Intransitive preferences often stem from either $\textit{(i)}$ inconsistent rankings along a single objective or $\textit{(ii)}$ scalarizing multiple objectives into a single metric. Regardless of their source, the downstream implication of intransitive preferences is the same: there is no well-defined optimal policy, breaking a core assumption of the standard PFT pipeline. In response, we propose a novel, game-theoretic solution concept -- the $\textit{Maximum Entropy Blackwell Winner}$ ($\textit{MaxEntBW}$) -- that is well-defined under multi-objective intransitive preferences. To enable computing MaxEntBWs at scale, we derive $\texttt{PROSPER}$: a provably efficient PFT algorithm. Unlike prior self-play techniques, $\texttt{PROSPER}$ directly handles multiple objectives without requiring scalarization. We then apply $\texttt{PROSPER}$ to the problem of fine-tuning large language models (LLMs) from multi-objective LLM-as-a-Judge feedback (e.g., rubric-based judges), a setting where both sources of intransitivity arise. We find that $\texttt{PROSPER}$ outperforms all baselines considered across both instruction-following and general chat benchmarks, and we release trained model checkpoints at the 7B and 3B parameter scales.
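
To make source (ii) concrete, the following is a minimal sketch of how scalarizing several objectives can produce a cyclic preference. It is not the paper's method or PROSPER. The three responses, their scores, and the majority-vote scalarization rule are all hypothetical, chosen only to reproduce the classic Condorcet-style cycle.

```python
# Hypothetical per-objective scores for three candidate responses
# (higher is better). Each objective alone gives a consistent ranking.
scores = {
    "A": (3, 1, 2),
    "B": (2, 3, 1),
    "C": (1, 2, 3),
}


def prefers(x, y):
    """Assumed scalarization: x beats y if it wins on a majority of objectives."""
    wins = sum(sx > sy for sx, sy in zip(scores[x], scores[y]))
    return wins > len(scores[x]) / 2


def condorcet_winner():
    """Return a response that beats every other response, or None if none exists."""
    for x in scores:
        if all(prefers(x, y) for y in scores if y != x):
            return x
    return None


if __name__ == "__main__":
    for x, y in [("A", "B"), ("B", "C"), ("C", "A")]:
        print(f"{x} beats {y}: {prefers(x, y)}")
    print("Response that beats all others:", condorcet_winner())
```

Under these assumed scores, A beats B, B beats C, and C beats A, so no single response is preferred to every other one. This is the sense in which a scalarized preference can leave no well-defined optimal policy, even though each individual objective ranks the responses consistently.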

