LG AI · Oct 28, 2025

Semi-Supervised Preference Optimization with Limited Feedback

arXiv:2511.00040v1 · 2 citations · h-index: 2
Originality: Incremental advance
AI Analysis

This work addresses the resource-intensive nature of preference optimization for language model alignment. It offers a more data-efficient approach: an incremental advance, but one that can meaningfully reduce the cost of acquiring preference labels.

The paper aims to reduce the reliance of preference optimization on large labeled datasets. It introduces Semi-Supervised Preference Optimization (SSPO), which combines a small set of labeled preference pairs with a large pool of unlabeled data to align models with human preferences at lower cost. In the reported experiments, SSPO trained with Llama3-8B-Instruct on 1% of UltraFeedback outperforms baselines trained on 10% of the dataset.

The field of preference optimization has made outstanding contributions to the alignment of language models with human preferences. Despite these advancements, recent methods still rely heavily on substantial paired (labeled) feedback data, leading to substantial resource expenditures. To address these challenges, we study the problem of Semi-Supervised Preference Optimization (SSPO) in which the idea is to learn from both a small number of pairwise preference labels and a large pool of unpaired samples simultaneously. Our key theoretical contribution proves the existence of an optimal reward threshold capable of separating winning and losing responses with high probability, which enables a principled pseudo-labeling of unpaired data. By leveraging these pseudo-labels, SSPO effectively distills latent preferences from large-scale unpaired data, thus maintaining human alignment while drastically reducing acquisition costs. Extensive experiments across datasets validate this remarkable data efficiency; for instance, SSPO trained with Llama3-8B-Instruct on just 1% of UltraFeedback consistently surpasses strong baselines trained on 10% of UltraFeedback.
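The thresholding idea in the abstract can be illustrated with a small sketch. This is not the paper's algorithm: the function names `pick_threshold` and `pseudo_label`, the scalar reward scores, and the rule of choosing the cut point that misclassifies the fewest labeled rewards are all assumptions made for illustration. It only shows how a single reward threshold, estimated from a few labeled responses, could assign pseudo-labels to unpaired responses.

```python
from typing import List, Tuple


def pick_threshold(win_rewards: List[float], lose_rewards: List[float]) -> float:
    """Return a cut point that misclassifies the fewest labeled rewards.

    Hypothetical helper: assumes each response already has a scalar reward.
    """
    values = sorted(set(win_rewards) | set(lose_rewards))
    # Candidate cut points: below all rewards, between adjacent distinct
    # rewards, and above all rewards.
    candidates = (
        [values[0] - 1.0]
        + [(a + b) / 2 for a, b in zip(values, values[1:])]
        + [values[-1] + 1.0]
    )

    def errors(t: float) -> int:
        # Winners at or below t and losers above t are misclassified.
        return sum(r <= t for r in win_rewards) + sum(r > t for r in lose_rewards)

    # On ties, min() keeps the first (lowest) candidate.
    return min(candidates, key=errors)


def pseudo_label(rewards: List[float], threshold: float) -> List[Tuple[float, str]]:
    """Label each unpaired reward as 'win' if it exceeds the threshold."""
    return [(r, "win" if r > threshold else "lose") for r in rewards]


if __name__ == "__main__":
    # Toy rewards for a handful of labeled responses (made-up numbers).
    labeled_wins = [2.0, 1.5, 1.2]
    labeled_losses = [0.1, -0.3, 0.6]
    t = pick_threshold(labeled_wins, labeled_losses)
    print(pseudo_label([1.8, 0.3, 1.0], t))
```

In this toy setup the labeled rewards are perfectly separable, so the chosen cut point falls between the highest losing reward and the lowest winning reward. The paper's result concerns a threshold that separates the two groups with high probability, which this sketch does not attempt to reproduce.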

