AI CL LG · Jul 26, 2025

PITA: Preference-Guided Inference-Time Alignment for LLM Post-Training

Sarat Chandra Bobbili, Ujwal Dinesha, Dheeraj Narasimha, Srinivas Shakkottai

arXiv:2507.20067v2 · 3 citations · h-index: 4

Originality: Incremental advance

AI Analysis

The work addresses the instability of reward model training in LLM alignment and offers end users a more efficient method for tasks such as mathematical reasoning and sentiment classification. It appears incremental, however, because it builds on existing inference-time alignment approaches.

The paper tackles the problem of aligning large language model (LLM) outputs with user preferences at inference time, without fine-tuning. It introduces PITA, a framework that eliminates the need for a pre-trained reward model and reduces computational cost.

Inference-time alignment enables large language models (LLMs) to generate outputs aligned with end-user preferences without further training. Recent post-training methods achieve this by using small guidance models to modify token generation during inference. These methods typically optimize a reward function KL-regularized by the original LLM taken as the reference policy. A critical limitation, however, is their dependence on a pre-trained reward model, which requires fitting to human preference feedback--a potentially unstable process. In contrast, we introduce PITA, a novel framework that integrates preference feedback directly into the LLM's token generation, eliminating the need for a reward model. PITA learns a small preference-based guidance policy to modify token probabilities at inference time without LLM fine-tuning, reducing computational cost and bypassing the pre-trained reward model dependency. The problem is framed as identifying an underlying preference distribution, solved through stochastic search and iterative refinement of the preference-based guidance model. We evaluate PITA across diverse tasks, including mathematical reasoning and sentiment classification, demonstrating its effectiveness in aligning LLM outputs with user preferences.
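
As a rough illustration of the KL-regularized setup the abstract describes (a generic sketch, not PITA's actual implementation): when a reward is maximized under a KL penalty toward a reference policy, the optimal token distribution has the standard closed form pi(a | s) proportional to pi_ref(a | s) * exp(Q(s, a) / beta). A small guidance model can therefore steer a frozen LLM by adding its scores to the LLM's token log-probabilities. The function name `guided_distribution`, the toy scores, and the temperature `beta` below are illustrative assumptions and do not come from the paper.

```python
import math


def guided_distribution(ref_logprobs, guidance_scores, beta=1.0):
    """Reweight reference-token log-probs by exp(score / beta), then renormalize.

    ref_logprobs:    log pi_ref(a | s) for each candidate token (frozen LLM).
    guidance_scores: hypothetical per-token scores from a small guidance model.
    beta:            KL-regularization strength; larger beta stays closer
                     to the reference policy.
    """
    logits = [lp + g / beta for lp, g in zip(ref_logprobs, guidance_scores)]
    # Subtract the max before exponentiating for numerical stability.
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Toy example with three candidate tokens; the guidance model favors token 1.
ref = [math.log(p) for p in (0.5, 0.3, 0.2)]
scores = [0.0, 1.0, 0.0]
probs = guided_distribution(ref, scores, beta=1.0)
```

With all scores equal to zero the function returns the reference distribution unchanged, and increasing `beta` shrinks the guidance model's influence, which is the trade-off the KL term controls.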

