LG · May 13, 2025

InfoPO: On Mutual Information Maximization for Large Language Model Alignment

Teng Xiao, Zhen Ge, Sujay Sanghavi, Tian Wang, Julian Katz-Samuels, Marc Versage, Qingjun Cui, Trishul Chilimbi

arXiv:2505.08507v1 · 13 citations · h-index: 43 · NAACL

Originality: Incremental advance

AI Analysis

This work addresses the challenge of efficiently aligning LLMs, with a focus on reasoning-heavy applications, and represents an incremental improvement over existing preference optimization methods.

The paper tackles the problem of aligning large language models with human preferences by proposing InfoPO, a preference fine-tuning method that removes the reliance on the Bradley-Terry model, which the authors identify as a source of overfitting and suboptimal performance. Experiments show that InfoPO consistently outperforms established baselines on open benchmarks, especially on reasoning tasks.

We study the post-training of large language models (LLMs) with human preference data. Recently, direct preference optimization and its variants have shown considerable promise in aligning language models, eliminating the need for reward models and online sampling. Despite these benefits, these methods rely on explicit assumptions about the Bradley-Terry (BT) model, which makes them prone to overfitting and results in suboptimal performance, particularly on reasoning-heavy tasks. To address these challenges, we propose a principled preference fine-tuning algorithm called InfoPO, which effectively and efficiently aligns large language models using preference data. InfoPO eliminates the reliance on the BT model and prevents the likelihood of the chosen response from decreasing. Extensive experiments confirm that InfoPO consistently outperforms established baselines on widely used open benchmarks, particularly in reasoning tasks.
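For context on the Bradley-Terry reliance mentioned in the abstract, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is not the InfoPO objective, which this page does not state. The function names, the `beta` value, and the log-probabilities are illustrative assumptions.

```python
import math


def bt_preference_prob(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry probability that the chosen response is preferred."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # illustrative value, not taken from the paper
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Implicit rewards are beta-scaled log-ratios between the policy and a
    frozen reference model; the loss is the negative log of the
    Bradley-Terry probability of the observed preference.
    """
    reward_chosen = beta * (logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (logp_rejected - ref_logp_rejected)
    return -math.log(bt_preference_prob(reward_chosen, reward_rejected))


# Hypothetical sequence log-probabilities, for illustration only.
loss_at_reference = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Lower BOTH policy log-probabilities by the same amount: the margin is unchanged.
loss_both_lowered = dpo_loss(-15.0, -17.0, -10.0, -12.0)
```

Because this loss depends only on the difference between the two implicit rewards, `loss_both_lowered` equals `loss_at_reference` even though the chosen response has become less likely. This margin-only behavior is one way to read the abstract's concern about the likelihood of the chosen response decreasing.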
