CL · Dec 30, 2024

Plug-and-Play Training Framework for Preference Optimization

Peking U
arXiv:2412.20996v1 · 2 citations · h-index: 9
Originality: Incremental advance
AI Analysis

This work addresses a limitation of preference optimization for large language models on tasks with high accuracy requirements, such as mathematical reasoning. The advance is incremental because it builds on existing methods.

The paper observes that preference optimization methods do not account for the varying difficulty of training samples, which hurts performance in mathematical reasoning in particular. It proposes a plug-and-play training framework that assigns each sample a weight based on the model's output distribution, and reports consistent improvements on these tasks.

Recently, preference optimization methods such as DPO have significantly enhanced large language models (LLMs) across a wide range of tasks, including dialogue and question answering. However, current methods fail to account for the varying difficulty levels of training samples during preference optimization, leading to mediocre performance in tasks with high accuracy requirements, particularly mathematical reasoning. To address this limitation, we propose a novel training framework that uses multiple sampling to analyze output distributions, assigns different weights to samples, and incorporates these weights into the preference optimization process. This plug-and-play approach enables LLMs to prioritize challenging examples during training, improving learning efficiency. Experimental results demonstrate that our framework integrates seamlessly with various preference optimization methods and achieves consistent improvements in mathematical reasoning tasks.
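
The weighting idea in the abstract can be illustrated with a small Python sketch. This is not the paper's formulation: the abstract does not give the weighting rule, so the function `difficulty_weight`, its parameter `alpha`, and the choice of "fraction of sampled answers that match the reference" as the difficulty signal are all assumptions made for illustration. Only the per-pair DPO loss follows the standard published form.

```python
import math
from collections import Counter


def difficulty_weight(sampled_answers, reference_answer, alpha=1.0):
    """Hypothetical weighting rule (an assumption, not the paper's formula).

    Samples the model got right less often are treated as harder and
    receive a larger weight: weight = 1 + alpha * (1 - accuracy).
    """
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(sampled_answers)
    accuracy = counts[reference_answer] / len(sampled_answers)
    return 1.0 + alpha * (1.0 - accuracy)


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair: -log(sigmoid(margin))."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(m)) == log(1 + exp(-m)), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def weighted_dpo_loss(pairs, beta=0.1):
    """Mean of weight * DPO loss over a batch.

    Each element of `pairs` is a dict with the four log-probabilities
    and a precomputed "weight" entry.
    """
    total = 0.0
    for p in pairs:
        loss = dpo_loss(p["policy_chosen_logp"], p["policy_rejected_logp"],
                        p["ref_chosen_logp"], p["ref_rejected_logp"], beta)
        total += p["weight"] * loss
    return total / len(pairs)
```

Because the weight only multiplies the per-pair loss, the same wrapper could in principle be placed around other preference objectives, which is one way to read the "plug-and-play" claim.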

