CL, LG | Oct 14, 2024

How to Leverage Demonstration Data in Alignment for Large Language Model? A Self-Imitation Learning Perspective

Teng Xiao, Mingxiao Li, Yige Yuan, Huaisheng Zhu, Chao Cui, Vasant G Honavar

arXiv:2410.10093v116.229 citations | h-index: 8 | Has Code | EMNLP

Originality: Incremental advance

AI Analysis

This work addresses the challenge of efficiently fine-tuning large language models for alignment, offering a lightweight method that could benefit AI practitioners, though it appears incremental as it builds on existing imitation learning approaches.

The paper addresses aligning large language models with offline demonstration data. It introduces a generalized self-imitation learning (GSIL) framework that replaces the adversarial training of standard imitation learning with simple classification losses, and it reports consistent gains over baselines on HumanEval, GSM8K, and MT-Bench.

This paper introduces a novel generalized self-imitation learning ($\textbf{GSIL}$) framework, which effectively and efficiently aligns large language models with offline demonstration data. We develop $\textbf{GSIL}$ by deriving a surrogate objective of imitation learning with density ratio estimates, facilitating the use of self-generated data and optimizing the imitation learning objective with simple classification losses. $\textbf{GSIL}$ eliminates the need for complex adversarial training in standard imitation learning, achieving lightweight and efficient fine-tuning for large language models. In addition, $\textbf{GSIL}$ encompasses a family of offline losses parameterized by a general class of convex functions for density ratio estimation and enables a unified view of alignment with demonstration data. Extensive experiments show that $\textbf{GSIL}$ consistently and significantly outperforms baselines on many challenging benchmarks, such as coding (HumanEval), mathematical reasoning (GSM8K), and instruction following (MT-Bench).
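The abstract does not state the loss itself, so the following is only an illustrative sketch of how density ratio estimation can reduce to a classification loss. It assumes the ratio is parameterized by the log-probability gap between the fine-tuned policy and a frozen reference model, and it uses the logistic loss as one possible choice from the family of convex functions. The function names, the `beta` scale, and the input format are hypothetical and are not taken from the paper or its code.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def logistic_ratio_loss(demo_logp, demo_ref_logp, gen_logp, gen_ref_logp, beta=1.0):
    """Binary classification loss over sequence-level log-probabilities.

    Assumption (not from the source): the log density ratio of a response is
    beta * (log pi_theta(y|x) - log pi_ref(y|x)). Demonstrations are labeled
    positive and self-generated responses are labeled negative.
    """
    total = 0.0
    # Demonstration responses: push the log-ratio up.
    for lp, ref in zip(demo_logp, demo_ref_logp):
        total -= log_sigmoid(beta * (lp - ref))
    # Self-generated responses: push the log-ratio down.
    for lp, ref in zip(gen_logp, gen_ref_logp):
        total -= log_sigmoid(-beta * (lp - ref))
    return total / (len(demo_logp) + len(gen_logp))


# When the policy equals the reference, every log-ratio is zero,
# so the classifier is at chance and the loss equals log(2).
at_init = logistic_ratio_loss([-5.0], [-5.0], [-7.0], [-7.0])

# Raising the demonstration likelihood and lowering the self-generated
# likelihood relative to the reference reduces the loss.
after_step = logistic_ratio_loss([-4.0], [-5.0], [-8.0], [-7.0])
```

Swapping `log_sigmoid` for another convex surrogate would give a different member of such a loss family. Which functions the paper actually uses is not stated in the abstract.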

