How to Leverage Demonstration Data in Alignment for Large Language Model? A Self-Imitation Learning Perspective
This work addresses the challenge of efficiently fine-tuning large language models for alignment, offering a lightweight method that could benefit AI practitioners, though it appears incremental as it builds on existing imitation learning approaches.
The paper tackles the problem of aligning large language models with offline demonstration data by introducing a generalized self-imitation learning (GSIL) framework, which eliminates complex adversarial training and reports significant improvements over baselines on benchmarks such as HumanEval, GSM8K, and MT-Bench.
This paper introduces a novel generalized self-imitation learning ($\textbf{GSIL}$) framework, which effectively and efficiently aligns large language models with offline demonstration data. We develop $\textbf{GSIL}$ by deriving a surrogate objective of imitation learning with density ratio estimates, facilitating the use of self-generated data and optimizing the imitation learning objective with simple classification losses. $\textbf{GSIL}$ eliminates the need for complex adversarial training in standard imitation learning, achieving lightweight and efficient fine-tuning for large language models. In addition, $\textbf{GSIL}$ encompasses a family of offline losses parameterized by a general class of convex functions for density ratio estimation and enables a unified view of alignment with demonstration data. Extensive experiments show that $\textbf{GSIL}$ consistently and significantly outperforms baselines on many challenging benchmarks, such as coding (HumanEval), mathematical reasoning (GSM8K), and instruction following (MT-Bench).
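To make the idea of "density ratio estimation with simple classification losses" concrete, the following is a minimal sketch, not the paper's actual objective. It assumes a logistic (binary cross-entropy) loss, which is only one possible choice, and it assumes the classifier logit is a scaled log-ratio between the fine-tuned policy and a frozen reference model. The function name `logistic_ratio_loss`, the parameter `beta`, and the inputs are all hypothetical and introduced here purely for illustration.

```python
import math
from typing import Sequence


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def logistic_ratio_loss(
    demo_log_ratios: Sequence[float],
    self_log_ratios: Sequence[float],
    beta: float = 1.0,
) -> float:
    """Illustrative classification loss for density ratio estimation.

    Assumption (not taken from the paper): each input value is
    log pi_theta(y|x) - log pi_ref(y|x) for one response y, and the
    classifier logit is beta times that value.

    demo_log_ratios: log-ratios on demonstration responses (label 1).
    self_log_ratios: log-ratios on self-generated responses (label 0).
    """
    if not demo_log_ratios or not self_log_ratios:
        raise ValueError("both batches must be non-empty")
    # -log sigmoid(beta * r) == softplus(-beta * r): push demonstrations up.
    demo_term = sum(softplus(-beta * r) for r in demo_log_ratios) / len(demo_log_ratios)
    # -log(1 - sigmoid(beta * r)) == softplus(beta * r): push self-generated data down.
    self_term = sum(softplus(beta * r) for r in self_log_ratios) / len(self_log_ratios)
    return demo_term + self_term


if __name__ == "__main__":
    # Before any training the policy equals the reference, so all log-ratios are 0.
    print(logistic_ratio_loss([0.0, 0.0], [0.0, 0.0]))
    # A policy that favors demonstrations over its own samples gets a lower loss.
    print(logistic_ratio_loss([1.5, 2.0], [-1.0, -0.5]))
```

Under these assumptions no separate discriminator network is trained: the policy's own log-probabilities define the classifier, which is one way to read the abstract's claim that adversarial training is unnecessary. Swapping the logistic loss for another convex classification loss would give a different member of the loss family the abstract mentions.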