CL, LG
Feb 8, 2025

Evolving LLMs' Self-Refinement Capability via Synergistic Training-Inference Optimization

arXiv:2502.05605v6
9 citations
h-index: 12
Originality: Highly original
AI Analysis

This paper addresses the challenge of enabling LLMs to improve their own responses. The contribution is incremental but delivers strong, specific gains.

The paper tackles the problem that large language models lack an inherent self-refinement capability. It proposes EVOLVE, a framework that jointly optimizes training and inference to develop this ability over successive iterations. The evolved Llama-3.1-8B model surpasses GPT-4o, reaching a 62.3% length-controlled win rate on AlpacaEval 2 and 50.3% on Arena-Hard, and it also improves on the mathematical reasoning benchmarks GSM8K and MATH.

Self-Refinement refers to a model's ability to revise its own responses to produce improved outputs. This capability can also serve as a fundamental mechanism for Self-Improvement, for example, by reconstructing datasets with refined results to enhance intrinsic model performance. However, our comprehensive experiments reveal that large language models (LLMs) show no clear evidence of inherent Self-Refinement and may even experience response quality degradation after Self-Refinement. To address this issue, we propose EVOLVE, a simple and effective framework for eliciting and tracking the evolution of Self-Refinement through iterative training. We first explore optimization methods during training to activate the model's Self-Refinement capability. Then, at inference, we investigate various generation strategies to further enhance and utilize Self-Refinement while supplying the necessary data for training. Through synergistic optimization of training and inference stages, we continually evolve the model's Self-Refinement ability, enabling it to better refine its own responses. Moreover, we demonstrate the potential of leveraging Self-Refinement to achieve broader Self-Improvement of intrinsic model abilities. Experiments show that the evolved Self-Refinement ability enables the Llama-3.1-8B base model to surpass GPT-4o, achieving 62.3% length-controlled and 63.3% raw win rates on AlpacaEval 2, and 50.3% on Arena-Hard. It also generalizes effectively to out-of-domain reasoning tasks, improving performance on mathematical reasoning benchmarks such as GSM8K and MATH.
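The loop described in the abstract (draft a response, refine it, then feed the comparison back into training) can be illustrated with a minimal sketch. This is not the EVOLVE implementation: the function names `self_refine`, `build_preference_pairs`, and `refinement_gain` are hypothetical, the `generate`, `refine`, and `score` callables are stand-ins for a model and a judge, and the preference-pair format is an assumption about how refined outputs might be turned into training data.

```python
from typing import Callable, List, Tuple

Generate = Callable[[str], str]        # prompt -> draft response (stand-in for the model)
Refine = Callable[[str, str], str]     # (prompt, previous response) -> refined response
Score = Callable[[str, str], float]    # (prompt, response) -> quality score (stand-in judge)


def self_refine(prompt: str, generate: Generate, refine: Refine, rounds: int = 1) -> List[str]:
    """Return the initial draft followed by `rounds` successive refinements."""
    responses = [generate(prompt)]
    for _ in range(rounds):
        responses.append(refine(prompt, responses[-1]))
    return responses


def build_preference_pairs(
    prompts: List[str], generate: Generate, refine: Refine, score: Score
) -> List[Tuple[str, str, str]]:
    """Collect (prompt, chosen, rejected) triples for a later training round.

    Assumption: the higher-scoring of draft and refinement is treated as
    "chosen". Ties are skipped because they carry no preference signal.
    """
    pairs = []
    for prompt in prompts:
        draft, refined = self_refine(prompt, generate, refine, rounds=1)
        draft_score, refined_score = score(prompt, draft), score(prompt, refined)
        if refined_score > draft_score:
            pairs.append((prompt, refined, draft))
        elif draft_score > refined_score:
            pairs.append((prompt, draft, refined))
    return pairs


def refinement_gain(
    prompts: List[str], generate: Generate, refine: Refine, score: Score
) -> float:
    """Fraction of prompts whose refinement scores strictly higher than the draft."""
    if not prompts:
        return 0.0
    improved = 0
    for prompt in prompts:
        draft, refined = self_refine(prompt, generate, refine, rounds=1)
        if score(prompt, refined) > score(prompt, draft):
            improved += 1
    return improved / len(prompts)


# Toy stand-ins so the sketch runs without a model.
def toy_generate(prompt: str) -> str:
    return prompt.upper()


def toy_refine(prompt: str, previous: str) -> str:
    return previous + "!"


def toy_score(prompt: str, response: str) -> float:
    return float(len(response))


toy_pairs = build_preference_pairs(["hi", "ok"], toy_generate, toy_refine, toy_score)
toy_gain = refinement_gain(["hi", "ok"], toy_generate, toy_refine, toy_score)
```

Tracking something like `refinement_gain` across training iterations is one way to check the abstract's observation that refinement does not help by default, and can even degrade responses, before any training to elicit it.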
