LG AI CL CVAug 9, 2025

AMFT: Aligning LLM Reasoners by Meta-Learning the Optimal Imitation-Exploration Balance

arXiv:2508.06944v313.06 citationsh-index: 7Has Code

Originality Highly original

AI Analysis

This addresses the challenge of aligning LLM reasoners more effectively, offering a principled solution for researchers and practitioners in AI, though it is incremental in refining existing fine-tuning methods.

The paper tackles the problem of catastrophic forgetting and suboptimal trade-offs in fine-tuning LLMs for reasoning tasks by introducing AMFT, a single-stage algorithm that dynamically balances imitation and exploration using meta-learning, achieving state-of-the-art results on benchmarks like mathematical reasoning and vision-language navigation.

Large Language Models (LLMs) are typically fine-tuned for reasoning tasks through a two-stage pipeline of Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL), a process fraught with catastrophic forgetting and suboptimal trade-offs between imitation and exploration. Recent single-stage methods attempt to unify SFT and RL using heuristics, but lack a principled mechanism for dynamically balancing the two paradigms. In this paper, we reframe this challenge through the theoretical lens of \textbf{implicit rewards}, viewing SFT and RL not as distinct methods but as complementary reward signals. We introduce \textbf{Adaptive Meta Fine-Tuning (AMFT)}, a novel single-stage algorithm that learns the optimal balance between SFT's implicit, path-level reward and RL's explicit, outcome-based reward. The core of AMFT is a \textbf{meta-gradient adaptive weight controller} that treats the SFT-RL balance as a learnable parameter, dynamically optimizing it to maximize long-term task performance. This forward-looking approach, regularized by policy entropy for stability, autonomously discovers an effective training curriculum. We conduct a comprehensive evaluation on challenging benchmarks spanning mathematical reasoning, abstract visual reasoning (General Points), and vision-language navigation (V-IRL). AMFT consistently establishes a new state-of-the-art and demonstrats superior generalization on out-of-distribution (OOD) tasks. Ablation studies and training dynamic analysis confirm that the meta-learning controller is crucial for AMFT's stability, sample efficiency, and performance, offering a more principled and effective paradigm for LLM alignment. Our codes are open-sourced via https://github.com/hlxtsyj/AMFT.

View on arXiv PDF Code

Similar