LG · CL · Sep 16, 2025

When Inverse Data Outperforms: Exploring the Pitfalls of Mixed Data in Multi-Stage Fine-Tuning

arXiv:2509.13079v1 · 1 citation · h-index: 26 · EMNLP
Originality: Incremental advance
AI Analysis

The paper addresses a specific issue in AI alignment for bidirectional reasoning. The contribution is incremental because it builds on existing fine-tuning methods.

The paper tackles the problem that mixed reasoning data weakens performance in multi-stage fine-tuning. It constructs a reverse reasoning dataset (r1k) and shows that SFT on it improves accuracy by 1.6%--6.8% over forward data (s1k), whereas mixing forward and reverse data introduces conflicting supervision signals.

Existing work has shown that o1-level performance can be achieved with limited data distillation, but most existing methods focus on unidirectional supervised fine-tuning (SFT), overlooking the intricate interplay between diverse reasoning patterns. In this paper, we construct r1k, a high-quality reverse reasoning dataset derived by inverting 1,000 forward examples from s1k, and examine how SFT and Direct Preference Optimization (DPO) affect alignment under bidirectional reasoning objectives. SFT on r1k yields a 1.6%--6.8% accuracy improvement over s1k across evaluated benchmarks. However, naively mixing forward and reverse data during SFT weakens the directional distinction. Although DPO can partially recover this distinction, it also suppresses less preferred reasoning paths by shifting the probability mass toward irrelevant outputs. These findings suggest that mixed reasoning data introduce conflicting supervision signals, underscoring the need for robust and direction-aware alignment strategies.
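
The observation that DPO can recover the directional distinction while still suppressing reasoning paths is easier to see from the shape of the objective. The sketch below is the standard DPO loss for a single preference pair, not the authors' training code. The function name, the `beta` value, and all log-probabilities are hypothetical values chosen for illustration. It shows that the loss depends only on the margin between the chosen and rejected responses, so it can fall even while the log-probability of the chosen response falls too.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    All arguments except beta are sequence log-probabilities under the
    policy (pi_*) and the frozen reference model (ref_*).
    """
    # Implicit reward margin: how much more the policy favors the chosen
    # response over the rejected one, relative to the reference model.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    x = beta * margin
    # -log(sigmoid(x)) computed as softplus(-x) in a numerically stable form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))

# Hypothetical numbers: the reference model is indifferent between responses.
ref_chosen, ref_rejected = -10.0, -10.0

# Before preference training the policy equals the reference.
loss_start = dpo_loss(-10.0, -10.0, ref_chosen, ref_rejected)

# Afterwards BOTH responses are less likely, the rejected one much more so.
loss_after = dpo_loss(-12.0, -20.0, ref_chosen, ref_rejected)

print(f"loss at start: {loss_start:.3f}")
print(f"loss after:    {loss_after:.3f}")
```

In this toy case the loss decreases although the chosen response also became less likely. The probability mass removed from both responses has to go to other outputs, which is consistent with the abstract's remark about mass shifting toward irrelevant outputs.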

