CL | Feb 18, 2025

LongFaith: Enhancing Long-Context Reasoning in LLMs with Faithful Synthetic Data

Cehao Yang, Xueyuan Lin, Chengjin Xu, Xuhui Jiang, Shengjie Ma, Aofan Liu, Hui Xiong, Jian Guo

arXiv:2502.12583v2 | 18.815 citations | h-index: 12 | Has Code | ACL

Originality: Highly original

AI Analysis

This work addresses unreliable synthetic data for researchers and developers building long-context LLMs, though it is incremental in that it builds on existing data-centric approaches.

The paper improves long-context reasoning in LLMs by addressing faithfulness issues in synthetic training data, and reports significant performance gains on multi-hop reasoning datasets and LongBench.

Despite the growing development of long-context large language models (LLMs), data-centric approaches relying on synthetic data have been hindered by issues related to faithfulness, which limit their effectiveness in enhancing model performance on tasks such as long-context reasoning and question answering (QA). These challenges are often exacerbated by misinformation caused by lack of verification, reasoning without attribution, and potential knowledge conflicts. We propose LongFaith, a novel pipeline for synthesizing faithful long-context reasoning instruction datasets. By integrating ground truth and citation-based reasoning prompts, we eliminate distractions and improve the accuracy of reasoning chains, thus mitigating the need for costly verification processes. We open-source two synthesized datasets, LongFaith-SFT and LongFaith-PO, which systematically address multiple dimensions of faithfulness, including verified reasoning, attribution, and contextual grounding. Extensive experiments on multi-hop reasoning datasets and LongBench demonstrate that models fine-tuned on these datasets significantly improve performance. Our ablation studies highlight the scalability and adaptability of the LongFaith pipeline, showcasing its broad applicability in developing long-context LLMs.
