CL · LG · May 29

Scaling Multi-Hop Training Data via Graph-Constrained Path Selection

Pengyu Chen, Yonggang Zhang, Mingming Chen, Jun Song, Wei Xue, Yike Guo

arXiv:2605.31238 · 31.6 · Has Code

Predicted impact: top 7% in CL · last 90 days · Originality: Highly original

AI Analysis

This work provides a method for scaling multi-hop training data generation, which is crucial for endowing large language models with compositional reasoning over specialized documents, especially in domains with complex, templated text.

This paper addresses the challenge of creating large-scale multi-hop training data for LLMs from unannotated text, particularly in specialized corpora with repetitive structures. It decouples evidence path discovery from question-answer verbalization: paths are selected offline under graph constraints, which expands the usable corpus by 4.4x. Fine-tuning Qwen3-32B on 80K examples constructed from the CUAD legal contract corpus improved closed-book Token F1 from 21.66% to 38.58%.
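For readers unfamiliar with the metric, the sketch below shows one common way to compute a token-level F1 score between a predicted and a reference answer. It assumes SQuAD-style bag-of-tokens overlap with lowercasing and whitespace tokenization; the paper's exact normalization is not stated here and may differ, and the function name `token_f1` is illustrative.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between two strings (assumed SQuAD-style definition)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under this definition, a long prediction that contains every reference token gets full recall but is penalized on precision.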

Endowing large language models with compositional reasoning over specialized documents requires multi-hop training data at scale, yet such data rarely exists outside of curated benchmarks built on structured sources. To construct it directly from plain, unannotated text, existing methods ask a single teacher model to jointly discover an evidence path through a document and verbalize it as a question-answer pair. However, these methods degrade sharply when documents are structured around repetitive templates and densely cross-referencing clauses, conditions that characterize most real-world specialized corpora. In this work, we decouple the two operations: reasoning paths are enumerated offline over a graph of contextual keyword centroids, and the teacher is invoked only to verbalize pre-validated paths. The graph enforces five geometric admissibility constraints, for which we provide Gram-matrix arguments establishing that local similarity bounds alone admit endpoint drift up to ${\sim}91^{\circ}$, and that an upper similarity bound is necessary to exit dense embedding cliques formed by boilerplate text. A matched-size ablation isolates the mechanism: at equal training scale, constrained and unconstrained chains yield indistinguishable downstream performance, and the gain at full scale comes from a 4.4$\times$ expansion of the usable corpus rather than from higher per-chain quality. This reframes the role of graph constraints, in this setting, as raising teacher synthesizability rather than improving chain content. Fine-tuning Qwen3-32B on 80K examples constructed from the CUAD legal contract corpus improves closed-book Token F1 from 21.66% to 38.58%. Our code is available at https://github.com/hkgai-official/GCSCS.
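To make the offline path enumeration concrete, here is a minimal sketch, not the released implementation. The abstract does not list the five admissibility constraints, so the code assumes three illustrative ones: a lower and an upper cosine-similarity bound on each hop, and a cap on the angle between the two endpoints. All names (`enumerate_paths`, `lo`, `hi`, `max_drift_deg`, `unit`) are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def unit(deg):
    """2-D unit vector at the given angle; used only for toy examples."""
    rad = math.radians(deg)
    return (math.cos(rad), math.sin(rad))


def enumerate_paths(centroids, hops, lo, hi, max_drift_deg):
    """Enumerate simple paths with `hops` edges over centroid indices.

    Assumed (illustrative) constraints:
      * each hop satisfies lo <= cos(prev, next) <= hi
      * the endpoints are at most max_drift_deg apart in angle
    """
    n = len(centroids)
    # Gram matrix of pairwise cosine similarities.
    sim = [[cosine(centroids[i], centroids[j]) for j in range(n)]
           for i in range(n)]
    min_end_sim = math.cos(math.radians(max_drift_deg))
    paths = []

    def extend(path):
        if len(path) == hops + 1:
            # Global check: per-hop bounds alone cannot guarantee this.
            if sim[path[0]][path[-1]] >= min_end_sim:
                paths.append(tuple(path))
            return
        last = path[-1]
        for nxt in range(n):
            if nxt in path:
                continue  # keep paths simple (no repeated nodes)
            # The upper bound skips near-duplicates such as boilerplate.
            if lo <= sim[last][nxt] <= hi:
                extend(path + [nxt])

    for start in range(n):
        extend([start])
    return paths
```

As a toy usage example, take centroids at 0°, 1°, 40° and 80° with `lo = cos 45°` and `hi = cos 10°`. The 0° and 1° nodes are never joined directly, because their similarity exceeds `hi`. The path 0° → 40° → 80° passes both per-hop checks, yet its endpoints are 80° apart, since angles along a path can add up. It is kept with `max_drift_deg=90` and rejected with `max_drift_deg=60`.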
