CL | AI | Jun 9, 2025

Synthesis by Design: Controlled Data Generation via Structural Guidance

arXiv:2506.07664v21 citations | h-index: 7 | Has Code
Originality: Incremental advance
AI Analysis

This paper addresses low-quality synthetic data generation for mathematical reasoning in LLMs. It offers a dataset and a method that could benefit researchers, though the contribution appears incremental because it builds on existing synthesis approaches.

The paper aims to improve LLM mathematical reasoning by generating high-quality datasets with structural guidance. It produces 39K problems with labeled intermediate steps and a 6.1K-problem benchmark of higher difficulty, and shows that model performance declines as reasoning length increases.

Mathematical reasoning remains challenging for LLMs due to complex logic and the need for precise computation. Existing methods enhance LLM reasoning by synthesizing datasets through problem rephrasing, but face issues with generation quality and problem complexity. To address this, we propose to extract structural information with generated problem-solving code from mathematical reasoning and guide data generation with structured solutions. Applied to MATH and GSM8K, our approach produces 39K problems with labeled intermediate steps and a 6.1K-problem benchmark of higher difficulty. Results on our benchmark show that model performance declines as reasoning length increases. Additionally, we conducted fine-tuning experiments using the proposed training data on a range of LLMs, and the results validate the effectiveness of our dataset. We hope the proposed method and dataset will contribute to future research in enhancing LLM reasoning capabilities. Our code and data are available at https://github.com/OpenCausaLab/StructuralGeneration.
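
The idea of representing a solution as code and reading its structure can be illustrated with a toy sketch. This is not the authors' pipeline: the helper names `extract_steps` and `extend_solution` are hypothetical, and the sketch assumes a solution is a straight-line Python program in which each assignment is one labeled intermediate step. Under that assumption, the number of assignments serves as a stand-in for reasoning length, and appending a step yields a longer, harder variant.

```python
import ast


def extract_steps(solution_code: str) -> list[tuple[str, object]]:
    """Evaluate a straight-line solution program one assignment at a time.

    Returns (variable name, value) pairs in execution order. Each pair plays
    the role of a labeled intermediate step. Hypothetical helper, for
    illustration only; use it on trusted input.
    """
    env: dict[str, object] = {}
    steps: list[tuple[str, object]] = []
    for node in ast.parse(solution_code).body:
        is_simple_assign = (
            isinstance(node, ast.Assign)
            and len(node.targets) == 1
            and isinstance(node.targets[0], ast.Name)
        )
        if not is_simple_assign:
            continue  # ignore anything that is not `name = expression`
        expr = compile(ast.Expression(body=node.value), "<step>", "eval")
        value = eval(expr, {"__builtins__": {}}, env)
        env[node.targets[0].id] = value
        steps.append((node.targets[0].id, value))
    return steps


def extend_solution(solution_code: str, extra_step: str) -> str:
    """Append one more assignment, producing a variant with a longer chain."""
    return solution_code.rstrip() + "\n" + extra_step + "\n"


# A toy word problem expressed as code (invented for this sketch).
solution = """
apples = 3 * 4
eaten = 5
remaining = apples - eaten
"""

steps = extract_steps(solution)
harder = extend_solution(solution, "shared = remaining * 2")
harder_steps = extract_steps(harder)

print(len(steps), steps[-1])
print(len(harder_steps), harder_steps[-1])
```

In a real setting the extended program would be handed back to a generator to be verbalized as a new problem statement. That step is omitted here because it depends on the model and prompts used.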

