AI CL LG | Apr 7, 2025

Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use

Anna Goldie, Azalia Mirhoseini, Hao Zhou, Irene Cai, Christopher D. Manning

arXiv:2504.04736v2 | 30.138 citations | h-index: 132

Originality: Incremental advance

AI Analysis

This work addresses the need for better multi-step reasoning in AI systems. It offers a novel approach with demonstrated cross-task generalization, though it builds incrementally on existing reinforcement learning techniques.

The paper aims to improve large language models on complex reasoning and agentic tasks. It proposes Step-Wise Reinforcement Learning (SWiRL), a method that generates synthetic multi-step data and applies reinforcement learning to it, yielding relative accuracy gains of up to 21.5% over baselines on benchmarks such as GSM8K and HotPotQA.
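The gains quoted here are relative, not absolute percentage points. As a minimal sketch, assuming the usual definition of relative improvement (the page does not state the formula), the difference can be illustrated as follows. The function name `relative_gain` and the accuracies are hypothetical and are not results from the paper.

```python
def relative_gain(baseline_acc: float, new_acc: float) -> float:
    """Relative improvement of new_acc over baseline_acc, as a fraction."""
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (new_acc - baseline_acc) / baseline_acc


# Hypothetical accuracies for illustration only, not figures from the paper.
baseline, improved = 0.50, 0.60
absolute_points = (improved - baseline) * 100               # about 10 percentage points
relative_percent = relative_gain(baseline, improved) * 100  # about 20 percent relative
```

Under this definition, the same absolute gain reads as a larger relative gain when the baseline accuracy is low.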

Reinforcement learning has been shown to improve the performance of large language models. However, traditional approaches like RLHF or RLAIF treat the problem as single-step. As focus shifts toward more complex reasoning and agentic tasks, language models must take multiple steps of text generation, reasoning and environment interaction before generating a solution. We propose a synthetic data generation and RL methodology targeting multi-step optimization scenarios. This approach, called Step-Wise Reinforcement Learning (SWiRL), iteratively generates multi-step reasoning and tool use data, and then learns from that data. It employs a simple step-wise decomposition that breaks each multi-step trajectory into multiple sub-trajectories corresponding to each action by the original model. It then applies synthetic data filtering and RL optimization on these sub-trajectories. We evaluated SWiRL on a number of multi-step tool use, question answering, and mathematical reasoning tasks. Our experiments show that SWiRL outperforms baseline approaches by 21.5%, 12.3%, 14.8%, 11.1%, and 15.3% in relative accuracy on GSM8K, HotPotQA, CofCA, MuSiQue, and BeerQA, respectively. Excitingly, the approach exhibits generalization across tasks: for example, training only on HotPotQA (text question-answering) improves zero-shot performance on GSM8K (a math dataset) by a relative 16.9%.
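The step-wise decomposition described in the abstract can be sketched roughly as below. This is an illustration under stated assumptions, not the authors' implementation: the `Step` type, the `split_trajectory` function, and the newline-joined context format are all hypothetical, and the filtering and RL optimization stages are not shown.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Step:
    action: str                        # text the model generated at this step
    observation: Optional[str] = None  # environment response (e.g. tool output), if any


def split_trajectory(prompt: str, steps: List[Step]) -> List[Tuple[str, str]]:
    """Return one (context, target_action) pair per model action.

    The context for action i is assumed to be the prompt plus every earlier
    action and observation, joined with newlines.
    """
    pairs = []
    context = prompt
    for step in steps:
        pairs.append((context, step.action))
        # Extend the context with this step before moving to the next action.
        context += "\n" + step.action
        if step.observation is not None:
            context += "\n" + step.observation
    return pairs


# Hypothetical two-step trajectory: one tool call, then a final answer.
trajectory = [
    Step("search('capital of France')", "Paris"),
    Step("final answer: Paris"),
]
pairs = split_trajectory("Q: What is the capital of France?", trajectory)
```

A trajectory with k actions yields k sub-trajectories, so each action can be filtered and optimized separately given the context that preceded it.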
