CL | Aug 6, 2025

Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for Language Model Programs

Noah Ziems, Dilara Soylu, Lakshya A Agrawal, Isaac Miller, Liheng Lai, Chen Qian, Kaiqiang Song, Meng Jiang, Dan Klein, Matei Zaharia, Karel D'Oosterlinck, Christopher Potts

arXiv:2508.04660v1 | 2 citations | h-index: 20 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses how to optimize modular language model programs, a problem relevant to AI researchers and practitioners. It is an incremental advance that extends GRPO to multi-module settings.

The paper tackles the challenge of applying Group Relative Policy Optimization (GRPO) to modular AI systems that make multiple language model calls. The result is mmGRPO, which, when composed with automatic prompt optimization, improves accuracy by 11% on average across tasks relative to the post-trained LM and by 5% relative to prompt optimization alone.

Group Relative Policy Optimization (GRPO) has proven to be an effective tool for post-training language models (LMs). However, AI systems are increasingly expressed as modular programs that mix together multiple LM calls with distinct prompt templates and other tools, and it is not clear how best to leverage GRPO to improve these systems. We begin to address this challenge by defining mmGRPO, a simple multi-module generalization of GRPO that groups LM calls by module across rollouts and handles variable-length and interrupted trajectories. We find that mmGRPO, composed with automatic prompt optimization, improves accuracy by 11% on average across classification, many-hop search, and privacy-preserving delegation tasks against the post-trained LM, and by 5% against prompt optimization on its own. We open-source mmGRPO in DSPy as the dspy.GRPO optimizer.
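The abstract's core idea, grouping LM calls by module across rollouts and computing group-relative advantages, can be illustrated with a minimal sketch. This is not the paper's implementation and does not use the dspy.GRPO API. The function name `module_level_advantages`, the `(trace, reward)` data layout, the module names, and the choice to normalize by the group's standard deviation are all assumptions made for illustration. The sketch also groups by module name only, which is a simplification.

```python
from collections import defaultdict
from statistics import mean, pstdev


def module_level_advantages(rollouts, eps=1e-6):
    """Hypothetical helper: group LM calls by module and normalize rewards per group.

    rollouts: list of (trace, reward) pairs, where trace is the ordered list of
    module names invoked in that rollout. Traces may differ in length, and an
    interrupted rollout simply has a shorter trace.
    Returns {module_name: [(rollout_id, advantage), ...]}.
    """
    # Every module call inherits the final reward of the rollout it came from.
    groups = defaultdict(list)
    for rollout_id, (trace, reward) in enumerate(rollouts):
        for module_name in trace:
            groups[module_name].append((rollout_id, reward))

    # Group-relative advantage: center and scale rewards within each module's group.
    advantages = {}
    for module_name, members in groups.items():
        rewards = [reward for _, reward in members]
        mu = mean(rewards)
        sigma = pstdev(rewards)
        advantages[module_name] = [
            (rollout_id, (reward - mu) / (sigma + eps))
            for rollout_id, reward in members
        ]
    return advantages


# Illustrative rollouts of a two-module program (module names are made up).
rollouts = [
    (["query", "answer"], 1.0),           # complete rollout
    (["query"], 0.0),                     # interrupted before "answer"
    (["query", "query", "answer"], 0.0),  # variable length: two "query" calls
]
advantages = module_level_advantages(rollouts)
```

Because groups are formed per module rather than per rollout, the interrupted rollout still contributes a training signal to the `query` group even though it never reached `answer`.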
