Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for Language Model Programs
This work addresses the optimization of modular language model programs, a problem relevant to AI researchers and practitioners. It is an incremental advance that extends GRPO to multi-module settings.
The paper tackles the challenge of applying Group Relative Policy Optimization (GRPO) to modular AI systems that make multiple language model calls. The resulting method, mmGRPO, when composed with automatic prompt optimization, improves accuracy by 11% on average across tasks relative to the post-trained LM and by 5% relative to prompt optimization alone.
Group Relative Policy Optimization (GRPO) has proven to be an effective tool for post-training language models (LMs). However, AI systems are increasingly expressed as modular programs that mix together multiple LM calls with distinct prompt templates and other tools, and it is not clear how best to leverage GRPO to improve these systems. We begin to address this challenge by defining mmGRPO, a simple multi-module generalization of GRPO that groups LM calls by module across rollouts and handles variable-length and interrupted trajectories. We find that mmGRPO, composed with automatic prompt optimization, improves accuracy by 11% on average across classification, many-hop search, and privacy-preserving delegation tasks against the post-trained LM, and by 5% against prompt optimization on its own. We open-source mmGRPO in DSPy as the dspy.GRPO optimizer.
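The grouping idea described above (collecting LM calls by module across rollouts, including rollouts that end early) can be illustrated with a minimal sketch. This is not the dspy.GRPO implementation. The rollout data layout, the helper names `group_calls_by_module` and `group_relative_advantages`, and the choice to give every call its rollout's final reward are all assumptions made for illustration. The advantage formula is the usual GRPO-style normalization of rewards within a group.

```python
from collections import defaultdict
from statistics import mean, pstdev


def group_calls_by_module(rollouts):
    """Collect LM calls from all rollouts into one group per module.

    Hypothetical layout: each rollout is a dict with a scalar "reward" and
    a list of "calls"; each call is a (module_name, prompt, completion)
    tuple. Rollouts may contain different numbers of calls.
    """
    groups = defaultdict(list)
    for rollout in rollouts:
        for module_name, prompt, completion in rollout["calls"]:
            groups[module_name].append(
                {
                    "prompt": prompt,
                    "completion": completion,
                    # Assumption: each call inherits its rollout's final reward.
                    "reward": rollout["reward"],
                }
            )
    return dict(groups)


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style normalization within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


rollouts = [
    {"reward": 1.0, "calls": [("search", "q1", "hop A"), ("answer", "ctx A", "yes")]},
    {"reward": 0.0, "calls": [("search", "q1", "hop B"), ("answer", "ctx B", "no")]},
    # Interrupted trajectory: the program stopped before the "answer" module.
    {"reward": 0.0, "calls": [("search", "q1", "hop C")]},
]

groups = group_calls_by_module(rollouts)
advantages = {
    name: group_relative_advantages([call["reward"] for call in calls])
    for name, calls in groups.items()
}
```

In this toy run the "search" group holds three calls while the "answer" group holds two, because the interrupted rollout contributes only to the module it reached. Each module's group is normalized separately, so a module's advantages depend only on the rollouts in which it was actually invoked.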