LG · CL · May 6

Reinforcement Learning for Compositional Generalization with Outcome-Level Optimization

arXiv:2605.04920 · 79.8
Predicted impact: top 15% in LG · last 90 days · Originality: Incremental advance
AI Analysis

For NLP models struggling with compositional generalization, this paper demonstrates that RL can outperform supervised fine-tuning, offering a new training paradigm to address a known bottleneck.

This work shows that outcome-level reinforcement learning, specifically using Group Relative Policy Optimization, improves compositional generalization over supervised fine-tuning on multiple benchmarks, with RL models reducing overfitting to frequent compositions and improving performance on complex composition types.

Compositional generalization refers to the ability to correctly interpret novel combinations of known primitives, and it remains a major challenge. Existing approaches often rely on supervised fine-tuning, which encourages models to imitate target outputs. This token-level training paradigm fails to capture the global compositional structure required for generalizing to unseen combinations. In this work, we investigate whether compositional generalization can instead be improved through outcome-level reinforcement learning. We adopt Group Relative Policy Optimization to optimize models based on feedback on their final outputs. Within this framework, we explore both a simple binary outcome reward and a composite reward that provides additional composition feedback. Experiments on multiple compositional benchmarks show that reinforcement learning improves compositional generalization compared to supervised fine-tuning. Further analysis reveals that supervised models tend to overfit frequent training compositions, whereas reinforcement learning improves compositional generalization by reshaping the output distribution, particularly for more complex composition types.
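
To make the outcome-level setup concrete, here is a minimal sketch of how a binary outcome reward could be turned into group-relative advantages in the style of Group Relative Policy Optimization. It is an illustration under stated assumptions, not the paper's implementation: the function names, the whitespace-normalized exact-match check, the use of the population standard deviation, the `eps` constant, and the example strings are all hypothetical. The composite reward is left out because its form is not specified in this summary.

```python
from statistics import mean, pstdev


def binary_reward(prediction: str, target: str) -> float:
    """Outcome-level reward: 1.0 if the whole output matches the target, else 0.0.

    Assumption: matching is exact after whitespace normalization.
    """
    return 1.0 if prediction.split() == target.split() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the group sampled for the same input.

    Assumption: population standard deviation, with eps guarding against a
    zero divisor when every sample in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled outputs for a single input.
target = "JUMP JUMP WALK"
samples = [
    "JUMP JUMP WALK",   # correct
    "JUMP WALK",        # drops one repetition
    "WALK JUMP JUMP",   # wrong order
    "JUMP  JUMP WALK",  # correct up to whitespace
]

rewards = [binary_reward(s, target) for s in samples]
advantages = group_relative_advantages(rewards)
```

In this sketch, samples that beat the group average receive positive advantages and the rest receive negative ones, so the learning signal depends on the final output as a whole rather than on token-by-token agreement with a reference. A group in which every sample gets the same reward yields all-zero advantages and therefore contributes no gradient signal.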

