AI | LG | Dec 22, 2025

Learning General Policies with Policy Gradient Methods

Simon Ståhlberg, Blai Bonet, Hector Geffner

arXiv:2512.19366v1 | 23.832 citations | h-index: 51 | KR

Originality: Incremental advance

AI Analysis

This paper addresses the challenge of reliable generalization in reinforcement learning for AI planning domains. The advance is incremental, as it builds on existing methods.

The paper tackles the problem of generalization in reinforcement learning by combining combinatorial planning methods with deep reinforcement learning to learn general policies, achieving performance nearly as good as combinatorial approaches while avoiding scalability issues.

While reinforcement learning methods have delivered remarkable results in a number of settings, generalization, i.e., the ability to produce policies that generalize in a reliable and systematic way, has remained a challenge. The problem of generalization has been addressed formally in classical planning where provably correct policies that generalize over all instances of a given domain have been learned using combinatorial methods. The aim of this work is to bring these two research threads together to illuminate the conditions under which (deep) reinforcement learning approaches, and in particular, policy optimization methods, can be used to learn policies that generalize like combinatorial methods do. We draw on lessons learned from previous combinatorial and deep learning approaches, and extend them in a convenient way. From the former, we model policies as state transition classifiers, as (ground) actions are not general and change from instance to instance. From the latter, we use graph neural networks (GNNs) adapted to deal with relational structures for representing value functions over planning states, and in our case, policies. With these ingredients in place, we find that actor-critic methods can be used to learn policies that generalize almost as well as those obtained using combinatorial approaches while avoiding the scalability bottleneck and the use of feature pools. Moreover, the limitations of the DRL methods on the benchmarks considered have little to do with deep learning or reinforcement learning algorithms, and result from the well-understood expressive limitations of GNNs, and the tradeoff between optimality and generalization (general policies cannot be optimal in some domains). Both of these limitations are addressed without changing the basic DRL methods by adding derived predicates and an alternative cost structure to optimize.
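
The idea of modeling a policy as a state transition classifier can be illustrated with a minimal sketch. The policy below assigns a probability to each successor state of the current state instead of choosing among ground actions, so the same scoring function applies to any instance of a domain. Everything here is an assumption made for illustration and not the authors' implementation: the toy domain (a state is an integer distance to the goal), the helper names `successors_of` and `toy_score`, and the hand-written score, which stands in for a learned GNN.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def transition_policy(state, successors, score):
    """Probability distribution over the successor states of `state`.

    The policy ranks transitions (state, successor) directly, so no
    instance-specific ground action names are involved.
    """
    return softmax([score(state, nxt) for nxt in successors])


def sample_transition(state, successors, score, rng=random):
    """Sample one successor state according to the transition policy."""
    probs = transition_policy(state, successors, score)
    return rng.choices(successors, weights=probs, k=1)[0]


# Hypothetical toy domain: a state is the distance to the goal, and each
# step moves one unit closer or one unit farther. State 0 is the goal.
def successors_of(n):
    return [n - 1, n + 1] if n > 0 else []


# Hypothetical stand-in for a learned GNN score: higher for successors
# that are closer to the goal than the current state.
def toy_score(state, nxt):
    return float(state - nxt)


probs = transition_policy(3, successors_of(3), toy_score)
print(probs)  # the successor closer to the goal gets the larger probability
```

Because the score depends only on the pair of states, the same policy can be evaluated on a state of any size in this toy domain; in an actor-critic setup, the hand-written score would be replaced by a trainable model and updated from sampled transitions.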
