LG · Feb 5

On the Role of Iterative Computation in Reinforcement Learning

arXiv:2602.05999v2 · 1 citation · h-index: 5
AI Analysis

This work provides a formal framework and empirical evidence for the benefits of iterative computation in RL policies, which is significant for researchers and practitioners designing more efficient and generalizable RL agents.

This paper investigates how the amount of compute available to a reinforcement learning (RL) policy affects its learning and generalization. The authors propose a minimal architecture that can use a variable amount of compute and demonstrate that it achieves stronger performance and better generalization on longer-horizon tasks across 31 different RL tasks, outperforming standard networks with up to 5 times more parameters.

How does the amount of compute available to a reinforcement learning (RL) policy affect its learning? Can policies using a fixed number of parameters still benefit from additional compute? The standard RL framework does not provide a language to answer these questions formally. Empirically, deep RL policies are often parameterized as neural networks with static architectures, conflating the amount of compute and the number of parameters. In this paper, we formalize compute-bounded policies and prove that policies which use more compute can solve problems and generalize to longer-horizon tasks that are outside the scope of policies with less compute. Building on prior work in algorithmic learning and model-free planning, we propose a minimal architecture that can use a variable amount of compute. Our experiments complement our theory. On a set of 31 different tasks spanning online and offline RL, we show that this architecture (1) achieves stronger performance simply by using more compute, and (2) generalizes better to longer-horizon test tasks than standard feedforward networks or deep residual networks using up to 5 times more parameters.
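To make the distinction between compute and parameter count concrete, here is a minimal sketch of a weight-tied policy whose compute is set by an iteration count. It is an illustrative assumption, not the architecture proposed in the paper: the names `init_params`, `policy_logits`, `num_params`, and `n_iters`, the tanh update, and the re-injection of the encoded observation at every step are all hypothetical choices made for this example.

```python
import numpy as np


def init_params(obs_dim, hidden_dim, act_dim, seed=0):
    """Create one set of weights that is reused at every iteration."""
    rng = np.random.default_rng(seed)
    return {
        "W_in": rng.normal(0.0, 1.0 / np.sqrt(obs_dim), (obs_dim, hidden_dim)),
        "W_h": rng.normal(0.0, 1.0 / np.sqrt(hidden_dim), (hidden_dim, hidden_dim)),
        "b_h": np.zeros(hidden_dim),
        "W_out": rng.normal(0.0, 1.0 / np.sqrt(hidden_dim), (hidden_dim, act_dim)),
    }


def policy_logits(params, obs, n_iters):
    """Return action logits after n_iters applications of the shared block.

    Assumption for this sketch: the encoded observation is added back in
    at every iteration, and the hidden state starts at zero.
    """
    x = obs @ params["W_in"]          # encode the observation once
    h = np.zeros_like(x)              # initial hidden state
    for _ in range(n_iters):
        # The same W_h and b_h are applied each time, so extra iterations
        # add compute without adding parameters.
        h = np.tanh(h @ params["W_h"] + x + params["b_h"])
    return h @ params["W_out"]


def num_params(params):
    """Count scalar parameters; independent of n_iters by construction."""
    return sum(p.size for p in params.values())


if __name__ == "__main__":
    params = init_params(obs_dim=4, hidden_dim=16, act_dim=2)
    obs = np.ones(4)
    for k in (1, 4, 16):
        print(k, num_params(params), policy_logits(params, obs, k))
```

In this sketch the parameter count stays fixed while `n_iters` varies, which is the decoupling the abstract describes; a static feedforward network would instead need additional layers, and therefore additional parameters, to perform more sequential computation.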

