LG, May 7

On Training in Imagination

Nadav Timor, Ravid Shwartz-Ziv, Micah Goldblum, Yann LeCun, David Harel

arXiv:2605.06732

AI Analysis

For researchers in model-based reinforcement learning, this work provides theoretical insights and practical guidelines for allocating computational budget between dynamics and reward samples.

The paper analyzes how errors in learned dynamics and reward models affect returns in model-based RL. It derives the allocation between dynamics samples and reward samples that minimizes a bound on return error. It also shows that zero-mean reward noise leaves the REINFORCE gradient estimator unbiased while adding a variance term that decreases with the number of rollouts, which creates a practical tradeoff between the number and the quality of reward samples.

State-of-the-art model-based reinforcement learning methods train policies on imagined rollouts. These rollouts are trajectories generated by a learned dynamics model and scored by a learned reward model, without querying the true environment during policy updates. We study this training paradigm by quantifying how errors in learned dynamics and reward models affect returns and policy optimization. First, we extend the analysis of Asadi et al. (2018) to MDPs with learned reward models, and derive the optimal sample allocation: the ratio of dynamics samples to reward samples that minimizes a bound on return error under power-law scaling assumptions. We identify lower Lipschitz constants of the learned dynamics, reward, and policy as a representation desideratum that tightens this bound, and we connect this perspective to the temporal-straightening objective of Wang et al. (2026). Second, we examine how policy optimization with REINFORCE tolerates noisy rewards, which are often cheaper to obtain. We show that zero-mean reward noise leaves the gradient estimator unbiased and adds at most a variance term that decreases with the number of rollouts. This introduces a practical tradeoff: given a fixed budget, should one buy more rollouts with cheaper but noisier rewards, or fewer rollouts with more expensive but less noisy rewards? We reduce this choice to a one-dimensional optimization problem and characterize the optimum.
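
The claim about noisy rewards can be checked numerically on a toy problem. The sketch below is not the paper's setup: it assumes a three-armed bandit with a softmax policy, Gaussian reward noise that is independent of the action, and illustrative names (`exact_gradient`, `reinforce_estimate`, `bias_and_variance`) and parameter values chosen only for this example. Under those assumptions, the REINFORCE estimate averaged over many trials should stay close to the exact gradient even with noise, while its variance should grow with the noise level and shrink as the number of rollouts increases.

```python
import numpy as np


def softmax(theta):
    """Numerically stable softmax over a 1-D logit vector."""
    z = np.exp(theta - theta.max())
    return z / z.sum()


def exact_gradient(theta, rewards):
    """Exact gradient of E_{a~pi}[r(a)] w.r.t. the logits of a softmax policy."""
    pi = softmax(theta)
    return pi * (rewards - pi @ rewards)


def reinforce_estimate(theta, rewards, n_rollouts, noise_std, rng):
    """REINFORCE estimate from n_rollouts one-step rollouts with noisy rewards.

    Assumption for this toy: noise is zero-mean Gaussian and independent
    of the sampled action.
    """
    pi = softmax(theta)
    k = len(theta)
    actions = rng.choice(k, size=n_rollouts, p=pi)
    noisy_rewards = rewards[actions] + rng.normal(0.0, noise_std, size=n_rollouts)
    # Score function of a softmax policy: grad log pi(a) = onehot(a) - pi.
    scores = np.eye(k)[actions] - pi
    return (noisy_rewards[:, None] * scores).mean(axis=0)


# Hypothetical toy values, chosen only for illustration.
rng = np.random.default_rng(0)
theta = np.array([0.2, -0.1, 0.5])
rewards = np.array([1.0, 0.0, 2.0])
g_true = exact_gradient(theta, rewards)


def bias_and_variance(n_rollouts, noise_std, n_trials=4000):
    """Empirical bias norm and total variance of the estimator over n_trials."""
    estimates = np.stack([
        reinforce_estimate(theta, rewards, n_rollouts, noise_std, rng)
        for _ in range(n_trials)
    ])
    bias = np.linalg.norm(estimates.mean(axis=0) - g_true)
    total_variance = estimates.var(axis=0).sum()
    return bias, total_variance


if __name__ == "__main__":
    for n_rollouts, noise_std in [(8, 0.0), (8, 1.0), (64, 1.0)]:
        bias, var = bias_and_variance(n_rollouts, noise_std)
        print(f"N={n_rollouts:3d} sigma={noise_std:.1f} bias={bias:.4f} var={var:.4f}")
```

In this toy, the budget question from the abstract amounts to choosing between a row with small `noise_std` and few rollouts and a row with large `noise_std` and many rollouts. How cost depends on noise level is not specified here and would need to be taken from the paper.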
