Generative Temporal Difference Learning for Infinite-Horizon Prediction
This work addresses a fundamental challenge in reinforcement learning: how agents can plan over extended horizons. It offers a hybrid approach that bridges model-free and model-based methods, though it is incremental in that it builds on existing concepts such as the successor representation.
The paper tackles long-term environment prediction in reinforcement learning by introducing the $\gamma$-model, a predictive model with an infinite probabilistic horizon. The $\gamma$-model generalizes model-based control procedures and is trained using a generative reinterpretation of temporal difference learning; the paper shows its utility in prediction and control tasks.
We introduce the $\gamma$-model, a predictive model of environment dynamics with an infinite probabilistic horizon. Replacing standard single-step models with $\gamma$-models leads to generalizations of the procedures central to model-based control, including the model rollout and model-based value estimation. The $\gamma$-model, trained with a generative reinterpretation of temporal difference learning, is a natural continuous analogue of the successor representation and a hybrid between model-free and model-based mechanisms. Like a value function, it contains information about the long-term future; like a standard predictive model, it is independent of task reward. We instantiate the $\gamma$-model as both a generative adversarial network and normalizing flow, discuss how its training reflects an inescapable tradeoff between training-time and testing-time compounding errors, and empirically investigate its utility for prediction and control.
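To make the link to the successor representation concrete, the following is a minimal tabular sketch, not the paper's GAN or normalizing-flow training procedure. It assumes a small discrete Markov chain under a fixed policy with a hypothetical transition matrix `P`, and it assumes the $\gamma$-model target is the normalized discounted occupancy $\mu = (1-\gamma)\sum_{k \ge 1} \gamma^{k-1} P^k$. The bootstrapped backup $\mu \leftarrow (1-\gamma)P + \gamma P \mu$ is the expected (dynamic-programming) form of a temporal difference update, shown here only to illustrate the fixed point. The value computation further assumes rewards are collected at the states being predicted; all function and variable names are illustrative.

```python
import numpy as np

def closed_form_gamma_model(P, gamma):
    """Discounted occupancy (1 - gamma) * sum_{k>=1} gamma^(k-1) P^k,
    computed as (1 - gamma) * P @ (I - gamma * P)^{-1}."""
    n = P.shape[0]
    return (1.0 - gamma) * P @ np.linalg.inv(np.eye(n) - gamma * P)

def bootstrapped_gamma_model(P, gamma, iters=500):
    """Repeatedly apply the expected TD-style backup
    mu <- (1 - gamma) * P + gamma * P @ mu,
    mixing the one-step model with the model's own prediction at the next state."""
    n = P.shape[0]
    mu = np.full((n, n), 1.0 / n)  # arbitrary initial guess: uniform rows
    for _ in range(iters):
        mu = (1.0 - gamma) * P + gamma * P @ mu
    return mu

# Hypothetical 3-state chain (each row is a next-state distribution).
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.5, 0.5],
              [0.3, 0.0, 0.7]])
gamma = 0.9

mu_exact = closed_form_gamma_model(P, gamma)
mu_td = bootstrapped_gamma_model(P, gamma)

# Assumed convention: V(s) = sum_{k>=1} gamma^(k-1) E[r(s_{t+k})],
# so the value is an expectation of reward under mu, rescaled by 1 / (1 - gamma).
r = np.array([0.0, 1.0, 2.0])  # hypothetical per-state rewards
V = mu_td @ r / (1.0 - gamma)
```

Each row of `mu_td` is a probability distribution over future states, which is why the same object can be sampled from like a model while still summarizing the long-term future like a value function. Swapping in a different reward vector `r` changes `V` without retraining `mu_td`, reflecting the reward independence noted in the abstract.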