On the Importance of Multistability for Horizon Generalization in Reinforcement Learning
Provides a theoretical framework and necessary condition (multistability) for horizon generalization, highlighting a fundamental limitation of current parallelizable RNN architectures for long-horizon RL.
The paper formalizes temporal horizon generalization in RL, proving that multistability is necessary for a policy to remain optimal across all horizons, and that modern parallelizable RNNs (state space models, gated linear RNNs) fail because they are monostable. In simple tasks, multistability is sufficient; complex tasks also require transient dynamics.
In reinforcement learning (RL), agents acting in partially observable Markov decision processes (POMDPs) must rely on memory, typically encoded in a recurrent neural network (RNN), to integrate information from past observations. Long-horizon POMDPs, in which the relevant observation and the optimal action are separated by many time steps (called the horizon), are particularly challenging: training suffers from poor generalization, severe sample inefficiency, and prohibitive exploration costs. Ideally, an agent trained on short horizons would retain optimal behavior at arbitrarily longer ones, but no formal framework currently characterizes when this is achievable. To fill this gap, we formalized temporal horizon generalization, the property that a policy remains optimal for all horizons, derived a necessary and sufficient condition for it, and experimentally evaluated the ability of nonlinear and parallelizable RNN variants to achieve it. This paper presents the resulting theoretical framework, the empirical evaluation, and the dynamical interpretation linking RNN behavior to temporal horizon generalization. Our analyses reveal that multistability is necessary for temporal horizon generalization and, in simple tasks, sufficient; more complex tasks further require transient dynamics. In contrast, modern parallelizable architectures, namely state space models and gated linear RNNs, are monostable by construction and consequently fail to generalize across temporal horizons. We conclude that multistability and transient dynamics are two essential and complementary dynamical regimes for horizon generalization, and that no current parallelizable RNN exhibits both. Designing parallelizable architectures that combine these regimes thus emerges as a key direction for scalable long-horizon RL.