LG OC ML | Feb 15, 2025

Which Features are Best for Successor Features?

arXiv:2502.10790v19.42 citations | h-index: 1

Originality: Incremental advance

AI Analysis

This work addresses a foundational question in reinforcement learning: which base features to use for successor features. It provides a theoretical criterion for feature selection, though the advance is incremental because it builds on existing Laplacian eigenfunction methods.

The paper identifies optimal base features for universal successor features, which enable zero-shot adaptation to new tasks in reinforcement learning. It finds that the same features are optimal for three different classes of downstream tasks and that, under specific conditions, they can be derived from the Laplacian operator.

In reinforcement learning, universal successor features (SFs) are a way to provide zero-shot adaptation to new tasks at test time: they provide optimal policies for all downstream reward functions lying in the linear span of a set of base features. But it is unclear what constitutes a good set of base features that could be useful for a wide set of downstream tasks beyond their linear span. Laplacian eigenfunctions (the eigenfunctions of $Δ+Δ^\ast$, with $Δ$ the Laplacian operator of some reference policy and $Δ^\ast$ that of the time-reversed dynamics) have been argued to play a role, and offer good empirical performance. Here, for the first time, we identify the optimal base features based on an objective criterion of downstream performance, in a non-tautological way, without assuming the downstream tasks are linear in the features. We do this for three generic classes of downstream tasks: reaching a random goal state, dense random Gaussian rewards, and random "scattered" sparse rewards. The features yielding optimal expected downstream performance turn out to be the same for these three task families. They do not coincide with Laplacian eigenfunctions in general, though they can be expressed in terms of $Δ$: in the simplest case (deterministic environment and decay factor $γ$ close to $1$), they are the eigenfunctions of $Δ^{-1}+(Δ^{-1})^\ast$. We obtain these results under an assumption of large behavior cloning regularization with respect to a reference policy, a setting often used for offline RL. Along the way, we get new insights into KL-regularized natural policy gradient, and into the lack of SF information in the norm of Bellman gaps.
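To make the two feature families in the abstract concrete, here is a minimal NumPy sketch on a small finite state space. It is an illustration under stated assumptions, not the paper's implementation: the convention $Δ = I - γP$, the adjoint taken in $L^2(ρ)$ with $ρ$ the stationary distribution of the reference policy, the random transition matrix, the choice of the smallest eigenvalues of $Δ+Δ^\ast$ and the largest of $Δ^{-1}+(Δ^{-1})^\ast$, and all function names are assumptions made for this example.

```python
import numpy as np


def transition_matrix(n_states: int, seed: int = 0) -> np.ndarray:
    """Random row-stochastic matrix standing in for the reference policy's dynamics."""
    rng = np.random.default_rng(seed)
    P = rng.random((n_states, n_states)) + 0.1  # strictly positive, hence ergodic
    return P / P.sum(axis=1, keepdims=True)


def stationary_distribution(P: np.ndarray) -> np.ndarray:
    """Left eigenvector of P for eigenvalue 1, normalized to sum to 1."""
    eigvals, eigvecs = np.linalg.eig(P.T)
    rho = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
    return rho / rho.sum()


def adjoint(A: np.ndarray, rho: np.ndarray) -> np.ndarray:
    """Adjoint of A in L^2(rho): A* = D^{-1} A^T D with D = diag(rho)."""
    return (A.T * rho) / rho[:, None]


def symmetrized_eigenfunctions(A, rho, k, largest):
    """Top-k eigenpairs of A + A*, which is self-adjoint in L^2(rho)."""
    S = A + adjoint(A, rho)
    sqrt_rho = np.sqrt(rho)
    # Conjugating by D^{1/2} turns S into an ordinary symmetric matrix.
    M = sqrt_rho[:, None] * S / sqrt_rho[None, :]
    M = 0.5 * (M + M.T)  # remove round-off asymmetry
    eigvals, U = np.linalg.eigh(M)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1] if largest else np.argsort(eigvals)
    idx = order[:k]
    # Map back: f = D^{-1/2} u is an eigenfunction of S, orthonormal in L^2(rho).
    return eigvals[idx], U[:, idx] / sqrt_rho[:, None]


gamma = 0.95
n_states = 6
P = transition_matrix(n_states)
rho = stationary_distribution(P)

laplacian = np.eye(n_states) - gamma * P  # assumed convention for the Laplacian
inverse = np.linalg.inv(laplacian)        # equals sum_t gamma^t P^t

# Laplacian eigenfunctions: smallest eigenvalues of Delta + Delta*.
lap_vals, lap_feats = symmetrized_eigenfunctions(laplacian, rho, k=3, largest=False)
# Features built from the inverse: largest eigenvalues of Delta^{-1} + (Delta^{-1})*.
inv_vals, inv_feats = symmetrized_eigenfunctions(inverse, rho, k=3, largest=True)
```

Each column of `lap_feats` and `inv_feats` is one candidate base feature, that is, a function on the states. When the chain is reversible, $Δ$ is self-adjoint in $L^2(ρ)$ and the two operators share eigenfunctions; for a generic non-reversible `P` such as the one sampled here, they need not, which is consistent with the abstract's remark that the optimal features do not coincide with Laplacian eigenfunctions in general.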
