LG · AI · Jun 11, 2021

Preferential Temporal Difference Learning

arXiv:2106.06508v2 · 11 citations
AI Analysis

This is an incremental improvement for reinforcement learning practitioners, addressing inefficiencies in value estimation by incorporating state-specific weights.

The paper re-weights states in Temporal-Difference (TD) updates according to their importance or the reliability of their value estimates, rather than by visitation alone. It proves convergence with linear function approximation and reports desirable empirical behaviour compared to other TD-style methods.

Temporal-Difference (TD) learning is a general and very useful tool for estimating the value function of a given policy, which in turn is required to find good policies. Generally speaking, TD learning updates states whenever they are visited. When the agent lands in a state, its value can be used to compute the TD-error, which is then propagated to other states. However, it may be interesting, when computing updates, to take into account other information than whether a state is visited or not. For example, some states might be more important than others (such as states which are frequently seen in a successful trajectory). Or, some states might have unreliable value estimates (for example, due to partial observability or lack of data), making their values less desirable as targets. We propose an approach to re-weighting states used in TD updates, both when they are the input and when they provide the target for the update. We prove that our approach converges with linear function approximation and illustrate its desirable empirical behaviour compared to other TD-style methods.
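The re-weighting idea in the abstract can be illustrated with a small tabular sketch. The abstract gives no update equations, so everything below is an assumption made for illustration: a per-state preference `beta[s]` in [0, 1], an eligibility trace that decays by `gamma * (1 - beta[s])` and accumulates `beta[s]` for the current state, and the hypothetical function name `ptd_update_episode`. It is one plausible way to weight a state both as the input of an update and as a bootstrap target, not necessarily the authors' exact algorithm.

```python
def ptd_update_episode(v, beta, transitions, alpha=0.1, gamma=1.0):
    """Preference-weighted TD sketch (assumed form, tabular case).

    v           -- list of state-value estimates, updated in place
    beta        -- list of per-state preferences in [0, 1] (assumed notation)
    transitions -- list of (state, reward, next_state, done) tuples
    """
    trace = [0.0] * len(v)
    for s, r, s_next, done in transitions:
        # A high-preference state cuts the trace (it is a trusted target);
        # a low-preference state lets credit flow through it.
        decay = gamma * (1.0 - beta[s])
        trace = [decay * e for e in trace]
        # A state is only updated in proportion to its own preference.
        trace[s] += beta[s]
        # Ordinary one-step TD error; terminal states have value 0.
        next_value = 0.0 if done else v[s_next]
        delta = r + gamma * next_value - v[s]
        for i in range(len(v)):
            v[i] += alpha * delta * trace[i]
    return v


if __name__ == "__main__":
    episode = [(0, 0.0, 1, False), (1, 1.0, 2, True)]
    # All preferences 1: the trace is one-hot, so this is plain TD(0).
    print(ptd_update_episode([0.0, 0.0, 0.0], [1.0, 1.0, 1.0], episode, alpha=0.5))
    # prints [0.0, 0.5, 0.0]
    # State 1 has preference 0: it is neither updated nor used to cut the trace,
    # so the reward is credited to state 0 instead.
    print(ptd_update_episode([0.0, 0.0, 0.0], [1.0, 0.0, 1.0], episode, alpha=0.5))
    # prints [0.5, 0.0, 0.0]
```

In this sketch, setting every preference to 1 recovers one-step TD, while a preference of 0 makes a state transparent: its estimate is left untouched and updates for earlier states reach past it toward later rewards and more trusted states.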

Code Implementations: 1 repo
