Explaining Off-Policy Actor-Critic From A Bias-Variance Perspective
This paper offers theoretical insights for reinforcement learning practitioners, though the contribution is incremental because it builds on existing algorithms.
The paper explains off-policy actor-critic algorithms by decomposing the policy evaluation error into bias and variance components. It shows that Emphasizing Recent Experience sampling and 1/age weighted sampling reduce both bias and variance compared to uniform sampling.
Off-policy Actor-Critic algorithms have demonstrated phenomenal experimental performance but still require better explanations. To this end, we show that their policy evaluation error on the distribution of transitions decomposes into three terms: a Bellman error, a bias from policy mismatch, and a variance term from sampling. By comparing the magnitude of bias and variance, we explain the success of Emphasizing Recent Experience sampling and 1/age weighted sampling. Both sampling strategies yield smaller bias and variance and are hence preferable to uniform sampling.
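To make the two sampling strategies concrete, the following is a minimal sketch of how they might draw indices from a replay buffer. It is not taken from the paper. It assumes a buffer stored oldest-first, with the newest transition having age 1. The function names, the default values of `eta` and `c_min`, and the exact form of the Emphasizing Recent Experience window schedule are illustrative assumptions, based on how that strategy is commonly described.

```python
import numpy as np


def age_weighted_probs(n):
    """Sampling probabilities proportional to 1/age.

    Assumption: index 0 is the oldest transition and index n-1 the newest,
    so the newest transition has age 1 and the oldest has age n.
    """
    ages = np.arange(n, 0, -1, dtype=np.float64)
    weights = 1.0 / ages
    return weights / weights.sum()


def sample_age_weighted(n, batch_size, rng):
    """Draw batch_size buffer indices with probability proportional to 1/age."""
    return rng.choice(n, size=batch_size, replace=True, p=age_weighted_probs(n))


def ere_window(n, k, num_updates, eta=0.996, c_min=5000):
    """Size of the recent-data window for the k-th update of an update phase.

    Assumed schedule (hypothetical defaults): the window shrinks
    geometrically with k and never drops below c_min or exceeds n.
    """
    c_k = max(int(n * eta ** (k * 1000 / num_updates)), c_min)
    return min(c_k, n)


def sample_ere(n, batch_size, k, num_updates, rng, eta=0.996, c_min=5000):
    """Sample uniformly from the most recent ere_window(...) transitions."""
    c_k = ere_window(n, k, num_updates, eta, c_min)
    return rng.integers(n - c_k, n, size=batch_size)


def sample_uniform(n, batch_size, rng):
    """Baseline: every stored transition is equally likely."""
    return rng.integers(0, n, size=batch_size)
```

In this sketch both non-uniform samplers shift probability mass toward recent transitions, which were generated by policies closer to the current one. That is the mechanism through which the abstract attributes a smaller policy-mismatch bias to these strategies.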