Optimal oracle inequalities for solving projected fixed-point equations
This research provides a theoretical characterization of the optimal error for solving projected fixed-point equations, which is significant for researchers developing and analyzing stochastic approximation schemes in areas like reinforcement learning and computational methods for differential/integral equations.
This paper investigates methods for solving linear fixed-point equations in Hilbert spaces using random observations and searching over a low-dimensional subspace. It provides instance-dependent upper bounds on the mean-squared error for a linear stochastic approximation scheme with Polyak-Ruppert averaging, showing the bound consists of an approximation error and a statistical error term. The authors also establish lower bounds using information theory, demonstrating that these terms are unimprovable in an instance-dependent sense, and apply these findings to characterize the optimality of temporal difference learning methods.
Linear fixed point equations in Hilbert spaces arise in a variety of settings, including reinforcement learning, and computational methods for solving differential and integral equations. We study methods that use a collection of random observations to compute approximate solutions by searching over a known low-dimensional subspace of the Hilbert space. First, we prove an instance-dependent upper bound on the mean-squared error for a linear stochastic approximation scheme that exploits Polyak--Ruppert averaging. This bound consists of two terms: an approximation error term with an instance-dependent approximation factor, and a statistical error term that captures the instance-specific complexity of the noise when projected onto the low-dimensional subspace. Using information theoretic methods, we also establish lower bounds showing that both of these terms cannot be improved, again in an instance-dependent sense. A concrete consequence of our characterization is that the optimal approximation factor in this problem can be much larger than a universal constant. We show how our results precisely characterize the error of a class of temporal difference learning methods for the policy evaluation problem with linear function approximation, establishing their optimality.