
\begin{abstract}

Statistical inference with finite-sample validity for the value function of a given policy in Markov decision processes (MDPs) is crucial for ensuring the reliability of reinforcement learning.

Temporal Difference (TD) learning, arguably the most widely used algorithm for policy evaluation, serves as a natural framework for this purpose.

In this paper, we study the consistency properties of TD learning with Polyak-Ruppert averaging and linear function approximation, and obtain three significant improvements over existing results.

First, we derive a novel sharp high-dimensional probability convergence guarantee that depends explicitly on the asymptotic variance and holds under weak conditions. Second, we establish refined high-dimensional Berry-Esseen bounds over the class of convex sets that guarantee faster rates than those in the literature.

Finally, we propose a plug-in estimator for the asymptotic covariance matrix, designed for efficient online computation.

These results enable the construction of confidence regions and simultaneous confidence intervals for the linear parameters of the value function, with guaranteed finite-sample coverage.

We demonstrate the applicability of our theoretical findings through numerical experiments.

\end{abstract}
