
\begin{abstract}
The policy gradient theorem (Sutton et al.,\ 2000) prescribes the use of a cumulative discounted state distribution under the target policy to approximate the gradient.\
Most algorithms based on this theorem break this assumption in practice, introducing a distribution shift that can cause convergence to poor solutions.\
In this paper, we propose a new approach that reconstructs the policy gradient from the start state without requiring a particular sampling strategy.\
In this form, the policy gradient calculation can be simplified in terms of a \textsl{gradient critic}, which can be estimated recursively thanks to a new Bellman equation of gradients.\
By using temporal-difference updates of the gradient critic from an off-policy data stream, we develop the first estimator that sidesteps the distribution shift issue in a model-free way.\
We prove that, under certain realizability conditions, our estimator is unbiased regardless of the sampling strategy.\
We empirically show that our technique achieves a superior bias-variance trade-off and performance in the presence of off-policy samples.
The implementation of the experiments can be found at \href{https://github.com/SamuelePolimi/temporal-difference-gradient}{https://github.com/SamuelePolimi/temporal-difference-gradient}.
\end{abstract}
