\begin{abstract}
The policy gradient theorem (Sutton et al.,\ 2000) prescribes the use of a cumulative discounted state distribution under the target policy to approximate the gradient.\
In practice, most algorithms based on this theorem break this assumption, introducing a distribution shift that can cause convergence to poor solutions.\
In this paper, we propose a new approach that reconstructs the policy gradient from the start state without requiring a particular sampling strategy.\
In this form, the policy gradient calculation can be simplified in terms of a \textsl{gradient critic}, which can be estimated recursively thanks to a new Bellman equation of gradients.\
By using temporal-difference updates of the gradient critic from an off-policy data stream, we develop the first estimator that side-steps the distribution shift issue in a model-free way.\
We prove that, under certain realizability conditions, our estimator is unbiased regardless of the sampling strategy.\
We empirically show that our technique achieves a superior bias-variance trade-off and performance in the presence of off-policy samples.\
The implementation of the experiments can be found at \href{https://github.com/SamuelePolimi/temporal-difference-gradient}{https://github.com/SamuelePolimi/temporal-difference-gradient}.
\end{abstract}
