
\begin{abstract}%
    We analyse quantile temporal-difference learning (QTD), a distributional reinforcement learning algorithm that has proven to be a key component in several successful large-scale applications of reinforcement learning.
    Despite these empirical successes, a theoretical understanding of QTD has proven elusive until now.
    Unlike classical TD learning, which can be analysed with standard stochastic approximation tools, QTD updates do not approximate contraction mappings, are highly non-linear, and may have multiple fixed points.
    The core result of this paper is a proof that QTD converges, with probability 1, to the fixed points of a related family of dynamic programming procedures, putting QTD on firm theoretical footing.
    The proof establishes connections between QTD and non-linear differential inclusions through stochastic approximation theory and non-smooth analysis.
\end{abstract}