2b9a260bb5abaa6f.tex
1: \begin{abstract}%
2:     We analyse quantile temporal-difference learning (QTD), a distributional reinforcement learning algorithm that has proven to be a key component in several successful large-scale applications of reinforcement learning.
3:     Despite these empirical successes, a theoretical understanding of QTD has proven elusive until now.
4:     Unlike classical TD learning, which can be analysed with standard stochastic approximation tools, QTD updates do not approximate contraction mappings, are highly non-linear, and may have multiple fixed points.
5:     The core result of this paper is a proof that QTD converges with probability 1 to the fixed points of a related family of dynamic programming procedures, putting QTD on firm theoretical footing.
6:     The proof establishes connections between QTD and non-linear differential inclusions through stochastic approximation theory and non-smooth analysis.
7: \end{abstract}
8: