\begin{abstract}
  \noindent In reinforcement learning (RL), the consideration of multivariate reward signals has led to fundamental advancements in multi-objective decision-making, transfer learning, and representation learning.
  This work introduces the first oracle-free and computationally tractable algorithms for provably convergent multivariate \emph{distributional} dynamic programming and temporal difference learning.
  Our convergence rates match the familiar rates in the scalar reward setting, and additionally provide new insights into the fidelity of approximate return distribution representations as a function of the reward dimension.
  Surprisingly, when the reward dimension is larger than $1$, we show that the standard analysis of categorical TD learning fails, and we resolve this with a novel projection onto the space of mass-$1$ signed measures.
  Finally, with the aid of our technical results and simulations, we identify tradeoffs between distribution representations that influence the performance of multivariate distributional RL in practice.
\end{abstract}