\begin{abstract}
Upside-Down Reinforcement Learning (UDRL) is an approach for solving RL problems that does not require value functions and uses \emph{only} supervised learning, where the targets for given inputs in a dataset do not change over time \cite{schmidhuber2020reinforcement,srivastava2021training}.
Ghosh et al.~\cite{ghosh2020learning} proved that Goal-Conditioned Supervised Learning (GCSL), which can be viewed as a simplified version of UDRL, optimizes a lower bound on goal-reaching performance.
This raises expectations that such algorithms may enjoy guaranteed convergence to the optimal policy in arbitrary environments, similar to certain well-known traditional RL algorithms.
Here we show that for a specific \emph{episodic} UDRL algorithm (eUDRL, including GCSL), this is not the case, and we identify the causes of this limitation.
To do so, we first introduce a helpful rewrite of eUDRL as a recursive policy update.
This formulation helps to disprove eUDRL's convergence to the optimal policy for a wide class of stochastic environments.
Finally, we provide a concrete example of a very simple environment where eUDRL diverges.
Since the primary aim of this paper is to present a negative result, and the best counterexamples are the simplest ones, we restrict all discussions to finite (discrete) environments, ignoring issues of function approximation and limited sample size.
\end{abstract}