\begin{abstract}
Upside-Down Reinforcement Learning (UDRL) is an approach to solving RL problems that does not require value functions and uses \emph{only} supervised learning, where the targets for given inputs in a dataset do not change over time \cite{schmidhuber2020reinforcement,srivastava2021training}.
Ghosh et al.~\cite{ghosh2020learning} proved that Goal-Conditioned Supervised Learning (GCSL), which can be viewed as a simplified version of UDRL, optimizes a lower bound on goal-reaching performance.
This raises expectations that such algorithms may enjoy guaranteed convergence to the optimal policy in arbitrary environments, similar to certain well-known traditional RL algorithms.
Here we show that for a specific \emph{episodic} UDRL algorithm (eUDRL, including GCSL), this is not the case, and we identify the causes of this limitation.
To do so, we first reformulate eUDRL as a recursive policy update.
This formulation allows us to disprove the convergence of eUDRL to the optimal policy for a wide class of stochastic environments.
Finally, we provide a concrete example of a very simple environment where eUDRL diverges.
Since the primary aim of this paper is to present a negative result, and the best counterexamples are the simplest ones, we restrict all discussions to finite (discrete) environments, ignoring issues of function approximation and limited sample size.
\end{abstract}