1: \begin{abstract}
2: % Despite empirical success, the theory of reinforcement learning with value function approximation remains fundamentally incomplete. Prior work has identified a variety of pathological behaviours that arise in reinforcement learning algorithms that combine approximate on-policy evaluation and greedification. One prominent example is policy oscillation, wherein an algorithm may cycle indefinitely between policies rather than converging to a fixed point. In this paper, we identify a unifying phenomenon responsible for policy oscillation and other pathological behaviours. In short, these algorithms optimize the evaluation under the distribution of states that arise under the current policy, but may greedify based on the values of states that lie outside this distribution. The values used for greedification are therefore unreliable and can steer the policy in undesirable directions. In addition to policy oscillation, we show that this same phenomenon can cause other pathological behaviours, including convergence to the worst possible policy, even when the optimal policy is representable in the class of greedy policies. We demonstrate analytically and experimentally that these pathological behaviours can arise both with and without bootstrapping, and with linear function approximation as well as with more complex parameterized functions such as neural networks.
3: % \end{abstract}
4: