\begin{abstract}
% Despite empirical success, the theory of reinforcement learning with value function approximation remains fundamentally incomplete. Prior work has identified a variety of counterexamples illustrating unintuitive behaviours that arise when algorithms known to converge to the optimal policy in the tabular case are combined with function approximation. One prominent example is policy oscillation, wherein an algorithm cycles indefinitely among several policies rather than converging to a fixed point. We propose a unifying explanation for policy oscillation and other pathological behaviours that arise in algorithms that alternate between approximate on-policy evaluation and greedification. In short, the greedification step in such algorithms implicitly depends on the evaluation of states for which the value function is not optimized. The values used for greedification are therefore unreliable and can steer the policy in undesirable directions. In addition to policy oscillation, we show that the same phenomenon can cause other pathological behaviours, including convergence to the worst possible policy even when the optimal policy is representable in the class of greedy policies. We demonstrate analytically and experimentally that these pathological behaviours can arise both with and without bootstrapping, and with linear function approximation as well as with more complex parameterized functions such as neural networks.
% \end{abstract}
