c7b8bcb1685ab026.tex
1: \begin{abstract}
2:     Many of the recent successes in Deep Reinforcement Learning have been based on minimizing the squared Bellman error.
3:     However, training is often unstable due to fast-changing target $Q$-values, and target networks are employed to regularize the $Q$-value estimation and stabilize 
4:     training by using an additional set of lagging parameters. 
5:     Despite their advantages, target networks can be an inflexible way to regularize $Q$-values, which may ultimately slow down training.
6:     In this work, we address this issue by augmenting the squared Bellman error with a functional regularizer. Unlike target networks, the regularization we propose here is explicit and enables us to use up-to-date parameters as well as control the regularization. This leads to a faster yet more stable training method.
7:     We analyze the convergence of our method theoretically and empirically validate our predictions on simple environments as well as on a suite of Atari environments. We demonstrate empirical improvements over target-network-based methods in terms of both sample efficiency and performance. In summary, our approach provides a fast and stable alternative to the standard squared Bellman error. 
8:     
9:     % We analyze the convergence of our method theoretically in the linear Function Approximation setting and empirically validate our prediction on simple environments as well as on a suite of Atari environments.
10: \end{abstract}
11: