812f6db8cf335178.tex
1: \begin{abstract}
2: Off-policy learning refers to the problem of learning the value function of a way of behaving, or policy, while following a different policy. 
3: Gradient-based off-policy learning algorithms, such as GTD and TDC/GQ~\cite{Sutton09fastgradient}, converge even when using function approximation and incremental updates. 
4: However, they have been developed for the case of a fixed behavior policy. In control problems, one would like to adapt the behavior policy over time to become more greedy with respect to the existing value function. 
5: In this paper, we present the first gradient-based learning algorithms for this problem, which rely on the framework of policy gradient in order to modify the behavior policy. 
6: We present derivations of the algorithms, a convergence theorem, and empirical evidence showing that they compare favorably to existing approaches.
7: \end{abstract}
8: