
\begin{abstract}
Off-policy learning refers to the problem of learning the value function of one policy (a way of behaving) while following a different policy.
Gradient-based off-policy learning algorithms, such as GTD and TDC/GQ~\cite{Sutton09fastgradient}, converge even when using function approximation and incremental updates.
However, they have been developed for the case of a fixed behavior policy. In control problems, one would like to adapt the behavior policy over time so that it becomes more greedy with respect to the current value function.
In this paper, we present the first gradient-based learning algorithms for this problem, which rely on the policy-gradient framework to modify the behavior policy.
We present derivations of the algorithms, a convergence theorem, and empirical evidence showing that they compare favorably to existing approaches.
\end{abstract}
