5b7f2e6a953d0e1e.tex
1: \begin{abstract}
2: In this paper, we study the convergence properties of off-policy policy optimization algorithms with state-action density ratio correction in the function approximation setting, where the objective function is formulated as a max-max-min problem.
3: We first characterize the bias of the learning objective and then present two strategies with finite-time convergence guarantees.
4: In our first strategy, we propose an algorithm called P-SREDA with a convergence rate of $O(\epsilon^{-3})$, whose dependence on $\epsilon$ is optimal.
5: In our second strategy, we design a new off-policy actor-critic-style algorithm named O-SPIM. We prove that O-SPIM converges to a stationary point with a total complexity of $O(\epsilon^{-4})$, which matches the convergence rate of some recent actor-critic algorithms in the on-policy setting.
6: \end{abstract}
7: