
\begin{abstract}

In reinforcement learning (RL), offline learning decouples learning from data collection, which helps in handling the exploration-exploitation tradeoff and enables data reuse in many applications. In this work, we study two offline learning tasks: policy evaluation and policy learning.

For policy evaluation, we formulate the task as a stochastic optimization problem and show that it can be solved using approximate stochastic gradient descent (aSGD) with time-dependent data.

We show that aSGD achieves $\tilde O(1/t)$ convergence when the loss function is strongly convex, and that the rate is independent of the discount factor $\gamma$.

This result can be extended to algorithms that make approximately contractive iterations, such as TD(0).

The policy evaluation algorithm is then combined with policy iteration to learn the optimal policy. To achieve $\epsilon$ accuracy, the complexity of the algorithm is $\tilde O(\epsilon^{-2}(1-\gamma)^{-5})$, which matches the complexity bound for classic online RL algorithms such as Q-learning.

\end{abstract}
