\begin{abstract}
In reinforcement learning (RL), offline learning decouples learning from data collection,
helps manage the exploration-exploitation tradeoff, and enables data reuse in many applications. In this work, we study two offline learning tasks: policy evaluation and policy learning.
We formulate policy evaluation as a stochastic optimization problem and show that it can be solved using approximate stochastic gradient descent (aSGD) with time-dependent data.
We show that aSGD
achieves $\tilde O(1/t)$ convergence when the loss function is strongly convex, and that this rate is independent of the discount factor $\gamma$.
This result extends to algorithms that make approximately contractive iterations, such as TD(0).
The policy evaluation algorithm is then combined with policy iteration to learn the optimal policy. To achieve $\epsilon$ accuracy, the resulting algorithm has complexity $\tilde O(\epsilon^{-2}(1-\gamma)^{-5})$, which matches the complexity bound for classic online RL algorithms such as Q-learning.
\end{abstract}