
\begin{abstract}%   <- trailing '%' for backward compatibility of .sty file
  In this paper,
  we establish the global optimality and convergence rate of an off-policy actor-critic algorithm in the tabular setting without using a density ratio to correct the discrepancy between the state distribution of the behavior policy and that of the target policy.
  Our work goes beyond existing work on the optimality of policy gradient methods in that
  existing works use the exact policy gradient to update the policy parameters,
  whereas we use an approximate and stochastic update step.
  Our update step is not a gradient update because we do not use a density ratio to correct the state distribution,
  a choice that aligns well with common practice.
  Our update is approximate because we use a learned critic instead of the true value function.
  Our update is stochastic because at each step the update is performed only for the current state-action pair.
  Moreover,
  we remove several restrictive assumptions from existing works in our analysis.
  Central to our work is the finite-sample analysis of a generic stochastic approximation algorithm with time-inhomogeneous update operators on time-inhomogeneous Markov chains,
  based on its uniform contraction properties.
  \let\svthefootnote\thefootnote
  \let\thefootnote\relax\footnotetext{$^\dagger$ indicates equal advising.}
  \let\thefootnote\svthefootnote
\end{abstract}
