1: \begin{abstract}% <- trailing '%' for backward compatibility of .sty file
2: In this paper,
3: we establish the global optimality and convergence rate of an off-policy actor-critic algorithm in the tabular setting without using a density ratio to correct the discrepancy between the state distribution of the behavior policy and that of the target policy.
4: Our work goes beyond existing works on the optimality of policy gradient methods in that
5: existing works use the exact policy gradient for updating the policy parameters
6: while we use an approximate and stochastic update step.
7: Our update step is not a gradient update because we do not use a density ratio to correct the state distribution,
8: which aligns well with what practitioners do.
9: Our update is approximate because we use a learned critic instead of the true value function.
10: Our update is stochastic because at each step the update is done for only the current state-action pair.
11: Moreover,
12: we remove several restrictive assumptions from existing works in our analysis.
13: Central to our work is the finite sample analysis of a generic stochastic approximation algorithm with time-inhomogeneous update operators on time-inhomogeneous Markov chains,
14: based on its uniform contraction properties.
15: \let\svthefootnote\thefootnote
16: \let\thefootnote\relax\footnotetext{$^\dagger$ indicates equal advising.}
17: \let\thefootnote\svthefootnote
18: \end{abstract}
19: