c8b993bf8e7b4a19.tex
1: \begin{abstract}%   <- trailing '%' for backward compatibility of .sty file
2:   In this paper,
3:   we establish the global optimality and convergence rate of an off-policy actor-critic algorithm in the tabular setting without using a density ratio to correct the discrepancy between the state distribution of the behavior policy and that of the target policy.
4:   Our work goes beyond existing works on the optimality of policy gradient methods in that
5:   existing works use the exact policy gradient for updating the policy parameters 
6:   while we use an approximate and stochastic update step.
7:   Our update step is not a gradient update because we do not use a density ratio to correct the state distribution,
8:   which aligns well with what practitioners do.
9:   Our update is approximate because we use a learned critic instead of the true value function.
10:   Our update is stochastic because at each step the update is applied only to the current state-action pair.
11:   Moreover,
12:   we remove several restrictive assumptions from existing works in our analysis.
13:   Central to our work is the finite-sample analysis of a generic stochastic approximation algorithm with time-inhomogeneous update operators on time-inhomogeneous Markov chains,
14:   based on its uniform contraction properties.
15:   \let\svthefootnote\thefootnote
16:   \let\thefootnote\relax\footnotetext{$^\dagger$ indicates equal advising.}
17:   \let\thefootnote\svthefootnote
18: \end{abstract}
19: