\begin{abstract}
We study a new two-time-scale stochastic gradient method for solving optimization problems
in which the gradients are computed with the aid of an auxiliary variable from samples generated by time-varying Markov random processes parameterized by the underlying optimization variable. These time-varying samples make the gradient directions in our update biased and dependent, which can potentially cause the iterates to diverge. In our two-time-scale approach, one scale estimates the true gradient from these samples, and the other uses this estimate to update the estimate of the optimal solution.
While these two iterates are updated simultaneously, the former is updated ``faster'' (using larger step sizes) than the latter (using smaller step sizes). Our first contribution is to characterize the finite-time complexity of the proposed two-time-scale stochastic gradient method. In particular, we provide explicit formulas for the convergence rates of this method under different structural assumptions, namely, strong convexity, convexity, the Polyak-\L ojasiewicz (P\L) condition, and general non-convexity.

We apply our framework to two problems in control and reinforcement learning.
First, we consider the standard online actor-critic algorithm over finite state and action spaces and derive a convergence rate of $\mathcal{O}(k^{-2/5})$, which recovers the best known rate derived specifically for this problem. Second, we study an online actor-critic algorithm for the linear-quadratic regulator and show that it achieves a convergence rate of $\mathcal{O}(k^{-2/3})$. This is the first such result in the literature. Finally, we support our theoretical analysis with numerical simulations that visualize the convergence rates.
\end{abstract}
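
As a rough illustration of the two-time-scale structure summarized in the abstract, one possible way to write the coupled updates is sketched below. The notation is assumed here for illustration only and is not fixed by the abstract: $\theta_k$ denotes the slow iterate (the estimate of the optimal solution), $\omega_k$ the fast auxiliary iterate used to estimate the gradient, $X_k$ a sample from a Markov process whose transition kernel depends on $\theta_k$, $H$ and $G$ generic sample-based update operators, and $\alpha_k$, $\beta_k$ the two step sizes.
\[
\theta_{k+1} = \theta_k - \alpha_k\, H(\theta_k, \omega_k, X_k),
\qquad
\omega_{k+1} = \omega_k - \beta_k\, G(\theta_k, \omega_k, X_k).
\]
In this sketch, updating $\omega_k$ ``faster'' corresponds to choosing $\beta_k$ larger than $\alpha_k$, for example with $\alpha_k / \beta_k \to 0$ as $k \to \infty$, so that $\omega_k$ can track the gradient information associated with the slowly changing $\theta_k$.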