\begin{abstract}
Deep Actor-Critic algorithms, which combine Actor-Critic with deep
neural networks (DNNs), have been among the most prevalent reinforcement
learning algorithms for decision-making problems in simulated environments.
However, existing deep Actor-Critic algorithms are not yet mature enough
to solve realistic problems with non-convex stochastic constraints
and a high cost of interacting with the environment. In this paper, we
propose a single-loop deep Actor-Critic (SLDAC) algorithmic framework
for general constrained reinforcement learning (CRL) problems. In
the actor step, the constrained stochastic successive convex approximation
(CSSCA) method is applied to handle the non-convex stochastic objective
and constraints. In the critic step, the critic DNNs are updated only
once or a finite number of times per iteration, which simplifies the
algorithm to a single-loop framework (existing works require a
sufficient number of critic updates in each iteration to ensure that
the inner loop converges well enough). Moreover,
the variance of the policy gradient estimation is reduced by reusing
observations from the old policy. The single-loop design and the observation
reuse effectively reduce the agent-environment interaction cost and
computational complexity. Despite the biased policy gradient estimation
incurred by the single-loop design and observation reuse, we prove
that SLDAC with a feasible initial point converges to a Karush-Kuhn-Tucker
(KKT) point of the original problem almost surely. Simulations show
that the SLDAC algorithm can achieve superior performance with much
lower interaction cost.
\end{abstract}