\begin{abstract}
Advantage actor-critic (A2C), an important algorithm in deep reinforcement learning, has been applied
successfully to both discrete and continuous control tasks with raw pixel inputs, but its sample
efficiency still needs improvement. In traditional reinforcement learning, actor-critic algorithms generally use the recursive
least squares (RLS) method to update the parameters of linear function approximators and thereby accelerate
their convergence. However, A2C algorithms seldom use RLS to train deep neural
networks (DNNs) to improve their sample efficiency. In this paper, we propose two novel RLS-based A2C
algorithms, called RLSSA2C and RLSNA2C, and investigate their performance. Both algorithms
use the RLS method to train the critic network and the hidden layers of the actor network.
They differ mainly in the policy learning step: RLSSA2C learns the policy parameter with an ordinary
first-order gradient descent algorithm and the standard policy gradient, whereas
RLSNA2C uses the Kronecker-factored approximation, the RLS method, and the natural policy gradient
to learn the compatible parameter and the policy parameter. In addition, we analyze the complexity
and convergence of both algorithms and present three tricks that further improve their convergence speed.
Finally, we demonstrate the effectiveness of both algorithms on 40 games in the Atari 2600 environment and 11 tasks in the MuJoCo environment.
The experimental results show that both algorithms achieve better sample efficiency than vanilla
A2C on most games and tasks, and higher computational efficiency than two other
state-of-the-art algorithms.
\end{abstract}
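For reference, the standard RLS recursion for a linear approximator $\hat{y}_t = \mathbf{x}_t^{\top}\boldsymbol{\theta}$ can be sketched as follows. This assumes a forgetting factor $\mu \in (0,1]$ and uses illustrative notation, namely $\mathbf{x}_t$ (feature vector), $y_t$ (target), $\mathbf{P}_t$ (inverse-correlation matrix estimate, typically initialized as $\mathbf{P}_0 = \delta^{-1}\mathbf{I}$ for a small $\delta > 0$), and $\mathbf{k}_t$ (gain vector). These symbols are not necessarily those used for the layer-wise DNN updates in RLSSA2C and RLSNA2C.
\begin{align*}
\mathbf{k}_t &= \frac{\mathbf{P}_{t-1}\mathbf{x}_t}{\mu + \mathbf{x}_t^{\top}\mathbf{P}_{t-1}\mathbf{x}_t}, \\
\boldsymbol{\theta}_t &= \boldsymbol{\theta}_{t-1} + \mathbf{k}_t\left(y_t - \mathbf{x}_t^{\top}\boldsymbol{\theta}_{t-1}\right), \\
\mathbf{P}_t &= \frac{1}{\mu}\left(\mathbf{P}_{t-1} - \mathbf{k}_t\mathbf{x}_t^{\top}\mathbf{P}_{t-1}\right).
\end{align*}
Here $\mathbf{P}_t$ tracks the inverse of the discounted feature autocorrelation matrix, so the gain $\mathbf{k}_t$ rescales each update in a second-order-like manner. This is what typically lets RLS converge faster than plain first-order gradient descent on linear models.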