\begin{abstract}
As an important algorithm in deep reinforcement learning, advantage actor-critic (A2C)
has achieved wide success in both discrete and continuous control tasks with raw pixel inputs, but its sample
efficiency still needs improvement. In traditional reinforcement learning, actor-critic algorithms generally use the recursive
least squares (RLS) technique to update the parameters of linear function approximators and thereby accelerate
their convergence. However, A2C algorithms seldom use this technique to train deep neural
networks (DNNs) to improve their sample efficiency. In this paper, we propose two novel RLS-based A2C
algorithms and investigate their performance. Both proposed algorithms, called RLSSA2C and RLSNA2C,
use the RLS method to train the critic network and the hidden layers of the actor network.
The main difference between them lies in the policy learning step. RLSSA2C uses an ordinary first-order gradient descent algorithm
and the standard policy gradient to learn the policy parameter.
RLSNA2C uses the Kronecker-factored approximation, the RLS method, and the natural policy gradient
to learn the compatible parameter and the policy parameter. In addition, we analyze the complexity
and convergence of both algorithms, and present three tricks for further improving their convergence speed.
Finally, we demonstrate the effectiveness of both algorithms on 40 games in the Atari 2600 environment and 11 tasks in the MuJoCo environment.
The experimental results show that both of our algorithms have better sample efficiency than the vanilla
A2C on most games or tasks, and have higher computational efficiency than two other
state-of-the-art algorithms.
\end{abstract}