\begin{abstract}
  We introduce a class of variational actor-critic algorithms based on a variational formulation
  over both the value function and the policy. The objective function of this formulation
  consists of two parts: one for maximizing the value function and the other for minimizing the
  Bellman residual. In addition to vanilla gradient descent with updates of both the value function and
  the policy, we propose two variants, the clipping method and the flipping method, to
  speed up convergence. We also prove that, when the prefactor of the Bellman residual is
  sufficiently large, the fixed point of the algorithm is close to the optimal policy.
\end{abstract}
