\begin{abstract}
  We introduce a class of variational actor-critic algorithms based on a variational formulation
  over both the value function and the policy. The objective function of this formulation
  consists of two parts: one maximizes the value function and the other minimizes the
  Bellman residual. In addition to vanilla gradient descent with updates to both the value function and
  the policy, we propose two variants, the clipping method and the flipping method, to
  speed up convergence. We also prove that, when the prefactor of the Bellman residual is
  sufficiently large, the fixed point of the algorithm is close to the optimal policy.
\end{abstract}