\begin{abstract}
We introduce a class of variational actor-critic algorithms based on a variational formulation
over both the value function and the policy. The objective function of this formulation
consists of two parts: one maximizes the value function and the other minimizes the
Bellman residual. In addition to vanilla gradient descent with updates of both the value function and
the policy, we propose two variants, the clipping method and the flipping method, to
accelerate convergence. We also prove that, when the prefactor of the Bellman residual is
sufficiently large, the fixed point of the algorithm is close to the optimal policy.
\end{abstract}
