
\begin{abstract}

Many modern approaches to offline Reinforcement Learning (RL) use \emph{behavior regularization}, typically augmenting a model-free actor-critic algorithm with a penalty that measures the divergence of the policy from the offline data.

In this work, we propose an alternative way to keep the learned policy close to the data: parameterizing the critic as the $\log$ of the behavior policy that generated the offline data, plus a state-action value offset term that can be learned with a neural network.

Behavior regularization then corresponds to an appropriate regularizer on the offset term.

We propose using a gradient penalty regularizer for the offset term and demonstrate its equivalence to Fisher divergence regularization, suggesting connections to the score matching and generative energy-based model literature.

We thus term our resulting algorithm Fisher-BRC (Behavior Regularized Critic).

On standard offline RL benchmarks, Fisher-BRC achieves both improved performance and faster convergence over existing state-of-the-art methods.\footnote{Code to reproduce our results is available at \url{https://github.com/google-research/google-research/tree/master/fisher_brc}.}

\end{abstract}
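
The parameterization described in the abstract can be sketched in symbols. The notation below is an assumption introduced for illustration and does not appear in the abstract itself: $\mu(a \mid s)$ denotes the behavior policy, $O_\theta(s,a)$ the learned offset term, $Q_\theta(s,a)$ the critic, and $\pi(a \mid s)$ the learned policy. The critic is written as
\[
Q_\theta(s,a) = O_\theta(s,a) + \log \mu(a \mid s).
\]
If one further assumes that the policy is the Boltzmann policy induced by the critic, $\pi(a \mid s) \propto \exp Q_\theta(s,a)$, then the normalizer does not depend on $a$, so
\[
\nabla_a \log \pi(a \mid s) = \nabla_a Q_\theta(s,a) = \nabla_a O_\theta(s,a) + \nabla_a \log \mu(a \mid s).
\]
Under that assumption, the Fisher divergence between $\pi$ and $\mu$ reduces to a gradient penalty on the offset term alone:
\[
\mathrm{E}_{a \sim \pi(\cdot \mid s)} \bigl[ \| \nabla_a \log \pi(a \mid s) - \nabla_a \log \mu(a \mid s) \|^2 \bigr]
= \mathrm{E}_{a \sim \pi(\cdot \mid s)} \bigl[ \| \nabla_a O_\theta(s,a) \|^2 \bigr].
\]
This is only a sketch of why a gradient penalty on the offset can be read as Fisher divergence regularization; the precise objective and its weighting are left to the body of the paper.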
