\begin{abstract}
Many modern approaches to offline Reinforcement Learning (RL) utilize \emph{behavior regularization}, typically augmenting a model-free actor-critic algorithm with a penalty measuring the divergence of the policy from the offline data.
In this work, we propose an alternative approach to encouraging the learned policy to stay close to the data: parameterizing the critic as the $\log$ of the behavior policy that generated the offline data, plus a state-action value offset term that can be learned using a neural network.
Behavior regularization then corresponds to an appropriate regularizer on the offset term.
We propose using a gradient penalty regularizer for the offset term and demonstrate its equivalence to Fisher divergence regularization, suggesting connections to the score matching and generative energy-based model literature.
We thus term our resulting algorithm Fisher-BRC (Behavior Regularized Critic).
On standard offline RL benchmarks, Fisher-BRC achieves both improved performance and faster convergence compared with existing state-of-the-art methods.\footnote{Code to reproduce our results is available at \url{https://github.com/google-research/google-research/tree/master/fisher_brc}.}
\end{abstract}

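\paragraph{Sketch of the parameterization.}
As an illustrative sketch only, the construction summarized in the abstract can be written out as follows. The symbols used here are assumptions introduced for illustration: $\mu$ for the behavior policy, $O_\theta$ for the learned offset network, $\pi$ for the learned policy, $\mathcal{D}$ for the offline dataset, and $\lambda > 0$ for the penalty weight. Under these assumptions, the critic takes the form
\[
  Q_\theta(s, a) = O_\theta(s, a) + \log \mu(a \mid s),
\]
and one plausible form of the gradient penalty on the offset term is
\[
  \lambda \, \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi(\cdot \mid s)}
  \left[ \left\| \nabla_a O_\theta(s, a) \right\|^2 \right].
\]
To see why such a penalty relates to a Fisher divergence, suppose the critic induces a Boltzmann distribution $p_\theta(a \mid s) \propto \exp\left(Q_\theta(s, a)\right)$. Its normalizer does not depend on $a$, so
\[
  \nabla_a \log p_\theta(a \mid s) - \nabla_a \log \mu(a \mid s)
  = \nabla_a Q_\theta(s, a) - \nabla_a \log \mu(a \mid s)
  = \nabla_a O_\theta(s, a).
\]
The squared norm of the left-hand side is the integrand of a Fisher divergence between $p_\theta(\cdot \mid s)$ and $\mu(\cdot \mid s)$, so penalizing $\left\| \nabla_a O_\theta(s, a) \right\|^2$ penalizes the mismatch between the two score functions. The sampling distribution for $a$ and the exact weighting are assumptions of this sketch, not a statement of the precise objective.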