\begin{abstract}
Offline reinforcement learning suffers from the out-of-distribution issue and extrapolation error. Most policy constraint methods regularize the density of the trained policy towards the behavior policy, which is too restrictive in most cases.
We propose Supported Trust Region optimization~(STR), which performs trust region policy optimization with the policy constrained within the support of the behavior policy, enjoying the less restrictive \textit{support constraint}.
We show that, assuming no approximation or sampling error, STR guarantees strict policy improvement until convergence to the optimal support-constrained policy in the dataset. Furthermore, with both errors incorporated, STR still guarantees safe policy improvement at each step.
Empirical results validate the theory of STR and demonstrate its state-of-the-art performance on MuJoCo locomotion domains and the much more challenging AntMaze domains.
\end{abstract}
