
\begin{abstract}

Offline reinforcement learning suffers from the out-of-distribution issue and extrapolation error. Most policy constraint methods regularize the density of the trained policy towards the behavior policy, which is too restrictive in most cases.

We propose Supported Trust Region optimization~(STR), which performs trust region policy optimization with the policy constrained within the support of the behavior policy, enjoying the less restrictive \textit{support constraint}. We show that, assuming no approximation or sampling error, STR guarantees strict policy improvement until convergence to the optimal support-constrained policy in the dataset. Furthermore, with both errors incorporated, STR still guarantees safe policy improvement at each step.

Empirical results validate the theory of STR and demonstrate its state-of-the-art performance on MuJoCo locomotion domains and the much more challenging AntMaze domains.

\end{abstract}
