\begin{abstract}
% significant
Inverse Reinforcement Learning (IRL) is attractive in scenarios where hand-engineering a reward function is tedious.
However, prior IRL algorithms rely on on-policy transitions, which require extensive sampling from the current policy to achieve stable and optimal performance.
This limits the applicability of IRL in real-world settings, where environment interactions can be highly expensive.
To address this problem, we present Off-Policy Inverse Reinforcement Learning (OPIRL). OPIRL (1) adopts an off-policy data distribution instead of an on-policy one, significantly reducing the number of interactions with the environment; (2) learns a transferable reward function that generalizes well under changing dynamics; and (3) leverages mode-covering behavior for faster convergence.
Through experiments, we demonstrate that our method is considerably more sample-efficient and generalizes to novel environments.
In terms of policy performance, our method matches or outperforms the baselines while requiring significantly fewer interactions.
Furthermore, we empirically show that the recovered reward function generalizes to different tasks where prior methods are prone to fail.

11: \end{abstract}

12: