1: \begin{abstract}
2:
3: % significant
4: Inverse Reinforcement Learning (IRL) is attractive in scenarios where reward engineering can be tedious.
5: However, prior IRL algorithms use on-policy transitions, which require intensive sampling from the current policy for stable and optimal performance.
6: This limits IRL applications in the real world, where environment interactions can become highly expensive.
7: To tackle this problem, we present Off-Policy Inverse Reinforcement Learning (OPIRL), which (1) adopts an off-policy data distribution instead of an on-policy one, significantly reducing the number of interactions with the environment, (2) learns a reward function that transfers and generalizes well under changing dynamics, and (3) leverages mode-covering behavior for faster convergence.
8: We demonstrate through experiments that our method is considerably more sample efficient and generalizes to novel environments.
9: Our method achieves policy performance better than or comparable to the baselines with significantly fewer interactions.
10: Furthermore, we empirically show that the recovered reward function generalizes to different tasks where prior methods are prone to fail.
11: \end{abstract}
12: