\begin{abstract}
We leverage self-supervised learning to reduce the number of preference labels required to learn a reward function and to improve final policy performance. During self-supervised training, environment dynamics are encoded in a state-action representation via a next-state prediction objective; we refer to this form of self-supervised training as self-future consistency (SFC). Our approach targets off-policy reinforcement learning methods and uses the trajectories stored in the replay buffer to train on the SFC objective. We extend, compare against, and demonstrate improvements over PEBBLE, the current state of the art for preference learning in deep reinforcement learning. Our improvements specifically target policy performance on the ground-truth reward and the number of episodes required for policy convergence. With SFC training incorporated into PEBBLE, policies (1) train faster for all amounts of preference feedback and (2) when trained on smaller amounts of feedback (e.g., 250 and 500), are competitive with those trained on larger amounts of feedback (e.g., 1000 and 2000). Our approach thus improves on one of the main weaknesses of preference-based reward learning methods, namely sample complexity.
\end{abstract}