\begin{abstract}
% Learning a general-purpose agent without an external reward is useful

% To enable robots to achieve a diverse set of user-specified goals at test time, the robot must be able to learn skills without an external reward function.
%
Robot manipulation typically requires estimating a low-dimensional state of the environment. Designing such a state estimator can be difficult, however, especially in environments with deformable objects. An alternative is to learn an end-to-end policy that maps high-dimensional sensor inputs directly to actions. But when this policy is trained with reinforcement learning and no state estimator is available, it is hard to specify a reward function over high-dimensional observations. To meet this challenge, we propose a simple indicator reward function for goal-conditioned reinforcement learning: a positive reward is given only when the robot's observation exactly matches a target goal observation. We show that relabeling the original goal with the achieved goal to obtain positive rewards~\cite{andrychowicz2017hindsight} makes it possible to learn with the indicator reward function even in continuous state spaces. We further propose two methods that speed up convergence with indicator rewards: reward balancing and reward filtering. Our method performs comparably to an oracle that uses the ground-truth state to compute rewards, and it can perform complex tasks in continuous state spaces, such as rope manipulation from RGB-D images, without knowledge of the ground-truth state.
\end{abstract}
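
% Illustrative sketch only. The symbols o_t, o_g, r and the reward values 1 and 0
% are assumptions introduced here for illustration; they are not taken from the abstract.
As a minimal sketch of the indicator reward described in the abstract, let $o_t$ denote the robot's observation at time $t$ and $o_g$ the goal observation. This notation, and the choice of $1$ and $0$ as the reward values, are assumptions made for illustration; the abstract states only that a positive reward is given on an exact match. Under these assumptions the reward could be written as
\[
  r(o_t, o_g) =
  \left\{
  \begin{array}{ll}
    1 & \mbox{if } o_t = o_g, \\
    0 & \mbox{otherwise.}
  \end{array}
  \right.
\]
In a continuous observation space an exact match with the original goal almost never occurs, so this reward alone provides essentially no positive signal. Relabeling the goal with an observation $o_{t'}$ that was actually achieved in the same trajectory gives $r(o_{t'}, o_{t'}) = 1$ by construction, which is how hindsight relabeling~\cite{andrychowicz2017hindsight} supplies positive rewards.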