
\begin{abstract}

Designing reward functions is a longstanding challenge in reinforcement learning (RL); it requires specialized knowledge or domain data, leading to high development costs.

To address this, we introduce \ourmethod, a data-free framework that uses large language models (LLMs) to automate the generation and shaping of dense reward functions.

Given a goal described in natural language, \ourmethod generates a shaped dense reward function as an executable program grounded in a compact representation of the environment.

Unlike inverse RL and recent work that uses LLMs to write sparse reward codes or unshaped dense rewards with a constant function across timesteps, \ourmethod produces interpretable, free-form dense reward codes that cover a wide range of tasks, make use of existing packages, and allow iterative refinement with human feedback.

We evaluate \ourmethod on two robotic manipulation benchmarks (\textsc{ManiSkill2}, \textsc{MetaWorld}) and two \textsc{MuJoCo} locomotion environments.

On 13 of the 17 manipulation tasks, policies trained with generated reward codes achieve task success rates and convergence speed similar to or better than those of policies trained with expert-written reward codes.

For locomotion tasks, our method learns six novel locomotion behaviors with a success rate exceeding 94\%.

Furthermore, we show that the policies trained in the simulator with our method can be deployed in the real world.

Finally, \ourmethod can further improve the policies by refining their reward functions with human feedback. Video results are available at

\url{https://text-to-reward.github.io/}


\end{abstract}
