1: \begin{abstract}
2: Designing reward functions is a longstanding challenge in reinforcement learning (RL): it requires specialized knowledge or domain data, which makes development costly.
3: To address this, we introduce \ourmethod, a data-free framework that automates the generation and shaping of dense reward functions based on large language models (LLMs).
4: Given a goal described in natural language, \ourmethod generates a shaped dense reward function as an executable program grounded in a compact representation of the environment.
5: Unlike inverse RL and recent work that uses LLMs to write sparse reward code or unshaped dense rewards with a constant function across timesteps, \ourmethod produces interpretable, free-form dense reward code that covers a wide range of tasks, utilizes existing packages, and allows iterative refinement with human feedback.
6: We evaluate \ourmethod on two robotic manipulation benchmarks (\textsc{ManiSkill2}, \textsc{MetaWorld}) and two locomotion environments in \textsc{MuJoCo}.
7: On 13 of the 17 manipulation tasks, policies trained with generated reward code achieve task success rates and convergence speeds comparable to or better than those of policies trained with expert-written reward code.
8: For locomotion tasks, our method learns six novel locomotion behaviors with a success rate exceeding 94\%.
9: Furthermore, we show that the policies trained in the simulator with our method can be deployed in the real world.
10: Finally, \ourmethod further improves the policies by refining their reward functions with human feedback. Video results are available at
11: \url{https://text-to-reward.github.io/}
12:
13: \end{abstract}
14: