abstract:2b2a110ba14c24f2.tex

1: \begin{abstract}

2: The inputs and preferences of human users are important considerations in situations where these users interact with autonomous cyber or cyber-physical systems.

3: In these scenarios, one is often interested in aligning behaviors of the system with the preferences of one or more human users.

4: Cumulative prospect theory (CPT) is a paradigm that has been empirically shown to model a tendency of humans to view gains and losses differently.

5: In this paper, we consider a setting where an autonomous agent has to learn behaviors in an unknown environment.

6: In traditional reinforcement learning, these behaviors are learned through repeated interactions with the environment by optimizing an expected utility.

7: To endow the agent with the ability to closely mimic the behavior of human users, we instead optimize a CPT-based cost.

8: We introduce the notion of the CPT-value of an action taken in a state, and establish the convergence of an iterative dynamic programming-based approach to estimate this quantity.

9: We develop two algorithms that enable agents to learn policies that optimize the CPT-value, and evaluate these algorithms in environments where a target state must be reached while avoiding obstacles.

10: We demonstrate that the behaviors the agent learns using these algorithms are better aligned with those of a human user who might be placed in the same environment, and that this alignment is significantly improved over a baseline that optimizes an expected utility.

11: \end{abstract}

12: