\begin{abstract}
The inputs and preferences of human users are important considerations when these users interact with autonomous cyber or cyber-physical systems.
In such scenarios, one is often interested in aligning the behaviors of the system with the preferences of one or more human users.
Cumulative prospect theory (CPT) is a paradigm that has been empirically shown to model the tendency of humans to view gains and losses differently.
In this paper, we consider a setting where an autonomous agent has to learn behaviors in an unknown environment.
In traditional reinforcement learning, these behaviors are learned through repeated interactions with the environment by optimizing an expected utility.
To endow the agent with the ability to closely mimic the behavior of human users, we instead optimize a CPT-based cost.
We introduce the notion of the CPT-value of an action taken in a state, and establish the convergence of an iterative dynamic programming-based approach to estimate this quantity.
We develop two algorithms that enable agents to learn policies that optimize the CPT-value, and evaluate these algorithms in environments where a target state has to be reached while avoiding obstacles.
We demonstrate that the behaviors the agent learns using these algorithms are better aligned with those of a human user who might be placed in the same environment, and that this alignment is significantly improved over a baseline that optimizes an expected utility.
\end{abstract}