\begin{abstract}
Cumulative prospect theory (CPT) is known to model human decisions well, a claim supported by substantial empirical evidence.
CPT works by distorting probabilities and is more general than the classic expected utility and coherent risk measures. We bring this idea to a risk-sensitive reinforcement learning (RL) setting and design algorithms for both estimation and control.
The RL setting presents two particular challenges when CPT is applied: first, estimating the CPT objective requires estimating the {\it entire distribution} of the value function; second, one must find a {\it randomized} optimal policy.
The estimation scheme that we propose uses the empirical distribution to estimate the CPT-value of a random variable. We then use this scheme in the inner loop of a CPT-value optimization procedure that is based on the well-known simulation optimization idea of simultaneous perturbation stochastic approximation (SPSA).
We provide theoretical convergence guarantees for all the proposed algorithms and also
illustrate the usefulness of CPT-based criteria in a traffic signal control application.
%empirically demonstrate the usefulness of our algorithms.
\end{abstract}