
\begin{abstract}

We propose policy gradient algorithms that learn risk-sensitive policies in a reinforcement learning (RL) framework.

Our proposed algorithms maximize the distortion risk measure (DRM) of the cumulative reward in an episodic Markov decision process, in both on-policy and off-policy RL settings. We derive a variant of the policy gradient theorem that caters to the DRM objective and combine it with a likelihood ratio-based gradient estimation scheme. We also derive non-asymptotic bounds that establish the convergence of the proposed algorithms to an approximate stationary point of the DRM objective.%

\end{abstract}