abstract:e2d8cde13db1a731.tex

1: \begin{abstract}

2: Due to the nature of risk management in learning applicable policies, risk-sensitive reinforcement learning (RSRL) has been realized as an important direction.

3: RSRL is usually achieved by learning risk-sensitive objectives characterized by various risk measures, under the framework of distributional reinforcement learning.

4: However, it remains unclear if the distributional Bellman operator properly optimizes the RSRL objective in the sense

5: of risk measures.

6: In this paper, we prove that the existing RSRL methods do not achieve unbiased optimization and can not guarantee optimality or even improvements regarding risk measures over accumulated return distributions.

7: To remedy this issue, we further propose a novel algorithm, namely Trajectory Q-Learning (TQL), for RSRL problems with provable convergence to the optimal policy.

8: Based on our new learning architecture, we are free to introduce a general and practical implementation for different risk measures to learn disparate risk-sensitive policies.

9: In the experiments, we verify the learnability of our algorithm and show how our method effectively achieves better performances toward risk-sensitive objectives.

10: \end{abstract}

11: