abstract:9569967f71d2c026.tex

1: \begin{abstract}

2: Although Reinforcement Learning (RL) is effective for sequential decision-making problems under uncertainty, it still fails to thrive in real-world systems where \textit{risk} or \textit{safety} is a binding constraint.

3: %As such, the objective of this paper is to present a general method for risk-constrained Markov decision processes (MDPs) where an adversary is learned to represent a probabilistic unsafe region which the agent should avoid.

4: In this paper, we formulate the RL problem with safety constraints as a non-zero-sum game.

5: When combined with maximum-entropy RL, this formulation leads to a safe, adversarially guided soft actor-critic framework, called \algo. In \algo, the adversary aims to break the safety constraint, while the RL agent aims to maximize the constrained value function given the adversary's policy.

6: The safety constraint on the agent's value function manifests only as a repulsion term between the agent's and the adversary's policies.

7: %First, we provide a minimax convergence analysis of our framework in the case of softmax policies.

8: Unlike previous approaches, \algo can address different safety criteria such as safe exploration, mean-variance risk sensitivity, and CVaR-like coherent risk sensitivity.

9: We illustrate the design of the adversary for these constraints.

10: %(1) learns the best actions to break the constraints, (2) represents the conditional value-at-risk (CVaR) and (3) represents the mean-variance risk sensitive value function.

11: Then, for each of these variations, we show that the agent learns to solve the task while differentiating its policy from the adversary's unsafe actions. Finally, on challenging continuous control tasks, we demonstrate that \algo achieves faster convergence, better efficiency, and fewer safety-constraint violations than risk-averse distributional RL and risk-neutral soft actor-critic algorithms.%, including velocity and acceleration constraints.

12: \end{abstract}

13: