
\begin{abstract}

This paper addresses the problem of maintaining safety during training in Reinforcement Learning (RL), such that safety constraint violations remain bounded at any point during learning.

In many RL applications, the safety of the agent is particularly important, e.g., for autonomous platforms or robots that work in proximity to humans.

%Thus, researchers are paying increasing attention not only to maximise the long-term task-driven reward, but also to damage avoidance.

Because enforcing safety during training might severely limit the agent's exploration, we propose a new architecture that handles the trade-off between efficient progress and safety during exploration.

As exploration progresses, we use Bayesian inference to update Dirichlet-Categorical models of the transition probabilities of the Markov decision process that describes the environment dynamics. We propose a way to approximate the moments of the belief about the risk associated with the action selection policy.

We construct these approximations and prove the corresponding convergence results.

We propose a novel method for leveraging the expectation approximations to derive an approximate bound on the confidence that the risk is below a certain level.

This approach can be easily interleaved with RL, and we present experimental results to showcase the performance of the overall architecture.

\end{abstract}
