\begin{abstract}
\looseness-1
Reinforcement learning algorithms discover policies that maximize
reward, but do not necessarily guarantee safety during the learning or execution pha\-ses. We introduce a new approach to learn optimal policies while enforcing properties expressed in temporal logic.
To this end, given the temporal logic specification that the learning system must obey, we propose to synthesize a reactive system called a \emph{shield}.
%
The shield is introduced into the traditional learning process in one of two
alternative ways, depending on where the shield is implemented.
In the first, the shield acts each time the learning agent is about to make a decision and provides a list of safe actions.
In the second, the shield is placed after the learning agent: it monitors the
learner's actions and corrects them only if the chosen action would cause a violation of the specification.
%
We discuss which requirements a shield must meet to preserve the convergence guarantees of the learner.
Finally, we demonstrate the versatility of our approach on several challenging
reinforcement learning scenarios.
\end{abstract}