\begin{abstract}
\looseness-1
Reinforcement learning algorithms discover policies that maximize
reward, but they do not necessarily guarantee safety during the learning or execution pha\-ses. We introduce a new approach for learning optimal policies while enforcing properties expressed in temporal logic.
To this end, given a temporal logic specification that the learning system must obey, we propose to synthesize a reactive system called a \emph{shield}.
%
The shield is introduced into the traditional learning process in one of two
ways, depending on where it is placed.
In the first, the shield acts each time the learning agent is about to make a decision and provides a list of safe actions.
In the second, the shield is placed after the learning agent: it monitors the
learner's actions and corrects them only if the chosen action would violate the specification.
%
We discuss which requirements a shield must meet to preserve the convergence guarantees of the learner.
Finally, we demonstrate the versatility of our approach on several challenging
reinforcement learning scenarios.
\end{abstract}
