\begin{abstract}
A novel reinforcement learning scheme to synthesize policies for continuous-space Markov decision processes (MDPs) is proposed.
This scheme enables one to apply model-free, off-the-shelf reinforcement
learning algorithms for finite MDPs to compute optimal strategies for the
corresponding continuous-space MDPs without explicitly constructing the
finite-state abstraction.
The proposed approach is based on abstracting the system by a finite MDP with
\emph{unknown} transition probabilities (without constructing it explicitly), synthesizing strategies over the abstract
MDP, and then mapping the results back to the concrete continuous-space MDP
with \emph{approximate optimality guarantees}.
The properties of interest for the system belong to a fragment of linear temporal logic
known as syntactically co-safe linear temporal logic (scLTL), and the synthesis
requirement is to maximize the probability of satisfaction within a given
bounded time horizon.
A key contribution of the paper is to leverage the classical convergence results for reinforcement learning on
finite MDPs to provide control strategies that maximize the probability
of satisfaction over unknown, continuous-space MDPs while providing probabilistic closeness
guarantees.
Automata-based reward functions are often sparse; we present a novel
potential-based reward shaping technique that produces dense rewards and speeds up learning.
The effectiveness of the proposed approach is demonstrated by
applying it to three physical benchmarks: the regulation of
a room's temperature, the control of a road traffic cell, and the control of a $7$-dimensional nonlinear model of a BMW $320$i car.
\end{abstract}
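
For orientation, potential-based reward shaping in its standard textbook form augments a reward $R$ with a term derived from a potential function $\Phi$ over states. The display below is only a sketch of that general construction: the symbols $R$, $R'$, $F$, $\Phi$, and $\gamma$ are illustrative notation assumed here, and the specific potential used in this work is not stated in the abstract.
\[
  R'(s,a,s') = R(s,a,s') + F(s,a,s'),
  \qquad
  F(s,a,s') = \gamma\,\Phi(s') - \Phi(s),
\]
where $\gamma \in (0,1]$ is the discount factor. Along a trajectory $s_0, s_1, \dots, s_T$ with $\gamma = 1$, the shaping terms telescope,
\[
  \sum_{t=0}^{T-1} F(s_t,a_t,s_{t+1}) = \Phi(s_T) - \Phi(s_0),
\]
so the shaped return differs from the original return only through the potentials of the initial and final states. This is why, under suitable conditions on $\Phi$, shaping can turn a sparse reward into a dense one without changing which strategies are optimal.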