\begin{abstract}
We propose a novel reinforcement learning scheme for synthesizing policies for continuous-space Markov decision processes (MDPs).
This scheme enables one to apply model-free, off-the-shelf reinforcement
learning algorithms for finite MDPs to compute optimal strategies for the
corresponding continuous-space MDPs without explicitly constructing the
finite-state abstraction.
The proposed approach is based on abstracting the system with a finite MDP with
\emph{unknown} transition probabilities, synthesizing strategies over the abstract
MDP, and then mapping the results back to the concrete continuous-space MDP
with \emph{approximate optimality guarantees}.
The properties of interest for the system belong to a fragment of linear temporal logic,
known as syntactically co-safe linear temporal logic (scLTL), and the synthesis
requirement is to maximize the probability of satisfaction within a given
bounded time horizon.
A key contribution of the paper is to leverage the classical convergence results for reinforcement learning on
finite MDPs and provide control strategies maximizing the probability
of satisfaction over unknown, continuous-space MDPs while providing probabilistic closeness
guarantees.
Automata-based reward functions are often sparse; we present a novel
potential-based reward shaping technique that produces dense rewards and thereby speeds up learning.
The effectiveness of the proposed approach is demonstrated by
applying it to three physical benchmarks: the regulation of
a room's temperature, the control of a road traffic cell, and the control of a $7$-dimensional nonlinear model of a BMW 320i car.
\end{abstract}