c66b4d107f5dfc9a.tex
1: \begin{abstract}
2:   This work investigates the formal policy synthesis of continuous-state stochastic dynamic systems given high-level specifications in linear temporal logic.
3:   To learn an optimal policy that maximizes the satisfaction probability, we take a product between a dynamic system and the translated automaton to construct a product system on which we solve an optimal planning problem.
4:   Since this product system has a hybrid product state space that results in reward sparsity, we introduce a generalized optimal backup order, in reverse to the topological order, to guide the value backups and accelerate the learning process.
5:   We provide the optimality proof for using the generalized optimal backup order in this optimal planning problem.
6:   Further, this paper presents an actor-critic reinforcement learning algorithm when topological order applies. This algorithm leverages advanced mathematical techniques and enjoys the property of hyperparameter self-tuning.
7:   We provide proof of the optimality and convergence of our proposed reinforcement learning algorithm. We use neural networks to approximate the value function and policy function for hybrid product state space.
8:   Furthermore, we observe that assigning integer numbers to automaton states can rank the value or policy function approximated by neural networks.
9:   To break the ordinal relationship, we use an individual neural network for each automaton state's value (policy) function, termed modular learning.
10:   We conduct two experiments. First, to show the efficacy of our reinforcement learning algorithm, we compare it with baselines on a classic control task, CartPole.
11:   Second, we demonstrate the empirical performance of our formal policy synthesis framework on motion planning of a Dubins car with a temporal specification.
12: \end{abstract}
13: