1: \begin{abstract}
2: This work investigates the formal policy synthesis of continuous-state stochastic dynamic systems given high-level specifications in linear temporal logic.
3: To learn an optimal policy that maximizes the satisfaction probability, we take a product between a dynamic system and the translated automaton to construct a product system on which we solve an optimal planning problem.
4: Since this product system has a hybrid product state space that results in reward sparsity, we introduce a generalized optimal backup order, in reverse to the topological order, to guide the value backups and accelerate the learning process.
5: We provide the optimality proof for using the generalized optimal backup order in this optimal planning problem.
6: Further, this paper presents an actor-critic reinforcement learning algorithm when topological order applies. This algorithm leverages advanced mathematical techniques and enjoys the property of hyperparameter self-tuning.
7: We provide proof of the optimality and convergence of our proposed reinforcement learning algorithm. We use neural networks to approximate the value function and policy function for hybrid product state space.
8: Furthermore, we observe that assigning integer numbers to automaton states can rank the value or policy function approximated by neural networks.
9: To break the ordinal relationship, we use an individual neural network for each automaton state's value (policy) function, termed modular learning.
10: We conduct two experiments. First, to show the efficacy of our reinforcement learning algorithm, we compare it with baselines on a classic control task, CartPole.
11: Second, we demonstrate the empirical performance of our formal policy synthesis framework on motion planning of a Dubins car with a temporal specification.
12: \end{abstract}
13: