\begin{abstract}
We consider the problem of finding a control policy for a Markov
Decision Process (MDP) that maximizes the probability of reaching a
given set of states while avoiding another set of states. The problem is
motivated by applications in robotics, where it arises naturally when
probabilistic models of robot motion are required to satisfy temporal
logic task specifications. We transform this problem into a Stochastic
Shortest Path (SSP) problem and develop a new approximate dynamic
programming algorithm to solve it. The algorithm is of the actor-critic
type and uses a least-squares temporal difference learning method. It
operates on sample paths of the system and optimizes the policy within a
pre-specified class described by a parsimonious set of
parameters. We show that it converges to a policy corresponding to a
stationary point in the parameter space. Simulation results confirm
the effectiveness of the proposed solution.
\end{abstract}
