
\begin{abstract}

Recent control algorithms for Markov decision processes (MDPs) have been designed using an implicit analogy with well-established optimization algorithms.

In this paper, we review this analogy across four problem classes using a unified solution characterization, which allows algorithms to be transformed systematically from one domain to the other.

In particular, we identify equivalent optimization and control algorithms that have already been noted in the existing literature, though mostly in a scattered way.

With this unifying framework in mind, we adopt the quasi-Newton method from convex optimization to introduce a novel control algorithm, which we call quasi-policy iteration (QPI).

QPI is based on a novel approximation of the ``Hessian'' matrix in the policy iteration algorithm that exploits two linear structural constraints specific to MDPs and allows prior information on the transition probability kernel to be incorporated.

While the proposed algorithm has the same computational complexity as value iteration, it exhibits empirical convergence behavior similar to that of policy iteration, with very low sensitivity to the discount factor.


\smallskip

\noindent \textsc{Keywords:} Dynamic programming, reinforcement learning, optimization algorithms, quasi-Newton methods, Markov decision processes.


\end{abstract}
