\begin{abstract}
Recent control algorithms for Markov decision processes (MDPs) have been designed using an implicit analogy with well-established optimization algorithms.
In this paper, we review this analogy across four problem classes with a unified solution characterization that allows for a systematic transformation of algorithms from one domain to the other.
In particular, we identify equivalent optimization and control algorithms that have already been pointed out in the existing literature, but mostly in a scattered way.
With this unifying framework in mind, we adopt the quasi-Newton method from convex optimization to introduce a novel control algorithm, which we call quasi-policy iteration (QPI).
Specifically, QPI is based on a novel approximation of the ``Hessian'' matrix in the policy iteration algorithm that exploits two linear structural constraints specific to MDPs and allows for the incorporation of prior information on the transition probability kernel.
While the proposed algorithm has the same computational complexity as value iteration, it exhibits an empirical convergence behavior similar to policy iteration, with very low sensitivity to the discount factor.

\smallskip
\noindent \textsc{Keywords:} Dynamic programming, reinforcement learning, optimization algorithms, quasi-Newton methods, Markov decision processes.

\end{abstract}