1: \begin{abstract}
2: 	
3: 
4: The paper introduces the first formulation of convex Q-learning for Markov decision processes with function approximation.   The algorithms and theory rest on a relaxation of a dual of Manne's celebrated linear programming characterization of optimal control.
5: The first set of contributions concerns properties of the relaxation, described as a deterministic convex program:  we identify conditions for a bounded solution,  and a significant relationship between the solution to the new convex program and the solution to standard Q-learning.   The second set of contributions concerns algorithm design and analysis:    
6: (i) A direct model-free method for approximating the convex program for Q-learning shares properties with its ideal.   In particular, a bounded solution is ensured subject to a simple property of the basis functions;  (ii)  The proposed algorithms are convergent, and new techniques are introduced to obtain the rate of convergence in a mean-square sense;   
7: (iii) The approach can be generalized to a range of performance criteria, and it is found that variance can be reduced by considering ``relative'' dynamic programming equations;  
8: (iv) The theory is illustrated with an application to a classical inventory control problem.
9: 
10: 
11: \noindent
12: This is an extended version of an article to appear in the forthcoming IEEE Conference on Decision and Control.
13: 
14: 
15: 
16: \end{abstract}
17: