\begin{abstract}
  In this work we consider the problem of policy optimization in the context of reinforcement learning. To avoid discretization, we select the optimal policy to be a continuous function belonging to a reproducing kernel Hilbert space (RKHS) that maximizes an expected cumulative reward (ECR). We design a policy gradient algorithm (PGA) in this context, deriving the gradients of the functional ECR and learning the unknown state transition probabilities along the way. In particular, we propose an unbiased stochastic approximation of the gradient that requires a finite number of steps. This unbiased estimator is the key enabler for a novel stochastic PGA, which provably converges to a critical point of the ECR. However, the RKHS approach increases the model order at each iteration by adding extra kernels, which may render the numerical complexity prohibitive. To overcome this limitation, we prune the kernel dictionary using an orthogonal matching pursuit procedure, and prove that the modified method keeps the model order bounded for all iterations while ensuring convergence to a neighborhood of the critical point.
\end{abstract}