abstract:d85a2ec12445333f.tex

1: \begin{abstract}

2:     Reinforcement learning lacks a principled measure of optimality, causing research to rely on algorithm-to-algorithm or baseline comparisons with no certificate of optimality.

3:     Focusing on finite state and action Markov decision processes (MDPs), we develop a simple, computable gap function that provides both upper and lower bounds on the optimality gap.

4:     Therefore, convergence of the gap function is a stronger mode of convergence than convergence of the optimality gap, and it is equivalent to a new notion we call distribution-free convergence, where convergence is independent of any problem-dependent distribution.

5:     We show that basic policy mirror descent exhibits fast distribution-free convergence in both the deterministic and stochastic settings.

6:     We leverage distribution-free convergence to uncover a couple of new results.

7:     First, deterministic policy mirror descent can solve unregularized MDPs in strongly polynomial time.

8:     Second, accuracy estimates can be obtained with no additional samples while running stochastic policy mirror descent and can be used as a termination criterion, which can be verified in the validation step.

9:     % We propose a post-processing, validation step to verify that accuracy of a single policy as well.

10: \end{abstract}

11: