abstract:e32efcb232abdd49.tex

1: \begin{abstract}

2: % We focus on online imitation learning (IL), where the task is to find a policy that ``imitates'' the behavior of an expert by interacting with the environment. Since most practical IL algorithms have weak or no theoretical guarantees on their performance,

3:

4: % we propose theoretically principled algorithms by exploiting the connection to online convex optimization.

5:

6:

7: % For a sequence of smooth, convex losses determined by the IL problem, we prove that DAGGER (a popular IL method) converges to an $\epsilon$-neighborhood of the expert policy at an $O(1/T)$ rate, where $T$ is the number of iterations and $\epsilon$ quantifies the expressivity of the policy. Unlike previous bounds that require strong convexity, we prove this result for general convex functions by leveraging the structure of Markov decision processes. To ensure convergence for a wider class of policies, we propose a scalable implementation of the Follow-the-Regularized-Leader (FTRL) algorithm and its adaptive variants. Like Follow-the-Leader (FTL), these algorithms are statistically efficient and can achieve an $O(1/T + \epsilon/\sqrt{T})$ convergence rate. We demonstrate the effectiveness of the proposed algorithms via experiments on synthetic and high-dimensional control tasks.

8: % \end{abstract}

9: