\begin{abstract}%
Preconditioned gradient methods are among the most general and powerful tools
in optimization. However, preconditioning requires storing and manipulating
prohibitively large matrices. We describe and analyze a new structure-aware
preconditioning algorithm, called \NAME, for stochastic optimization over
tensor spaces. \NAME maintains a set of preconditioning matrices, each of
which operates on a single dimension, contracting over the remaining
dimensions. We establish convergence guarantees in the stochastic convex
setting, the proof of which builds upon matrix trace inequalities. Our
experiments with state-of-the-art deep learning models show that \NAME is
capable of converging considerably faster than commonly used optimizers.
Although it involves a more complex update rule, \NAME's runtime per step is
comparable to that of simple gradient methods such as SGD, AdaGrad, and Adam.%
\end{abstract}