
\begin{abstract}
Parameter-specific adaptive learning rate methods are computationally efficient
ways to reduce the ill-conditioning problems encountered when training
large deep networks. Following recent work that strongly suggests that most of the
critical points encountered when training such networks are saddle points,
we investigate how accounting for the negative eigenvalues of the Hessian
could help us design better-suited adaptive learning rate schemes.
We show that the popular Jacobi preconditioner has undesirable behavior
in the presence of both positive and negative curvature, and present theoretical and
empirical evidence that the so-called equilibration preconditioner is comparatively
better suited to non-convex problems. We introduce a novel adaptive learning rate scheme,
called ESGD, based on the equilibration preconditioner.
Our experiments show that ESGD performs as well as or better than RMSProp in terms of convergence speed, and always clearly improves over plain stochastic gradient descent\footnote{An implementation is freely available at \url{https://gist.github.com/ynd/f1ce7133a03ec54d6eb9}}.
\end{abstract}
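
As a hedged illustration of the contrast drawn in the abstract, the two preconditioners can be sketched as follows. This assumes their standard definitions; the symbols $H$ (the Hessian), $D^{\mathrm{J}}$, and $D^{\mathrm{E}}$ are notation introduced here for illustration and do not appear in the abstract. The Jacobi preconditioner scales each parameter by the absolute value of the corresponding diagonal entry of $H$, whereas the equilibration preconditioner scales it by the norm of the corresponding row:
\[
D^{\mathrm{J}}_{ii} = \left| H_{ii} \right|,
\qquad
D^{\mathrm{E}}_{ii} = \left\| H_{i,\cdot} \right\|_2 = \sqrt{\sum_j H_{ij}^2}.
\]
A toy example, chosen only for illustration, is the saddle $f(x, y) = xy$, whose Hessian is
\[
H = \left( \begin{array}{cc} 0 & 1 \\ 1 & 0 \end{array} \right),
\]
with eigenvalues $+1$ and $-1$. Here $D^{\mathrm{J}}_{ii} = 0$ for both parameters, so the Jacobi scaling is degenerate, while $D^{\mathrm{E}}_{ii} = 1$ for both parameters and remains well defined.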
