\begin{abstract}
Parameter-specific adaptive learning rate methods are computationally efficient
ways to mitigate the ill-conditioning problems encountered when training
large deep networks. Following recent work that strongly suggests that most of the
critical points encountered when training such networks are saddle points,
we explore how accounting for the negative eigenvalues of the Hessian
can help us design better-suited adaptive learning rate schemes.
We show that the popular Jacobi preconditioner has undesirable behavior
in the presence of both positive and negative curvature, and present theoretical and
empirical evidence that the so-called equilibration preconditioner is comparatively
better suited to non-convex problems. We introduce a novel adaptive learning rate scheme,
called ESGD, based on the equilibration preconditioner.
Our experiments show that ESGD performs as well as or better than RMSProp in terms of
convergence speed, and always clearly improves over plain stochastic gradient
descent.\footnote{An implementation is freely available at \url{https://gist.github.com/ynd/f1ce7133a03ec54d6eb9}.}
\end{abstract}