\begin{abstract}
In this paper, we introduce \textsc{Apollo}, a quasi-Newton method for nonconvex stochastic optimization, which dynamically incorporates the curvature of the loss function by approximating the Hessian with a diagonal matrix.
Importantly, updating and storing the diagonal approximation of the Hessian is as efficient as in adaptive first-order optimization methods, with linear complexity in both time and memory.
To handle nonconvexity, we replace the Hessian approximation with its rectified absolute value, which is guaranteed to be positive-definite.
Experiments on three vision and language tasks show that \textsc{Apollo} achieves significant improvements over other stochastic optimization methods, including SGD and variants of Adam, in terms of both convergence speed and generalization performance.
The implementation of the algorithm is available at \url{https://github.com/XuezheMax/apollo}.
\end{abstract}
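
% Illustrative sketch only: the symbols B, D, \sigma, \eta, g_t, and \theta_t are
% assumptions introduced for this example and do not appear in the abstract.
To make the rectification idea concrete, the following is a minimal sketch under stated assumptions rather than the exact rule used by the method. Suppose $B$ is a diagonal approximation of the Hessian with entries $B_{ii}$, and $\sigma > 0$ is a hypothetical threshold. One possible rectified absolute value, together with the resulting preconditioned step, is
\[
  D_{ii} = \max\bigl(\lvert B_{ii} \rvert,\ \sigma\bigr),
  \qquad
  \theta_{t+1} = \theta_t - \eta\, D^{-1} g_t ,
\]
where $g_t$ denotes a stochastic gradient and $\eta$ a step size. Because every diagonal entry satisfies $D_{ii} \ge \sigma > 0$, the matrix $D$ is positive-definite by construction, and since $D$ is diagonal, forming and applying $D^{-1}$ costs time and memory linear in the number of parameters.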