
\begin{abstract}
Stochastic gradient descent (\textsc{Sgd}) methods are
the most powerful optimization tools for training machine learning and deep
learning models. Acceleration (a.k.a. momentum) methods and
diagonal scaling (a.k.a. adaptive gradient) methods are the two main techniques for improving the
slow convergence of \textsc{Sgd}. While empirical studies have demonstrated
potential advantages of combining these two techniques, it remains unknown whether the resulting methods can achieve the optimal rate of convergence for stochastic optimization.
In this paper, we present a new class of adaptive and accelerated stochastic gradient descent methods
and show that they exhibit the optimal sampling and iteration complexity for stochastic optimization.
More specifically, we show that diagonal scaling, initially designed to improve vanilla stochastic
gradient, can be incorporated into accelerated stochastic gradient descent to achieve the optimal
rate of convergence for smooth stochastic optimization. We also show that
momentum,
apart from being known to speed up the convergence of deterministic optimization,
also provides new ways of designing non-uniform and aggressive moving average schemes in stochastic optimization.
Finally, we present some heuristics that help to implement adaptive accelerated stochastic
gradient descent methods and to further improve their practical
performance for machine learning and deep learning.
\end{abstract}
