\begin{abstract}
Stochastic gradient descent (\textsc{Sgd}) methods are
among the most powerful optimization tools for training machine learning and deep
learning models. Acceleration (a.k.a. momentum) methods and
diagonal scaling (a.k.a. adaptive gradient) methods are the two main techniques for improving the
slow convergence of \textsc{Sgd}. While empirical studies have demonstrated
potential advantages of combining these two techniques, it remains unknown whether the combined methods can achieve the optimal rate of convergence for stochastic optimization.
In this paper, we present a new class of adaptive and accelerated stochastic gradient descent methods
and show that they exhibit the optimal sampling and iteration complexities for stochastic optimization.
More specifically, we show that diagonal scaling, initially designed to improve vanilla stochastic
gradient descent, can be incorporated into accelerated stochastic gradient descent to achieve the optimal
rate of convergence for smooth stochastic optimization. We also show that
momentum, besides being known to improve the convergence rate of deterministic optimization,
provides new ways of designing non-uniform and aggressive moving average schemes in stochastic optimization.
Finally, we present some heuristics that help to implement adaptive accelerated stochastic
gradient descent methods and to further improve their practical
performance for machine learning and deep learning.
\end{abstract}