\begin{abstract}
Stochastic gradient descent (\textsc{Sgd}) methods are
the most powerful optimization tools for training machine learning and deep
learning models. Acceleration (a.k.a. momentum) methods and
diagonal scaling (a.k.a. adaptive gradient) methods are the two main techniques for improving the
slow convergence of \textsc{Sgd}. While empirical studies have demonstrated
potential advantages of combining these two techniques, it remains unknown whether the combined methods can achieve the optimal rate of convergence for stochastic optimization.
In this paper, we present a new class of adaptive and accelerated stochastic gradient descent methods
and show that they exhibit the optimal sampling and iteration complexity for stochastic optimization.
More specifically, we show that diagonal scaling, initially designed to improve vanilla stochastic
gradient descent, can be incorporated into accelerated stochastic gradient descent to achieve the optimal
rate of convergence for smooth stochastic optimization. We also show that
momentum,
apart from being known to speed up the convergence of deterministic optimization,
also provides new ways of designing non-uniform and aggressive moving average schemes in stochastic optimization.
Finally, we present some heuristics that help implement adaptive accelerated stochastic
gradient descent methods and further improve their practical
performance for machine learning and deep learning.
\end{abstract}