\begin{abstract}
Stochastic methods with coordinate-wise adaptive stepsizes (such as
RMSprop and Adam) have been widely used in training deep neural networks.
Despite their fast convergence,
they can generalize worse than stochastic gradient descent.
In this paper, by revisiting the design of Adagrad, we propose
to split the network parameters into blocks and
use a blockwise adaptive stepsize.
Intuitively,
blockwise adaptivity is less aggressive than
adaptivity to individual coordinates,
and can strike a better balance between adaptivity and generalization.
We show theoretically
that the proposed blockwise adaptive gradient descent
has a convergence rate comparable to that of its counterpart with
coordinate-wise adaptive stepsizes, but is faster
by some constant factor.
We also study its uniform stability
and show that blockwise adaptivity can lead to lower generalization error
than coordinate-wise adaptivity.
Experimental results show that blockwise adaptive gradient descent
converges faster
and generalizes better than
Nesterov's accelerated gradient and Adam.
\end{abstract}
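
% Illustrative sketch only: the notation below is assumed for exposition
% and is not taken from the abstract itself.
\paragraph{Illustrative sketch.}
To make the idea of a blockwise adaptive stepsize concrete, the following is
one plausible Adagrad-style form, stated under assumed notation rather than
as the exact algorithm analyzed in the paper.
Suppose the parameters $\theta$ are partitioned into $B$ blocks (for example,
one block per layer), let $g_{t,b}$ denote the stochastic gradient restricted
to block $b$ at step $t$, let $d_b$ be the number of coordinates in block $b$,
and let $\eta > 0$ and $\epsilon > 0$ be a base stepsize and a small constant.
Each block then keeps a single scalar accumulator $v_{t,b}$, with $v_{0,b} = 0$:
\[
  v_{t,b} = v_{t-1,b} + \frac{1}{d_b}\,\| g_{t,b} \|_2^2 ,
\]
\[
  \theta_{t+1,b} = \theta_{t,b} - \frac{\eta}{\sqrt{v_{t,b}} + \epsilon}\, g_{t,b} .
\]
Under this assumed form, putting every coordinate in its own block ($d_b = 1$)
recovers the coordinate-wise Adagrad accumulator
$v_{t,i} = v_{t-1,i} + g_{t,i}^2$,
while a single block containing all parameters yields one global adaptive
stepsize. Intermediate partitions sit between these two extremes.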