\begin{abstract}
Stochastic methods with coordinate-wise adaptive stepsizes (such as
RMSprop and Adam) have been widely used in training deep neural networks.
Despite their fast convergence,
they can generalize worse than stochastic gradient descent.
In this paper, by revisiting the design of Adagrad, we propose
to split the network parameters into blocks and
use a blockwise adaptive stepsize.
Intuitively, blockwise adaptivity is less aggressive than
adaptivity to individual coordinates,
and can strike a better balance between adaptivity and generalization.
We show theoretically
that the proposed blockwise adaptive gradient
descent has a convergence rate comparable to that of its counterpart with
coordinate-wise adaptive stepsizes, but is faster
by a constant factor.
We also study its uniform stability
and show that blockwise adaptivity can lead to lower generalization error than coordinate-wise adaptivity.
Experimental results show that blockwise adaptive gradient
descent converges faster
and generalizes better than
Nesterov's accelerated gradient and Adam.
\end{abstract}