\begin{abstract}
Stochastic methods with coordinate-wise adaptive stepsizes (such as RMSprop and Adam) are widely used to train deep neural networks.
Despite their fast convergence, they can generalize worse than stochastic gradient descent.
In this paper, by revisiting the design of Adagrad, we propose splitting the network parameters into blocks and using a blockwise adaptive stepsize.
Intuitively, blockwise adaptivity is less aggressive than adaptivity to individual coordinates, and can strike a better balance between adaptivity and generalization.
We show theoretically that the proposed blockwise adaptive gradient descent has a convergence rate comparable to that of its counterpart with coordinate-wise adaptive stepsize, but is faster by a constant factor.
We also study its uniform stability and show that blockwise adaptivity can lead to lower generalization error than coordinate-wise adaptivity.
Experimental results show that blockwise adaptive gradient descent converges faster and generalizes better than Nesterov's accelerated gradient and Adam.
\end{abstract}