
\begin{abstract}


The ResNet structure has achieved great empirical success since its debut. Recent work established the convergence of learning over-parameterized ResNet with a scaling factor $\tau=1/L$ on the residual branch, where $L$ is the network depth. However, it is not clear how learning ResNet behaves for other values of $\tau$. In this paper, we fully characterize the convergence theory of gradient descent for learning over-parameterized ResNet with different values of $\tau$. Specifically, hiding logarithmic factors and constant coefficients, we show that for $\tau\le 1/\sqrt{L}$ gradient descent is guaranteed to converge to the global minima, and in particular, when $\tau\le 1/L$, the convergence is independent of the network depth. Conversely, we show that for $\tau>L^{-\frac{1}{2}+c}$, the forward output grows at least at rate $L^c$ in expectation, and learning then fails because of gradient explosion for large $L$. %explodes with high probability for large $L$.

This means that the bound $\tau\le 1/\sqrt{L}$ is sharp for learning ResNet with arbitrary depth. To the best of our knowledge, this is the first work to study learning ResNet over the full range of $\tau$. % Our experiments corroborate the theoretical findings nicely.


\end{abstract}
