\begin{abstract}
	Over-parameterization is now ubiquitous in training neural networks,
	as it benefits both optimization, in seeking global optima, and generalization,
	in reducing prediction error. However, compressive networks are desired
	in many real-world applications, and direct training of small networks
	may be trapped in local optima. In this paper, instead of pruning
	or distilling over-parameterized models into compressive ones, we
	propose a new approach based on \emph{differential inclusions of inverse scale spaces}.
	Specifically, it generates a family of models
	from simple to complex ones by \textcolor{black}{coupling a pair of parameters to simultaneously train over-parameterized deep models and learn structural sparsity} on the weights of fully connected and convolutional layers. Such a differential inclusion scheme has a simple discretization, proposed as
	\textbf{De}ep \textbf{s}tructurally \textbf{s}pl\textbf{i}tting \textbf{L}inearized \textbf{B}regman \textbf{I}teration (DessiLBI), for which we
	establish a global convergence analysis in deep learning: from any initialization, the algorithmic
	iterates converge to a critical point of the empirical risk.
	Experimental evidence shows that DessiLBI achieves performance comparable to, and even better than, that of
	competitive optimizers in exploring the structural sparsity of several widely used backbones on benchmark datasets. Remarkably, with \emph{early stopping}, DessiLBI unveils \textcolor{black}{``\emph{winning tickets}''} in early epochs: effective sparse structures with test accuracy comparable to that of fully trained over-parameterized models.
\end{abstract}