\begin{abstract}
	Over-parameterization is now ubiquitous in training neural networks,
	benefiting both optimization, in seeking global optima, and generalization,
	in reducing prediction error. However, compressive networks are desired
	in many real-world applications, and direct training of small networks
	may become trapped in local optima. In this paper, instead of pruning
	or distilling over-parameterized models into compressive ones, we
	propose a new approach based on \emph{differential inclusions of inverse scale spaces}.
	Specifically, it generates a family of models, from simple to complex,
	\textcolor{black}{by coupling a pair of parameters to simultaneously train over-parameterized deep models and learn structural sparsity} on the weights of fully connected and convolutional layers. This differential inclusion scheme admits a simple discretization, which we call
	\textbf{De}ep \textbf{s}tructurally \textbf{s}pl\textbf{i}tting \textbf{L}inearized \textbf{B}regman \textbf{I}teration (DessiLBI).
	We establish its global convergence in deep learning: from any initialization, the
	iterates converge to a critical point of the empirical risk.
	Experimental evidence shows that DessiLBI achieves performance comparable to, or even better than,
	that of competitive optimizers in exploring the structural sparsity of several widely used backbones on benchmark datasets. Remarkably, with \emph{early stopping}, DessiLBI unveils \textcolor{black}{``\emph{winning tickets}''} in early epochs: effective sparse structures whose test accuracy is comparable to that of fully trained over-parameterized models.
\end{abstract}