\begin{abstract}
Over-parameterization is now ubiquitous in training neural networks,
benefiting both optimization, in seeking global optima, and generalization,
in reducing prediction error. However, compact networks are desired
in many real-world applications, and directly training small networks
may get trapped in local optima. In this paper, instead of pruning
or distilling over-parameterized models into compact ones, we
propose a new approach based on \emph{differential
inclusions of inverse scale spaces}. Specifically, it generates a family of models,
from simple to complex, by coupling a pair of parameters to simultaneously train an over-parameterized deep model and learn structural sparsity on the weights of fully connected and convolutional layers. Such a differential inclusion scheme has a simple discretization, proposed as
\textbf{De}ep \textbf{s}tructurally \textbf{s}pl\textbf{i}tting \textbf{L}inearized \textbf{B}regman \textbf{I}teration (DessiLBI), whose
global convergence in deep learning is established: from any initialization, the algorithmic
iterates converge to a critical point of the empirical risk.
Experimental evidence shows that DessiLBI achieves performance comparable to, or even better than, that of
competitive optimizers in exploring the structural sparsity of several widely used backbones on benchmark datasets. Remarkably, with \emph{early stopping}, DessiLBI unveils ``\emph{winning tickets}'' in early epochs: effective sparse structures with test accuracy comparable to that of fully trained over-parameterized models.
\end{abstract}
