
\begin{abstract}

Stochastic Gradient Descent (SGD) is the central workhorse for training modern CNNs. Although it gives impressive empirical performance, it can be slow to converge. In this paper we explore a novel alternation strategy for training a CNN that offers substantial speedups during training. We make the following contributions: (i) we replace the ReLU non-linearity within a CNN with positive hard-thresholding, (ii) we re-interpret this non-linearity as a binary state vector, making the entire CNN linear if the multi-layer support is known, and (iii) we demonstrate that under certain conditions a global optimum of the CNN can be found through local descent. We then employ a novel alternation strategy (between weights and support) for CNN training that leads to substantially faster convergence rates and nice theoretical properties, and achieves state-of-the-art results across large-scale datasets (e.g.\ ImageNet) as well as other standard benchmarks.

\end{abstract}
