
\begin{abstract}


%In the world of deep learning, learning model works as an orchestra. Each layer works as a single player contributing to the success of the final model. Each layer has a certain job of extracting some aspect of information from the data set. Normally, scientists use loss function to force each layer evaluating to desired state. However, the situation cannot be avoided in which some layer may work through useless path. Such waste can be compensated by more training epochs. The computation and time cost maybe huge. So here, we introduce a trigger layer, just the mirror of the target layer to help the training of original model.


Minibatch stochastic gradient descent (SGD) is widely used in deep learning because its efficiency and scalability make it possible to train deep networks on large volumes of data. In the distributed setting in particular, SGD is usually run with a large batch size. However, compared with small-batch SGD, neural network models trained with large-batch SGD often generalize poorly, i.e., they reach lower validation accuracy. In this work, we introduce a novel regularization technique, distinctive regularization (DReg), which replicates a chosen layer of the deep network and encourages the parameters of the original layer and its replica to be diverse. DReg introduces very little computational overhead. Moreover, we show empirically that optimizing a neural network with DReg using large-batch SGD significantly accelerates convergence and improves generalization performance. We also demonstrate that DReg accelerates the convergence of large-batch SGD with momentum. We believe that DReg can serve as a simple regularization trick for accelerating large-batch training in deep learning.

%a significant boost from DReg for training with momentum and plain SGD, where we also observe a boosted momentum.

\end{abstract}
