abstract:741d027e51657594.tex

1: \begin{abstract}

2: Recently, DeepNorm scales Transformers into extremely deep ({\em i.e.,} 1000 layers) and reveals the promising potential of deep scaling.

3: To stabilize the training of deep models, DeepNorm~\citep{deepnet_2022} attempts to constrain the model update to a constant value.

4: Although applying such a constraint can benefit the early stage of model training, it may lead to undertrained models during the whole training procedure.

5: In this paper, we propose BranchNorm, which dynamically rescales the non-residual branch of Transformer in accordance with the training period.

6: BranchNorm not only theoretically stabilizes the training with smooth gradient norms at the early stage, but also encourages better convergence in the subsequent training stage.

7: Experiment results on multiple translation tasks demonstrate that BranchNorm achieves a better trade-off between training stability and converge performance.

8: % ({\em i.e.,} 0.6 BLEU improvement over DeepNorm on the WMT2014 En-Fr dataset).

9: \end{abstract}

10: