741d027e51657594.tex
1: \begin{abstract}
2: Recently, DeepNorm scales Transformers into extremely deep ({\em i.e.,} 1000 layers) and reveals the promising potential of deep scaling. 
3: To stabilize the training of deep models, DeepNorm~\citep{deepnet_2022} attempts to constrain the model update to a constant value. 
4: Although applying such a constraint can benefit the early stage of model training, it may lead to undertrained models during the whole training procedure. 
5: In this paper, we propose BranchNorm, which dynamically rescales the non-residual branch of Transformer in accordance with the training period. 
6: BranchNorm not only theoretically stabilizes the training with smooth gradient norms at the early stage, but also encourages better convergence in the subsequent training stage. 
7: Experiment results on multiple translation tasks demonstrate that BranchNorm achieves a better trade-off between training stability and converge performance.
8: % ({\em i.e.,} 0.6 BLEU improvement over DeepNorm on the WMT2014 En-Fr dataset).
9: \end{abstract}
10: