abstract:e8c48226b4225e16.tex

1: \begin{abstract}

2:

3: Layer normalization (LayerNorm) has been successfully applied to various deep neural networks, where it helps stabilize training and speed up convergence because it handles re-centering and re-scaling of both the inputs and the weight matrix.

4: However, the computational overhead introduced by LayerNorm makes these improvements expensive and significantly slows the underlying network, RNNs in particular.

5: In this paper, we hypothesize that re-centering invariance in LayerNorm is dispensable and propose root mean square layer normalization, or \textit{RMSNorm}. RMSNorm regularizes the summed inputs to a neuron in one layer according to their root mean square (RMS), giving the model a re-scaling invariance property and an implicit learning rate adaptation ability.
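
As a sketch of the operation described above (the symbols $\mathbf{a}$, $n$, and $g_i$ are notation assumed here for illustration), let $\mathbf{a} = (a_1, \dots, a_n)$ be the summed inputs to a layer and $g_i$ a learned gain:
\[
  \bar{a}_i = \frac{a_i}{\mathrm{RMS}(\mathbf{a})}\, g_i,
  \qquad
  \mathrm{RMS}(\mathbf{a}) = \sqrt{\frac{1}{n}\sum_{j=1}^{n} a_j^{2}}.
\]
Since $\mathrm{RMS}(\alpha\mathbf{a}) = |\alpha|\,\mathrm{RMS}(\mathbf{a})$, rescaling the summed inputs by any $\alpha > 0$ leaves $\bar{a}_i$ unchanged, and no mean is computed or subtracted.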

6: RMSNorm is computationally simpler and thus more efficient than LayerNorm.

7: We also present partial RMSNorm, or \textit{$p$RMSNorm}, where the RMS is estimated from $p$\% of the summed inputs without breaking the above properties.
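
A sketch of the partial estimate, assuming that the first $k = \lceil n p / 100 \rceil$ of the $n$ summed inputs $a_1, \dots, a_n$ are used (the choice of subset and the symbols $k$, $n$, and $a_j$ are assumptions made here for illustration):
\[
  \mathrm{RMS}_p(\mathbf{a}) = \sqrt{\frac{1}{k}\sum_{j=1}^{k} a_j^{2}}.
\]
With $p = 100$ this reduces to the RMS over all $n$ summed inputs.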

8: Extensive experiments on several tasks using diverse network architectures show that RMSNorm achieves performance comparable to LayerNorm while reducing the running time by 7\%$\sim$64\%, depending on the model. Source code is available at \url{https://github.com/bzhangGo/rmsnorm}.

9: \end{abstract}

10: