1: \begin{abstract}
2: Recurrent Neural Networks (RNNs) are powerful models for sequential data that
3: have the potential to learn long-term dependencies. However, they are computationally
4: expensive to train and difficult to parallelize. Recent work has shown that
5: normalizing intermediate representations of neural networks can significantly
6: improve convergence rates in feedforward neural networks \cite{ioffe2015}. In
7: particular, batch normalization, which uses mini-batch statistics to standardize
8: features, was shown to significantly reduce training time. In this paper, we
9: show that applying batch normalization to the hidden-to-hidden transitions of our
10: RNNs doesn't help the training procedure. We also show that when applied to the
11: input-to-hidden transitions, batch normalization can lead to a faster convergence of the
12: training criterion but doesn't seem to improve the generalization performance
13: on both our language modelling and speech recognition tasks.
14: All in all, applying batch normalization to RNNs turns out to be more
15: challenging than applying it to feedforward networks, but certain variants of it
16: can still be beneficial.
17: \end{abstract}
18: