\begin{abstract}
Sequence-to-sequence attention-based models on subword units
allow simple open-vocabulary end-to-end speech recognition.
In this work, we show that such models can achieve competitive
results on the Switchboard 300h and LibriSpeech 1000h tasks.
In particular, we report state-of-the-art word error rates (WER)
of 3.54\% on the dev-clean and 3.82\% on the test-clean evaluation subsets of LibriSpeech.
We introduce a new pretraining scheme that starts with
a high time reduction factor and lowers it during training,
which is crucial both for convergence and for final performance.
In some experiments, we also use an auxiliary CTC loss function
to help convergence. In addition, we train long short-term
memory (LSTM) language models on subword units.
With shallow fusion, we report up to 27\% relative improvement in WER
over the attention baseline without a language model.
\end{abstract}