\begin{abstract}
Sequence-to-sequence attention-based models on subword units
allow simple open-vocabulary end-to-end speech recognition.
In this work, we show that such models can achieve competitive
results on the Switchboard 300h and LibriSpeech 1000h tasks.
In particular, we report state-of-the-art word error rates (WER)
of 3.54\% on the dev-clean and 3.82\% on the test-clean evaluation subsets of LibriSpeech.
We introduce a new pretraining scheme that starts with
a high time reduction factor and lowers it during training,
which is crucial both for convergence and for final performance.
In some experiments, we also use an auxiliary connectionist temporal
classification (CTC) loss function to help convergence.
In addition, we train long short-term
memory (LSTM) language models on subword units.
With shallow fusion, we report up to 27\% relative improvement in WER
over the attention baseline without a language model.
\end{abstract}