abstract:5c1b64b347ccb821.tex

1: \begin{abstract}

2: End-to-end models reach state-of-the-art performance

3: for speech recognition,

4: but global soft attention is not monotonic,

5: which might lead to convergence problems, to instability,

6: to bad generalisation,

7: cannot be used for online streaming,

8: and is also inefficient in calculation.

9: Monotonicity can potentially fix all of this.

10: There are several ad-hoc solutions or heuristics to introduce monotonicity,

11: but a principled introduction is rarely found in literature so far.

12: In this paper, we present a mathematically clean solution to introduce monotonicity,

13: by introducing a new latent variable which represents the audio position or segment boundaries.

14: We compare several monotonic latent models to our global soft attention baseline

15: such as a hard attention model, a local windowed soft attention model,

16: and a segmental soft attention model.

17: We can show that our monotonic models perform as good as the global soft attention model.

18: We perform our experiments on Switchboard 300h.

19: We carefully outline the details of our training and release our code and configs.

20: \end{abstract}

21: