1: \begin{abstract}
2: End-to-end models reach state-of-the-art performance
3: for speech recognition,
4: but global soft attention is not monotonic,
5: which might lead to convergence problems, to instability,
6: to bad generalisation,
7: cannot be used for online streaming,
8: and is also inefficient in calculation.
9: Monotonicity can potentially fix all of this.
10: There are several ad-hoc solutions or heuristics to introduce monotonicity,
11: but a principled introduction is rarely found in literature so far.
12: In this paper, we present a mathematically clean solution to introduce monotonicity,
13: by introducing a new latent variable which represents the audio position or segment boundaries.
14: We compare several monotonic latent models to our global soft attention baseline
15: such as a hard attention model, a local windowed soft attention model,
16: and a segmental soft attention model.
17: We can show that our monotonic models perform as good as the global soft attention model.
18: We perform our experiments on Switchboard 300h.
19: We carefully outline the details of our training and release our code and configs.
20: \end{abstract}
21: