\begin{abstract}
Audio-Visual Speech Recognition (AVSR) seeks to model, and thereby exploit, the dynamic relationship between a human voice and the corresponding mouth movements. A recently proposed multimodal fusion strategy, \emph{AV Align}, based on state-of-the-art sequence-to-sequence neural networks, attempts to model this relationship by explicitly aligning the acoustic and visual representations of speech. This study investigates the inner workings of \emph{AV Align} and visualises the audio-visual alignment patterns. Our experiments are performed on two of the largest publicly available AVSR datasets, TCD-TIMIT and LRS2. We find that \emph{AV Align} learns to align acoustic and visual representations of speech at the frame level on TCD-TIMIT in a generally monotonic pattern. We also identify why \emph{AV Align} initially yields no improvement over audio-only speech recognition on the more challenging LRS2 dataset. We propose a regularisation method which involves predicting lip-related Action Units from visual representations. Our regularisation method leads to better exploitation of the visual modality, with performance improvements between 7\% and 30\% depending on the noise level. Furthermore, we show that the alternative \emph{Watch, Listen, Attend, and Spell} network is affected by the same problem as \emph{AV Align}, and that our proposed approach can effectively help it learn visual representations.
Our findings validate the suitability of the regularisation method for AVSR and encourage researchers to rethink the multimodal convergence problem when one modality is dominant.
\end{abstract}