5dab5a3e267a33d2.tex
1: \begin{abstract}
2:     Self-supervised acoustic pre-training has achieved impressive results on low-resource speech recognition tasks, which indicates that the pre-train-and-fine-tune paradigm is a promising direction.
3:     In this work, we propose an end-to-end model for low-resource speech recognition that fuses a pre-trained audio encoder (wav2vec2.0) and a pre-trained text decoder (BERT). The two modules are connected by a parameter-free linear attention mechanism, and a fully connected layer maps hidden representations from the speech modality to the language modality.
4:     In addition, we design an effective fine-tuning strategy that preserves and exploits the text context modeling ability of the pre-trained decoder. Equipped with this strategy, our model converges distinctly faster and achieves better performance.
5:     % make the two separately pre-trained modules work well as a whole.
6:     On the CALLHOME corpus (15 hours), our model achieves recognition performance comparable to that of the state-of-the-art (SOTA) pipeline approach.
7: \end{abstract}
8: