
\begin{abstract}

    Self-supervised acoustic pre-training has achieved impressive results on low-resource speech recognition tasks, which indicates that the pretrain-and-finetune paradigm is a promising direction.

    In this work, we propose an end-to-end model for low-resource speech recognition that fuses a pre-trained audio encoder (wav2vec2.0) and a pre-trained text decoder (BERT). The two modules are connected by a parameter-free linear attention mechanism, and a fully connected layer is introduced to map hidden representations between the speech and language modalities.

    In addition, we design a fine-tuning strategy that preserves and exploits the text context modeling ability of the pre-trained decoder. Equipped with this strategy, our model converges distinctly faster and achieves better performance.

    % make the two separately pre-trained modules work well as a whole.

    On the CALLHOME corpus (15 hours), our model achieves recognition performance comparable to that of state-of-the-art (SOTA) pipeline modeling.

\end{abstract}
