\begin{abstract}
% 1000 characters. ASCII characters only. No citations.
In this paper, we provide a new perspective on self-supervised speech models based on how the training targets are obtained. 
We generalize the targets extractor into two types: Offline Targets Extractor (Off-TE) and Online Targets Extractor (On-TE). 
Based on this, we propose a new multi-task learning framework for self-supervised learning, MT4SSL, which stands for Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets. 
% MT4SSL refers to two typical models, HuBERT and data2vec, which use the K-means algorithm as an Off-TE and a teacher network without gradients as an On-TE, respectively. 
MT4SSL uses the K-means algorithm as an Off-TE and a teacher network without gradients as an On-TE. 
% Our model outperforms previous SSL methods by nontrivial margins on the LibriSpeech benchmark, and is comparable to or even better than the best-performing models with no need for that much data. 
Our model outperforms previous SSL methods by nontrivial margins on the LibriSpeech benchmark, and is comparable to or even better than the best-performing models while using less data. 
Furthermore, we find that using both Off-TE and On-TE results in better convergence in the pre-training phase. 
Given its effectiveness and efficiency, we believe that multi-task learning on self-supervised speech models from our perspective is a promising direction. Code is available at \url{https://github.com/ddlBoJack/MT4SSL}. 
\end{abstract}