a199a98bd9ff675e.tex
1: \begin{abstract}
2: 
3: Recent research shows a strong convergence of model architectures, training objectives, and inference methods across tasks and modalities.
4: % In this paper, we unify the speech-to-text, text-to-text, text-to-speech, and speech-to-speech tasks to a conditional language model task and propose one auto-regressive Transformer decoder-only network, namely \textbf{\our{}},  by jointly modeling cross-modalities of speech and text with a multi-task learning framework. 
5: In this paper, we propose \textbf{\our{}}, a single auto-regressive Transformer decoder-only network that unifies various cross-modal tasks involving speech and text, such as speech-to-text, text-to-text, text-to-speech, and speech-to-speech, as a conditional codec language model task via a multi-task learning framework.
6: To accomplish this, we first convert all speech utterances into discrete tokens (analogous to text tokens) using an offline neural codec encoder.
7: In this way, all of these tasks become token-based sequence conversion problems, which can be handled naturally by a single conditional language model.
8: %We accomplish this using a Transformer-based decoder-only network, optimized with a multi-task learning framework. 
9: We further integrate task IDs (TID) and language IDs (LID) into the proposed model to improve its ability to handle different languages and tasks.
10: %With the unified modality, model structure, and training objective, involved tasks can boost each other by sharing the parameters and leveraging all the training data.
11: Experimental results demonstrate that the proposed \our{} model supports both single-modal and cross-modal tasks well, and that the decoder-only model achieves performance comparable to, or even better than, that of strong baselines.
12: 
13: 
14: \end{abstract}
15: