abstract:a199a98bd9ff675e.tex

1: \begin{abstract}

2:

3: Recent research shows a strong convergence of model architectures, training objectives, and inference methods across tasks and modalities.

4: % In this paper, we unify the speech-to-text, text-to-text, text-to-speech, and speech-to-speech tasks to a conditional language model task and propose one auto-regressive Transformer decoder-only network, namely \textbf{\our{}},  by jointly modeling cross-modalities of speech and text with a multi-task learning framework.

5: In this paper, we propose \textbf{\our{}}, a single decoder-only auto-regressive Transformer that unifies various cross-modal tasks involving speech and text, such as speech-to-text, text-to-text, text-to-speech, and speech-to-speech, as one conditional codec language modeling task trained with a multi-task learning framework.

6: To accomplish this, we first convert all speech utterances into discrete tokens, analogous to text tokens, using an offline neural codec encoder.

7: In this way, all of these tasks become token-based sequence conversion problems that a single conditional language model can handle naturally.

8: %We accomplish this using a Transformer-based decoder-only network, optimized with a multi-task learning framework.

9: We further integrate task IDs (TID) and language IDs (LID) into the proposed model to improve its ability to handle different languages and tasks.

10: %With the unified modality, model structure, and training objective, involved tasks can boost each other by sharing the parameters and leveraging all the training data.

11: Experimental results demonstrate that the proposed \our{} model supports both single-modal and cross-modal tasks well, and that the decoder-only model achieves performance comparable to, and in some cases better than, strong baselines.

12:

13:

14: \end{abstract}

15: