abstract:a8d1d78cbeb1f47f.tex

1: \begin{abstract}

2: Foundation models have received much attention due to their effectiveness across a broad range of downstream applications.

3: Although architectures have largely converged, most pretrained models are still developed for specific tasks or modalities.

4: In this work, we propose to use language models as a general-purpose interface to various foundation models.

5: A collection of pretrained encoders perceives diverse modalities (such as vision and language) and docks with a language model that plays the role of a universal task layer.

6: We propose a semi-causal language modeling objective to jointly pretrain the interface and the modular encoders.

7: The approach subsumes the advantages and capabilities of both causal and non-causal modeling, thereby combining the best of both worlds.

8: Specifically, the proposed method not only inherits the capabilities of in-context learning and open-ended generation from causal language modeling, but is also conducive to finetuning because of the bidirectional encoders.

9: More importantly, our approach unlocks combinations of these capabilities, e.g., in-context learning or instruction following with finetuned encoders.

10: Experimental results across various language-only and vision-language benchmarks show that our model outperforms or is competitive with specialized models on finetuning, zero-shot generalization, and few-shot learning.

11: \end{abstract}

12: