1: \begin{abstract}
2: Foundation models have received much attention due to their effectiveness across a broad range of downstream applications.
3: Although model architectures have largely converged, most pretrained models are still developed for specific tasks or modalities.
4: In this work, we propose to use language models as a general-purpose interface to various foundation models.
5: A collection of pretrained encoders perceives diverse modalities (such as vision and language) and docks with a language model that plays the role of a universal task layer.
6: We propose a semi-causal language modeling objective to jointly pretrain the interface and the modular encoders.
7: This objective subsumes the advantages and capabilities of both causal and non-causal modeling, thereby combining the best of both worlds.
8: Specifically, the proposed method not only inherits the capabilities of in-context learning and open-ended generation from causal language modeling, but is also conducive to finetuning because of the bidirectional encoders.
9: More importantly, our approach seamlessly unlocks combinations of the above capabilities, e.g., enabling in-context learning or instruction following with finetuned encoders.
10: Experimental results across various language-only and vision-language benchmarks show that our model outperforms or is competitive with specialized models on finetuning, zero-shot generalization, and few-shot learning.
11: \end{abstract}
12: