
\begin{abstract}
We introduce \our{}, a Multimodal Large Language Model (MLLM) with two new capabilities: perceiving object descriptions (\eg, bounding boxes) and grounding text to the visual world.
Specifically, we represent referring expressions as Markdown links, \ie, ``\texttt{[text span](bounding boxes)}'', where object descriptions are sequences of location tokens.
We construct a large-scale dataset of grounded image-text pairs (called \textsc{GrIT}) and use it together with multimodal corpora to train the model.
In addition to the existing capabilities of MLLMs (\eg, perceiving general modalities, following instructions, and performing in-context learning), \our{} integrates the grounding capability into downstream applications.
We evaluate \our{} on a wide range of tasks, including (i) multimodal grounding, such as referring expression comprehension and phrase grounding, (ii) multimodal referring, such as referring expression generation, (iii) perception-language tasks, and (iv) language understanding and generation.
This work lays the foundation for the development of embodied AI and sheds light on the big convergence of language, multimodal perception, action, and world modeling, which is a key step toward artificial general intelligence.
Code and pretrained models are available at \url{https://aka.ms/kosmos-2}.
\end{abstract}
