\begin{abstract}
We present \textbf{SEED}, an elaborate image tokenizer that empowers Large Language Models (LLMs) with the emergent ability to \textbf{SEE} and \textbf{D}raw at the same time.
%
Research on image tokenizers has previously reached an impasse, as frameworks employing quantized visual tokens have lost prominence due to subpar performance and convergence in multimodal comprehension (compared to BLIP-2, etc.) or generation (compared to Stable Diffusion, etc.).
%
Despite these limitations, we remain confident in the natural capacity of quantized visual tokens to unify visual and textual representations, facilitating scalable multimodal training with the original LLM recipe.
%
In this study, we identify two crucial principles for the architecture and training of SEED that effectively ease subsequent alignment with LLMs.
% Arch
(1) Image tokens should be independent of 2D physical patch positions and instead be produced with a \textit{1D causal dependency}, exhibiting intrinsic interdependence that aligns with the left-to-right autoregressive prediction mechanism in LLMs.
% Training objective
(2) Image tokens should capture \textit{high-level semantics} consistent with the degree of semantic abstraction in words, and be optimized for both discriminativeness and reconstruction during the tokenizer training phase.
%
As a result, an off-the-shelf LLM is able to perform both image-to-text and text-to-image generation by incorporating SEED through efficient LoRA tuning.
%
Comprehensive multimodal pretraining and instruction tuning, which may yield improved results, are reserved for future investigation.
%
This version of SEED was trained in 5.7 days using only 64 V100 GPUs and 5M publicly available image-text pairs.
%
Our preliminary study emphasizes the great potential of discrete visual tokens in versatile multimodal LLMs and the importance of proper image tokenizers in broader research.
\end{abstract}