\begin{abstract}

We present \textbf{SEED}, an elaborate image tokenizer that empowers Large Language Models (LLMs) with the emergent ability to \textbf{SEE} and \textbf{D}raw at the same time.
%
Research on image tokenizers has previously reached an impasse, as frameworks employing quantized visual tokens have lost prominence due to subpar performance and convergence in multimodal comprehension (compared to BLIP-2, etc.) or generation (compared to Stable Diffusion, etc.).
%
Despite these limitations, we remain confident in the natural capacity of discrete visual tokens to unify visual and textual representations, facilitating scalable multimodal training with the LLM's original recipe.
%
In this study, we identify two crucial principles for the architecture and training of SEED that effectively ease subsequent alignment with LLMs.
% Arch
(1) Image tokens should be independent of 2D physical patch positions and instead be produced with a \textit{1D causal dependency}, exhibiting intrinsic interdependence that aligns with the left-to-right autoregressive prediction mechanism in LLMs.
% Training objective
(2) Image tokens should capture \textit{high-level semantics} consistent with the degree of semantic abstraction in words, and be optimized for both discriminativeness and reconstruction during the tokenizer training phase.
%
As a result, an off-the-shelf LLM can perform both image-to-text and text-to-image generation by incorporating SEED through efficient LoRA tuning.
%
Comprehensive multimodal pretraining and instruction tuning, which may yield improved results, are reserved for future investigation.
%
This version of SEED was trained in 5.7 days using only 64 V100 GPUs and 5M publicly available image-text pairs.
%
Our preliminary study emphasizes the great potential of discrete visual tokens in versatile multimodal LLMs and the importance of proper image tokenizers in broader research.
\end{abstract}
