abstract:fcdbfbd7766583bf.tex

1: \begin{abstract}

2: Existing vision-language models (VLMs) mostly rely on vision encoders to extract visual features, followed by large language models (LLMs), for vision-language tasks.

3: However, the vision encoders set a strong inductive bias in abstracting visual representation, \eg, resolution, aspect ratio, and semantic priors, which could impede the flexibility and efficiency of the VLMs.

4: Training pure VLMs that accept vision and language inputs seamlessly, \ie, without vision encoders, remains challenging and rarely explored.

5: Empirical observations reveal that direct training without encoders results in slow convergence and large performance gaps.

6: In this work, we bridge the gap between encoder-based and encoder-free models, and present a simple yet effective training recipe towards pure VLMs.

7: Specifically, we unveil the key aspects of training encoder-free VLMs efficiently via thorough experiments:

8: (1) Bridging vision-language representation inside one unified decoder;

9: (2) Enhancing visual recognition capability via extra supervision.

10: With these strategies, we launch \textbf{EVE}, an encoder-free vision-language model that can be trained and forwarded efficiently.

11: Notably, using only 35M publicly accessible data, \textbf{EVE} can rival encoder-based VLMs of similar capacity across multiple vision-language benchmarks.

12: It significantly outperforms its counterpart Fuyu-8B~\cite{VLM:Fuyu-8b}, whose training procedures and training data are undisclosed.

13: We believe that \textbf{EVE} provides a transparent and efficient route for developing pure decoder-only architectures across modalities.

14: \begin{tikzpicture}[remember picture,overlay,shift={(current page.north west)}]

15: \node[anchor=north west,xshift=3.2cm,yshift=-3cm]{\includegraphics[height=0.1\textwidth]{figures/eve_log.png}};

16: \end{tikzpicture}

17: \end{abstract}

18: