1: \begin{abstract}
2: Existing vision-language models (VLMs) mostly rely on vision encoders to extract visual features, followed by large language models (LLMs) for vision-language tasks.
3: However, vision encoders impose strong inductive biases when abstracting visual representations, \eg, resolution, aspect ratio, and semantic priors, which could impede the flexibility and efficiency of VLMs.
4: Training pure VLMs that accept vision and language inputs seamlessly, \ie, without vision encoders, remains challenging and rarely explored.
5: Empirical observations reveal that direct training without encoders results in slow convergence and large performance gaps.
6: In this work, we bridge the gap between encoder-based and encoder-free models, and present a simple yet effective training recipe towards pure VLMs.
7: Specifically, we unveil the key aspects of training encoder-free VLMs efficiently via thorough experiments:
8: (1) Bridging vision-language representation inside one unified decoder;
9: (2) Enhancing visual recognition capability via extra supervision.
10: With these strategies, we launch \textbf{EVE}, an encoder-free vision-language model that can be trained and forwarded efficiently.
11: Notably, using only 35M publicly accessible data samples, \textbf{EVE} rivals encoder-based VLMs of similar capacity across multiple vision-language benchmarks.
12: It significantly outperforms its counterpart Fuyu-8B~\cite{VLM:Fuyu-8b}, whose training procedures and training data are undisclosed.
13: We believe that \textbf{EVE} provides a transparent and efficient route for developing a pure decoder-only architecture across modalities.
14: \begin{tikzpicture}[remember picture,overlay,shift={(current page.north west)}]
15: \node[anchor=north west,xshift=3.2cm,yshift=-3cm]{\includegraphics[height=0.1\textwidth]{figures/eve_log.png}};
16: \end{tikzpicture}
17: \end{abstract}
18: