\begin{abstract}
Visual grounding is a common vision task that involves grounding descriptive sentences to the corresponding regions of an image.
Most existing methods use independent image-text encoding and apply complex hand-crafted modules or encoder-decoder architectures for modal interaction and query reasoning.
However, their performance drops significantly when dealing with complex textual expressions.
This is because this paradigm relies only on limited downstream data to fit the multimodal feature fusion, so it is effective only when the textual expressions are relatively simple.
Moreover, given the wide diversity of textual expressions and the uniqueness of downstream training data, how existing fusion modules extract multimodal content from a visual-linguistic context has not been fully investigated.
In this paper, we present a simple yet robust transformer-based framework, SimVG, for visual grounding.
Specifically, we decouple visual-linguistic feature fusion from downstream tasks by leveraging existing multimodal pre-trained models and incorporating additional object tokens to facilitate deep integration of downstream and pre-training tasks.
Furthermore, we design a dynamic weight-balance distillation method in the multi-branch synchronous learning process to enhance the representation capability of the simpler branch.
This branch consists only of a lightweight MLP, which simplifies the structure and improves inference speed.
Experiments on six widely used VG datasets, \textit{i.e.}, RefCOCO/+/g, ReferIt, Flickr30K, and GRefCOCO, demonstrate the superiority of SimVG.
The proposed method not only improves efficiency and convergence speed but also attains new state-of-the-art performance on these benchmarks.
Code and models will be available at \url{https://github.com/Dmmm1997/SimVG}.
\end{abstract}
