
\begin{abstract}

Visual grounding is a common vision task that involves grounding descriptive sentences to the corresponding regions of an image.

Most existing methods use independent image-text encoding and apply complex hand-crafted modules or encoder-decoder architectures for modal interaction and query reasoning.

However, their performance drops significantly when dealing with complex textual expressions.

The reason is that this paradigm uses only limited downstream data to fit the multimodal feature fusion, so it is effective only when the textual expressions are relatively simple.

In contrast, given the wide diversity of textual expressions and the uniqueness of downstream training data, the existing fusion module, which extracts multimodal content from a visual-linguistic context, has not been fully investigated.

In this paper, we present a simple yet robust transformer-based framework, SimVG, for visual grounding.

Specifically, we decouple visual-linguistic feature fusion from downstream tasks by leveraging existing multimodal pre-trained models and incorporating additional object tokens to facilitate deep integration of downstream and pre-training tasks.

Furthermore, we design a dynamic weight-balance distillation method in the multi-branch synchronous learning process to enhance the representation capability of the simpler branch.

This branch consists only of a lightweight MLP, which simplifies the structure and improves inference speed.

Experiments on six widely used VG datasets, \textit{i.e.}, RefCOCO/+/g, ReferIt, Flickr30K, and GRefCOCO, demonstrate the superiority of SimVG.

Finally, the proposed method not only achieves improvements in efficiency and convergence speed but also attains new state-of-the-art performance on these benchmarks.

Code and models will be available at \url{https://github.com/Dmmm1997/SimVG}.

\end{abstract}
