1: \begin{abstract}
2:
3: Robustness and discrimination power are two fundamental requirements in visual object tracking.
4: %
5: In most tracking paradigms, we find that the features extracted by the popular Siamese-like networks cannot fully discriminate between the tracked targets and distractor objects, which prevents them from meeting these two requirements simultaneously.
6: %
7: While most methods focus on designing robust correlation operations, we propose a novel target-dependent feature network inspired by the self-/cross-attention scheme.
8: %
9: In contrast to the Siamese-like feature extraction,
10: %
11: our network deeply embeds cross-image feature correlation in multiple layers of the feature network.
12: %
13: By extensively matching the features of the two images through multiple layers, it is able to suppress non-target features, resulting in instance-varying feature extraction.
14: %
15: The output features of the search image can be directly used for predicting target locations without an extra correlation step.
16: %
17: Moreover, our model can be flexibly pre-trained on abundant unpaired images, leading to notably faster convergence than existing methods.
18: %
19: Extensive experiments show that our method achieves state-of-the-art results while running in real time. Our feature networks can also be applied seamlessly to existing tracking pipelines to improve tracking performance. %
20: Code will be available.
21:
22: \end{abstract}
23: