\begin{abstract}
Transformers have shown promising progress in various visual object detection tasks, including monocular 2D/3D detection and surround-view 3D detection. %
\polish{%
More importantly, the attention mechanism in the Transformer model and the image correspondence in binocular stereo are both similarity-based. %
However, directly applying existing Transformer-based detectors to binocular stereo 3D object detection leads to slow convergence and significant precision drops. %
We argue that a key cause of this deficiency is that existing Transformers ignore the stereo-specific image correspondence information. %
%
}%
In this paper, we explore the model design of Transformers for binocular 3D object detection, focusing particularly on extracting and encoding the task-specific image correspondence information. %
To this end, we present TS3D, a Transformer-based Stereo-aware 3D object detector. %
In TS3D, a Disparity-Aware Positional Encoding (DAPE) module is proposed to embed the image correspondence information into stereo features. %
\polish{%
The correspondence is encoded as normalized sub-pixel-level disparity and is used in conjunction with sinusoidal 2D positional encoding to provide the 3D location information of the scene. %
}%
\revise{%
To extract enriched multi-scale stereo features, we propose a Stereo Preserving Feature Pyramid Network (SPFPN). %
The SPFPN is designed to preserve the correspondence information while fusing intra-scale stereo features and aggregating cross-scale stereo features. %
}%
Our proposed TS3D achieves 41.29\% average precision for Moderate Car detection on the KITTI test set and takes 88 ms to detect objects from each binocular image pair.
It is competitive with advanced counterparts in terms of both precision and inference speed. %
\end{abstract}