\begin{abstract}
    Transformers have shown promising progress in various visual object detection tasks, including monocular 2D/3D detection and surround-view 3D detection. %
	\polish{%
	Notably, the attention mechanism in the Transformer model and the image correspondence in binocular stereo are both similarity-based. %
	However, directly applying existing Transformer-based detectors to binocular stereo 3D object detection leads to slow convergence and significant precision drops. %
	We argue that a key cause of this defect is that existing Transformers ignore the stereo-specific image correspondence information. %
	%
	}%
	In this paper, we explore the design of Transformers for binocular 3D object detection, focusing particularly on extracting and encoding the task-specific image correspondence information. %
	To this end, we present TS3D, a Transformer-based Stereo-aware 3D object detector. %
	In TS3D, a Disparity-Aware Positional Encoding (DAPE) module is proposed to embed the image correspondence information into stereo features. %
	\polish{%
	The correspondence is encoded as normalized sub-pixel-level disparity and is used in conjunction with sinusoidal 2D positional encoding to provide the 3D location information of the scene. %
	}%
	\revise{%
	To extract enriched multi-scale stereo features, we propose a Stereo Preserving Feature Pyramid Network (SPFPN). %
	The SPFPN is designed to preserve the correspondence information while fusing intra-scale and aggregating cross-scale stereo features. %
	}%
	Our proposed TS3D achieves 41.29\% average precision for Moderate Car detection on the KITTI test set and takes 88 ms to detect objects from each binocular image pair.
	It is competitive with advanced counterparts in terms of both precision and inference speed. %
\end{abstract}
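A minimal sketch of the geometry behind the disparity encoding, assuming a rectified stereo pair with focal length $f$ (in pixels), baseline $b$, and principal point $(c_u, c_v)$; these symbols, the maximum disparity $d_{\max}$, the temperature $T$, and the channel count $C$ are illustrative assumptions, not notation taken from TS3D. For a left-image pixel $(u, v)$ with sub-pixel disparity $d > 0$, the standard pinhole stereo model gives
\[
  Z = \frac{f\,b}{d}, \qquad
  X = \frac{(u - c_u)\,Z}{f}, \qquad
  Y = \frac{(v - c_v)\,Z}{f},
\]
so the triple $(u, v, d)$ determines a 3D point, which is why pairing a 2D positional encoding with a disparity term can convey 3D location. One plausible way to encode the normalized disparity $\hat{d} = d / d_{\max}$ alongside the 2D coordinates is a sinusoidal embedding:
\[
  \mathrm{PE}_{2i}(\hat{d}) = \sin\!\left(\frac{2\pi\,\hat{d}}{T^{2i/C}}\right), \qquad
  \mathrm{PE}_{2i+1}(\hat{d}) = \cos\!\left(\frac{2\pi\,\hat{d}}{T^{2i/C}}\right), \qquad
  i = 0, \dots, C/2 - 1.
\]
The exact normalization, and how the disparity channels are combined with the 2D encoding in DAPE, may differ from this sketch.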