\begin{abstract}
	Transformers have shown promising progress in various visual object detection tasks, including monocular 2D/3D detection and surround-view 3D detection. %
	\polish{%
	More importantly, the attention mechanism in the Transformer model and the image correspondence in binocular stereo are both similarity-based. %
	However, directly applying existing Transformer-based detectors to binocular stereo 3D object detection leads to slow convergence and significant precision drops. %
	We argue that a key cause of this deficiency is that existing Transformers ignore the stereo-specific image correspondence information. %
	%
	}%
	In this paper, we explore the model design of Transformers for binocular 3D object detection, focusing particularly on extracting and encoding the task-specific image correspondence information. %
	To achieve this goal, we present TS3D, a Transformer-based Stereo-aware 3D object detector. %
	In TS3D, a Disparity-Aware Positional Encoding (DAPE) module is proposed to embed the image correspondence information into stereo features. %
	\polish{%
	The correspondence is encoded as normalized sub-pixel-level disparity and is used in conjunction with sinusoidal 2D positional encoding to provide the 3D location information of the scene. %
	}%
	\revise{%
	To extract enriched multi-scale stereo features, we propose a Stereo Preserving Feature Pyramid Network (SPFPN). %
	The SPFPN is designed to preserve the correspondence information while fusing intra-scale and aggregating cross-scale stereo features. %
	}%
	Our proposed TS3D achieves 41.29\% average precision for Moderate Car detection on the KITTI test set and takes 88 ms to detect objects from each binocular image pair. %
	It is competitive with advanced counterparts in terms of both precision and inference speed. %
\end{abstract}