2b49d639690bcdd4.tex
1: \begin{abstract}
2: % 
3: 3D dense captioning requires a model to translate its understanding of an input 3D scene into several captions associated with different object regions.
4: % 
5: Existing methods adopt a sophisticated ``detect-then-describe'' pipeline, which builds explicit relation modules upon a 3D detector with numerous hand-crafted components.
6: % 
7: While these methods have achieved initial success, the cascaded pipeline tends to accumulate errors because of duplicate and inaccurate box estimations and cluttered 3D scenes.
8: % 
9: In this paper, we first propose Vote2Cap-DETR, a simple yet effective transformer framework that decouples the decoding processes of caption generation and object localization through parallel decoding.
10: % 
11: % We show that the sophisticated and explicit relation reasoning modules can be replaced by the attention mechanism to capture both object-object and object-scene relations.
12: % 
13: \whatsnew{
14: Moreover, we argue that object localization and description generation require different levels of scene understanding, which could be challenging for a shared set of queries to capture.
15: % 
16: To this end, we propose an advanced version, Vote2Cap-DETR++, which decouples the queries into localization and caption queries to capture task-specific features.
17: % 
18: Additionally, we apply an iterative spatial refinement strategy to the vote queries for faster convergence and better localization performance.
19: % 
20: We also insert additional spatial information into the caption head for more accurate descriptions.
21: % 
22: Without bells and whistles, extensive experiments on two commonly used datasets, ScanRefer and Nr3D, demonstrate that Vote2Cap-DETR and Vote2Cap-DETR++ surpass conventional ``detect-then-describe'' methods by a large margin.
23: }
24: % 
25: Code will be made available at \href{https://github.com/ch3cook-fdu/Vote2Cap-DETR}{https://github.com/ch3cook-fdu/Vote2Cap-DETR}.
26: \end{abstract}
27: