abstract:2b49d639690bcdd4.tex

1: \begin{abstract}

2: %

3: 3D dense captioning requires a model to translate its understanding of an input 3D scene into several captions associated with different object regions.

4: %

5: Existing methods adopt a sophisticated ``detect-then-describe'' pipeline, which builds explicit relation modules upon a 3D detector with numerous hand-crafted components.

6: %

7: While these methods have achieved initial success, the cascaded pipeline tends to accumulate errors caused by duplicate or inaccurate box estimates and cluttered 3D scenes.

8: %

9: In this paper, we first propose Vote2Cap-DETR, a simple yet effective transformer framework that decouples caption generation from object localization through parallel decoding.

10: %

11: % We show that the sophisticated and explicit relation reasoning modules can be replaced by the attention mechanism to capture both object-object and object-scene relations.

12: %

13: \whatsnew{

14: Moreover, we argue that object localization and description generation require different levels of scene understanding, which can be difficult for a single shared set of queries to capture.

15: %

16: To this end, we propose an advanced version, Vote2Cap-DETR++, which decouples the queries into localization and caption queries to capture task-specific features.

17: %

18: Additionally, we apply an iterative spatial refinement strategy to the vote queries for faster convergence and better localization performance.

19: %

20: We also insert additional spatial information into the caption head for more accurate descriptions.

21: %

22: Without bells and whistles, extensive experiments on two commonly used datasets, ScanRefer and Nr3D, demonstrate that Vote2Cap-DETR and Vote2Cap-DETR++ surpass conventional ``detect-then-describe'' methods by a large margin.

23: }

24: %

25: Code will be made available at \href{https://github.com/ch3cook-fdu/Vote2Cap-DETR}{https://github.com/ch3cook-fdu/Vote2Cap-DETR}.

26: \end{abstract}

27: