
\begin{abstract}
In this paper,
we study the Detection Transformer (DETR),
an end-to-end object detection approach
based on a transformer encoder-decoder architecture
that requires no hand-crafted postprocessing, such as NMS.
Inspired by Conditional DETR,
an improved DETR with fast training convergence
that introduced box queries (originally called spatial queries)
for the internal decoder layers,
we reformulate the object query
into the format of the box query,
which is a composition
of the embeddings
of the reference point
and the transformation
of the box with respect
to the reference point.
This reformulation
reveals the connection
between the object query in DETR
and the anchor box that is widely studied
in Faster R-CNN.
Furthermore,
we learn the box queries from the image content,
which further improves the detection quality
of Conditional DETR
while retaining fast training convergence.
In addition,
we adopt the idea of axial self-attention to reduce the memory cost and accelerate the encoder.
The resulting detector, called Conditional DETR V2,
achieves better results
than Conditional DETR,
uses less memory,
and runs more efficiently.
For example,
with the DC$5$-ResNet-$50$ backbone,
our approach
achieves $44.8$ AP at $16.4$ FPS on the COCO \emph{val} set.
Compared to Conditional DETR, it
runs $1.6\times$ faster, saves $74$\% of the overall memory cost, and improves the AP score by $1.0$.
\end{abstract}
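
As a rough illustration of the box-query composition summarized in the abstract, and not the exact formulation used in the paper, the box query can be sketched as an element-wise product. The symbols $\mathbf{p}_q$, $\mathbf{p}_s$, $\lambda_q$, $\mathbf{s}$, and $\mathrm{PE}$ are notation assumed here for illustration only:
\[
\mathbf{p}_q = \lambda_q \odot \mathbf{p}_s,
\qquad
\mathbf{p}_s = \mathrm{PE}(\mathbf{s}),
\]
where $\mathbf{s}$ denotes the reference point, $\mathrm{PE}(\cdot)$ a positional embedding, and $\lambda_q$ a vector encoding the transformation of the box with respect to the reference point.

The memory saving from axial self-attention can be motivated in a similarly hedged way. For a feature map of assumed size $H \times W$, full self-attention computes a score for every pair of positions, whereas attending along rows and then along columns computes far fewer scores:
\[
\underbrace{(HW)^2}_{\mbox{full self-attention}}
\quad \mbox{versus} \quad
\underbrace{HW\,(H + W)}_{\mbox{axial self-attention}}.
\]
This counts attention scores only. How the encoder applies the axial idea in detail is not stated in the abstract.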