\begin{abstract}
In this paper,
we study the Detection Transformer (DETR),
an end-to-end object detection approach
based on a transformer encoder-decoder architecture
that requires no hand-crafted postprocessing, such as NMS.
Inspired by Conditional DETR,
an improved DETR with fast training convergence
that introduced box queries (originally called spatial queries)
for internal decoder layers,
we reformulate the object query
into the format of the box query,
which is a composition
of the embeddings
of the reference point
and the transformation
of the box with respect
to the reference point.
This reformulation
reveals the connection
between the object query in DETR
and the anchor box widely studied
in Faster R-CNN.
Furthermore,
we learn the box queries from the image content,
which further improves the detection quality
of Conditional DETR
while retaining fast training convergence.
In addition,
we adopt the idea of axial self-attention to reduce the memory cost and accelerate the encoder.
The resulting detector, called Conditional DETR V2,
achieves better results
than Conditional DETR,
uses less memory,
and runs more efficiently.
For example,
with the DC$5$-ResNet-$50$ backbone,
our approach
achieves $44.8$ AP at $16.4$ FPS on the COCO \emph{val} set. Compared to Conditional DETR, it
runs $1.6\times$ faster, saves $74$\% of the overall memory cost, and improves the AP score by $1.0$.
\end{abstract}