0c1ee4ea6aef4ccd.tex
1: \begin{abstract}
2: The recently developed DETR approach
3: applies the transformer encoder and decoder architecture
4: to object detection
5: and
6: achieves promising performance.
7: In this paper,
8: we address a critical issue of DETR, slow training convergence,
9: and present a conditional cross-attention mechanism
10: for fast DETR training.
11: Our approach is motivated by the observation
12: that the cross-attention in DETR relies heavily on
13: the content embeddings 
14: for localizing the four extremities
15: and predicting the box,
16: which increases the need for high-quality content embeddings
17: and thus the training difficulty.
18: 
19: 
20: Our approach,
21: named conditional DETR,
22: learns a conditional spatial query
23: from the decoder embedding
24: for decoder multi-head cross-attention.
25: The benefit is that
26: through the conditional spatial query,
27: each cross-attention head is able to attend
28: to a band containing a distinct region,
29: e.g., one object extremity
30: or a region inside the object box.
31: This narrows down
32: the spatial range for localizing the distinct regions
33: for object classification and box regression,
34: thus relaxing the dependence on 
35: the content embeddings and
36: easing the training.
37: Empirical results show that
38: conditional DETR converges $6.7\times$ faster than DETR
39: for the backbones R$50$ and R$101$
40: and $10\times$ faster for stronger backbones 
41: DC$5$-R$50$ and DC$5$-R$101$.
42: Code is available at~\url{https://github.com/Atten4Vis/ConditionalDETR}.
43: 
44: \end{abstract}
45: