\begin{abstract}
The recently developed DETR approach
applies the transformer encoder-decoder architecture
to object detection
and achieves promising performance.
In this paper,
we address a critical issue, slow training convergence,
and present a conditional cross-attention mechanism
for fast DETR training.
Our approach is motivated by the observation
that the cross-attention in DETR relies heavily on
the content embeddings
for localizing the four extremities
and predicting the box,
which increases the need for high-quality content embeddings
and thus the training difficulty.
Our approach,
named conditional DETR,
learns a conditional spatial query
from the decoder embedding
for decoder multi-head cross-attention.
The benefit is that,
through the conditional spatial query,
each cross-attention head is able to attend
to a band containing a distinct region,
e.g., one object extremity
or a region inside the object box.
This narrows down
the spatial range for localizing the distinct regions
for object classification and box regression,
thus relaxing the dependence on
the content embeddings and
easing the training.
Empirical results show that
conditional DETR converges $6.7\times$ faster
for the backbones R$50$ and R$101$
and $10\times$ faster for the stronger backbones
DC$5$-R$50$ and DC$5$-R$101$.
Code is available at~\url{https://github.com/Atten4Vis/ConditionalDETR}.
\end{abstract}
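As a rough illustration of the mechanism summarized in the abstract, the conditional cross-attention weight might be sketched as below. The notation is an assumption introduced here for illustration and does not appear in the abstract: $c_q$ and $c_k$ denote the content query and key, $p_k$ the positional embedding of the key, $f$ the decoder embedding, $p_s$ a positional embedding of a reference point, $\mathrm{FFN}$ a learned feed-forward projection, and $\odot$ the element-wise product.
\[
p_q = \mathrm{FFN}(f) \odot p_s,
\qquad
a(q, k) = c_q^{\top} c_k + p_q^{\top} p_k .
\]
Under these assumptions, the spatial term $p_q^{\top} p_k$ is separate from the content term $c_q^{\top} c_k$, which is one way to read the claim that each head can attend to a narrow band without relying only on the content embeddings.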