abstract:ce3c9d11132e0dff.tex

1: \begin{abstract}

2: \vspace{-0.4cm}

3: %Large-scale pre-training and downstream fine-tuning paradigm

4: The paradigm of large-scale pre-training followed by downstream fine-tuning has been widely employed in various object detection algorithms.

5: In this paper, we reveal discrepancies in \textbf{data}, \textbf{model}, and \textbf{task} between the pre-training and fine-tuning procedures in existing practice, which implicitly limit the detector's performance, generalization ability, and convergence speed.

6: To this end, we propose AlignDet, a unified pre-training framework that can be adapted to various existing detectors to alleviate these discrepancies. AlignDet decouples the pre-training process into two stages, i.e., image-domain and box-domain pre-training.

7: Image-domain pre-training optimizes the detection backbone to capture holistic visual abstractions, while box-domain pre-training learns instance-level semantics and task-aware concepts to initialize the modules outside the backbone.

8: By incorporating self-supervised pre-trained backbones, we can pre-train all modules of various detectors in an unsupervised manner.

9: As depicted in Figure~\ref{fig:motivation}, extensive experiments demonstrate that AlignDet achieves significant improvements across diverse protocols, such as \textcolor{teal}{detection algorithm}, \textcolor{blue}{model backbone}, \textcolor{red}{data setting}, and \textcolor{cyan}{training schedule}.

10: For example, AlignDet improves FCOS by \textbf{5.3} mAP, RetinaNet by \textbf{2.1} mAP, Faster R-CNN by \textbf{3.3} mAP, and DETR by \textbf{2.3} mAP with fewer training epochs.

11:

12:

13:

14:

15:

16:

17: \vspace{-0.35cm}

18:

19: \end{abstract}

20: