\begin{abstract}
\vspace{-0.4cm}
%Large-scale pre-training and downstream fine-tuning paradigm
The paradigm of large-scale pre-training followed by downstream fine-tuning is widely used in object detection.
In this paper, we reveal discrepancies in \textbf{data}, \textbf{model}, and \textbf{task} between the pre-training and fine-tuning procedures in existing practices, which implicitly limit the detector's performance, generalization ability, and convergence speed.
To this end, we propose AlignDet, a unified pre-training framework that can be adapted to various existing detectors to alleviate these discrepancies. AlignDet decouples the pre-training process into two stages, i.e., image-domain and box-domain pre-training.
The image-domain pre-training optimizes the detection backbone to capture holistic visual abstraction, and the box-domain pre-training learns instance-level semantics and task-aware concepts to initialize the modules outside the backbone.
By incorporating self-supervised pre-trained backbones, we can pre-train all modules of various detectors in an unsupervised paradigm.
As depicted in Figure~\ref{fig:motivation}, extensive experiments demonstrate that AlignDet achieves significant improvements across diverse protocols, including \textcolor{teal}{detection algorithm}, \textcolor{blue}{model backbone}, \textcolor{red}{data setting}, and \textcolor{cyan}{training schedule}.
For example, AlignDet improves FCOS by \textbf{5.3} mAP, RetinaNet by \textbf{2.1} mAP, Faster R-CNN by \textbf{3.3} mAP, and DETR by \textbf{2.3} mAP under fewer epochs.

\vspace{-0.35cm}

\end{abstract}