\begin{abstract}
\vspace{-0.4cm}
%Large-scale pre-training and downstream fine-tuning paradigm
The paradigm of large-scale pre-training followed by downstream fine-tuning has been widely employed in various object detection algorithms.
In this paper, we reveal discrepancies in \textbf{data}, \textbf{model}, and \textbf{task} between the pre-training and fine-tuning procedures in existing practices, which implicitly limit the detector's performance, generalization ability, and convergence speed.
To this end, we propose AlignDet, a unified pre-training framework that can be adapted to various existing detectors to alleviate these discrepancies. AlignDet decouples the pre-training process into two stages, i.e., image-domain and box-domain pre-training.
The image-domain pre-training optimizes the detection backbone to capture holistic visual abstraction, while the box-domain pre-training learns instance-level semantics and task-aware concepts to initialize the modules outside the backbone.
By incorporating self-supervised pre-trained backbones, we can pre-train all modules of various detectors in an unsupervised paradigm.
As depicted in Figure~\ref{fig:motivation}, extensive experiments demonstrate that AlignDet achieves significant improvements across diverse protocols, such as \textcolor{teal}{detection algorithm}, \textcolor{blue}{model backbone}, \textcolor{red}{data setting}, and \textcolor{cyan}{training schedule}.
For example, AlignDet improves FCOS by \textbf{5.3} mAP, RetinaNet by \textbf{2.1} mAP, Faster R-CNN by \textbf{3.3} mAP, and DETR by \textbf{2.3} mAP with fewer training epochs.

\vspace{-0.35cm}
\end{abstract}