\begin{abstract}
This paper revisits the standard pretrain-then-finetune paradigm used in computer vision for visual recognition tasks.
Typically, state-of-the-art foundation models are pretrained using large-scale (weakly) supervised datasets with billions of images.
We introduce a simple additional pre-pretraining stage that uses the self-supervised \mae technique to initialize the model.
While \mae has previously been shown to scale only with model size, we find that it also scales with the size of the training dataset.
Thus, our \mae-based pre-pretraining scales with both model and data size, making it applicable for training foundation models.
Pre-pretraining consistently improves both model convergence and downstream transfer performance across a range of model scales (millions to billions of parameters) and dataset sizes (millions to billions of images).
We measure the effectiveness of pre-pretraining on 10 different visual recognition tasks spanning image classification, video recognition, object detection, low-shot classification, and zero-shot recognition.
Our largest model achieves new state-of-the-art results on \inat (91.3\%), 1-shot \inetOneK (62.1\%), and zero-shot transfer on \food (96.0\%).
Our study reveals that model initialization plays a significant role, even for web-scale pretraining with billions of images.

\end{abstract}