\begin{abstract}

This paper revisits the standard pretrain-then-finetune paradigm used in computer vision for visual recognition tasks.
Typically, state-of-the-art foundation models are pretrained using large-scale (weakly) supervised datasets with billions of images.
We introduce an additional pre-pretraining stage that is simple and uses the self-supervised \mae technique to initialize the model.
While \mae has previously been shown to scale only with model size, we find that it scales with the size of the training dataset as well.
Thus, our \mae-based pre-pretraining scales with both model and data size, making it applicable for training foundation models.
Pre-pretraining consistently improves both the model convergence and the downstream transfer performance across a range of model scales (millions to billions of parameters) and dataset sizes (millions to billions of images).
We measure the effectiveness of pre-pretraining on 10 different visual recognition tasks spanning image classification, video recognition, object detection, low-shot classification, and zero-shot recognition.
Our largest model achieves new state-of-the-art results on \inat (91.3\%), 1-shot \inetOneK (62.1\%), and zero-shot transfer on \food (96.0\%).
Our study reveals that model initialization plays a significant role, even for web-scale pretraining with billions of images.
\end{abstract}