1: \begin{abstract}
2: % motivations:
3: % (1) why MIM-based methods generalize better than supervised ones
4: % (2) Investigating the gradient correction effect of RC-MAE
5: % Recently, self-supervised learning approaches based on masked image modeling~(MIM) with Vision Transformers have emerged rapidly due to their superior performance and scalability.
6:
7: The masked autoencoder (MAE) has drawn attention as a representative self-supervised approach for masked image modeling with vision transformers (ViTs). However, although MAE generalizes better than fully supervised training from scratch, the reason for this has not been explored.
8: In another line of work, the Reconstruction Consistent Masked Auto Encoder (RC-MAE) has been proposed, which incorporates a self-distillation scheme into MAE in the form of an exponential moving average (EMA) teacher; the EMA teacher has been shown to perform a conditional gradient correction during optimization. To investigate, from the perspective of optimization, why a ViT pre-trained with MAE (MAE-ViT) generalizes better and what effect the gradient correction of RC-MAE has, we visualize the loss landscapes of ViTs trained in a self-supervised manner with MAE and RC-MAE and compare them with that of a supervised ViT (Sup-ViT). Unlike previous loss landscape visualizations of neural networks, which are based on the classification task loss, we visualize the loss landscape of the ViT by computing the pre-training task loss.
9: Through the lens of loss landscapes, we make two observations: (1) MAE-ViT has a smoother and wider overall loss curvature than Sup-ViT. (2) The EMA teacher allows MAE to widen the region of convexity in both pre-training and linear probing, leading to faster convergence.
10: To the best of our knowledge, this work is the first to investigate the self-supervised ViT through the lens of the loss landscape.
11:
12: \end{abstract}
13: