
\begin{abstract}

% motivations:

% (1) why MIM-based methods generalize better than supervised ones

% (2) investigating the gradient correction effect of RC-MAE

% Recently, self-supervised learning approaches based on masked image modeling~(MIM) with Vision Transformers have emerged rapidly due to their superior performance and scalability.


The masked autoencoder (MAE) has drawn attention as a representative self-supervised approach for masked image modeling with vision transformers. However, although MAE generalizes better than fully supervised training from scratch, the reason for this has not been explored.

In another line of work, the Reconstruction-Consistent Masked Auto-Encoder (RC-MAE) was proposed; it incorporates self-distillation into MAE in the form of an exponential moving average (EMA) teacher, and the EMA teacher has been shown to perform a conditional gradient correction during optimization. To investigate, from an optimization perspective, why a self-supervised ViT trained with MAE (MAE-ViT) generalizes better and what effect the gradient correction of RC-MAE has, we visualize the loss landscapes of ViTs self-supervised with MAE and RC-MAE and compare them with that of a supervised ViT (Sup-ViT). Unlike previous loss landscape visualizations of neural networks, which are based on the classification task loss, we visualize the loss landscape of the ViT using the pre-training task loss.

Through the lens of loss landscapes, we make two observations: (1) MAE-ViT has a smoother and wider overall loss curvature than Sup-ViT, and (2) the EMA teacher allows MAE to widen the region of convexity in both pre-training and linear probing, leading to faster convergence.

To the best of our knowledge, this work is the first to investigate the self-supervised ViT through the lens of the loss landscape.


\end{abstract}
