
\begin{abstract}

Masked image modeling (MIM) has shown great promise for self-supervised learning (SSL), yet it has been criticized for its learning inefficiency.

We attribute this inefficiency to the insufficient utilization of training signals.

To alleviate this issue, we introduce a conceptually simple yet learning-efficient MIM training scheme, termed \textit{D}isjoint \textit{M}asking with \textit{J}oint \textit{D}istillation (DMJD).

For disjoint masking (DM), we sequentially sample multiple masked views per image in a mini-batch under a disjoint regulation, which increases the proportion of tokens used for reconstruction in each image while preserving the masking rate of each view.

For joint distillation (JD), we adopt a dual-branch architecture that predicts invisible (masked) and visible (unmasked) tokens, respectively, with superior learning targets.

Rooted in orthogonal perspectives on improving training efficiency, DM and JD cooperatively accelerate training convergence without sacrificing the model's generalization ability.

Concretely, DM trains ViT with half of the effective training epochs ($3.7\times$ less training time) while reaching competitive performance. With JD, DMJD improves the linear probing classification accuracy over ConvMAE by 5.8\%.

On fine-grained downstream tasks such as semantic segmentation and object detection, DMJD also generalizes better than state-of-the-art SSL methods.

The code and model will be made public at \href{https://github.com/mx-mark/DMJD}{https://github.com/mx-mark/DMJD}.

\end{abstract}
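The disjoint masking idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `disjoint_masks` and `token_usage`, the token count, and the masking ratio below are hypothetical choices for illustration, and the sketch assumes that the masked sets of all views can be strictly disjoint (that is, the number of views times the masking ratio does not exceed one). The abstract does not state how the method behaves at higher masking ratios.

```python
import random


def disjoint_masks(num_tokens, mask_ratio, num_views, seed=None):
    """Sequentially sample `num_views` masked-index sets that are pairwise disjoint.

    Hypothetical helper: every view masks the same number of tokens, so the
    per-view masking rate is preserved. Assumes the views fit without overlap.
    """
    num_masked = int(num_tokens * mask_ratio)
    if num_views * num_masked > num_tokens:
        raise ValueError("masked sets cannot be pairwise disjoint at this ratio")
    rng = random.Random(seed)
    remaining = list(range(num_tokens))  # tokens not yet masked in any view
    masks = []
    for _ in range(num_views):
        masked = set(rng.sample(remaining, num_masked))
        masks.append(masked)
        # Exclude already-masked tokens so the next view cannot reuse them.
        remaining = [t for t in remaining if t not in masked]
    return masks


def token_usage(masks, num_tokens):
    """Fraction of an image's tokens that serve as a reconstruction target in some view."""
    return len(set().union(*masks)) / num_tokens


if __name__ == "__main__":
    views = disjoint_masks(num_tokens=196, mask_ratio=0.25, num_views=4, seed=0)
    print([len(v) for v in views], token_usage(views, 196))
```

Under these assumptions, each added view raises the fraction of tokens used for reconstruction by exactly the masking ratio, whereas independently sampled views would overlap and cover fewer distinct tokens.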
