\begin{abstract}
Video Foundation Models (VFMs) have received limited exploration due to high computational costs and data scarcity.
Previous VFMs rely on Image Foundation Models (IFMs),
which face challenges in transferring to the video domain.
Although VideoMAE has trained a robust ViT from limited data,
its low-level reconstruction poses convergence difficulties and conflicts with high-level cross-modal alignment.
This paper proposes a training-efficient method for temporally sensitive VFMs that integrates the benefits of existing methods.
To increase data efficiency,
we mask out most of the low-semantics video tokens,
but selectively align the unmasked tokens with an IFM,
which serves as the \textbf{U}n\textbf{M}asked \textbf{T}eacher (\textbf{UMT}).
By providing semantic guidance,
our method enables faster convergence and multimodal friendliness.
With a progressive pre-training framework,
our model can handle various tasks including scene-related, temporal-related, and complex video-language understanding.
Using only public sources for pre-training in \textbf{6 days} on \textbf{32 A100} GPUs,
our scratch-built ViT-L/16 achieves state-of-the-art performance on various video tasks.
The code and models will be released at \url{https://github.com/OpenGVLab/unmasked_teacher}.
\end{abstract}
