\begin{abstract}
Video Foundation Models (VFMs) have received limited exploration due to high computational costs and data scarcity.
Previous VFMs rely on Image Foundation Models (IFMs),
which face challenges in transferring to the video domain.
Although VideoMAE has trained a robust ViT from limited data,
its low-level reconstruction poses convergence difficulties and conflicts with high-level cross-modal alignment.
This paper proposes a training-efficient method for temporally sensitive VFMs that integrates the benefits of existing methods.
To increase data efficiency,
we mask out most of the low-semantics video tokens,
but selectively align the unmasked tokens with an IFM,
which serves as the \textbf{U}n\textbf{M}asked \textbf{T}eacher (\textbf{UMT}).
By providing semantic guidance,
our method enables faster convergence and is friendly to multimodal learning.
With a progressive pre-training framework,
our model can handle various tasks, including scene-related, temporal-related, and complex video-language understanding.
Pre-trained in \textbf{6 days} on \textbf{32 A100} GPUs using only public sources,
our ViT-L/16, trained from scratch, achieves state-of-the-art performance on various video tasks.
The code and models will be released at \url{https://github.com/OpenGVLab/unmasked_teacher}.
\end{abstract}