224c5a45406c72d5.tex
1: \begin{abstract}
2: 
3: \let\thefootnote\relax\footnotetext{$\dagger$ Work done during an internship at ARC Lab, Tencent PCG. \\ {\hspace*{1.5em} $*$ Corresponding author}}
4: 
5: 
6: 
7:    State-of-the-art vision-language pretraining (VLP) achieves exemplary performance 
8:   %
9:   but suffers from high training costs resulting from slow convergence and long training time, especially on large-scale web datasets. 
10:   %
11:   A key obstacle to training efficiency is the entanglement of the prediction rate (percentage of tokens for reconstruction) and the corruption rate (percentage of corrupted tokens) in masked language modeling (MLM), 
12:   that is, a proper corruption rate is achieved at the cost of a large portion of output tokens being excluded from the prediction loss.  
13:   %
14: To accelerate the convergence of VLP, we propose a new pretraining task, namely, free language modeling (FLM), that enables a 100\% prediction rate with arbitrary corruption rates.
15: FLM decouples the prediction rate from the corruption rate while allowing the corruption spans to be customized for each token to be predicted.
16: FLM-trained models are encouraged to learn better and faster given the same GPU time by exploiting bidirectional contexts more flexibly.
17: %
18: Extensive experiments show that FLM achieves an impressive $2.5\times$ reduction in pretraining time compared with MLM-based methods, while maintaining competitive performance on both vision-language understanding and generation tasks. Code will be made public at~\url{https://github.com/TencentARC/FLM}.
19: 
20: \end{abstract}
21: