abstract:224c5a45406c72d5.tex

1: \begin{abstract}

2:

3: \let\thefootnote\relax\footnotetext{$\dagger$ Work done during an internship at ARC Lab, Tencent PCG. \\ {\hspace*{1.5em} $*$ Corresponding author}}

4:

5:

6:

7:    State-of-the-art vision-language pretraining (VLP) achieves exemplary performance

8:   %

9:   but suffers from high training costs resulting from slow convergence and long training time, especially on large-scale web datasets.

10:   %

11:   An essential obstacle to training efficiency lies in the entangled prediction rate {\color{black}(percentage of tokens for reconstruction)} and corruption rate {\color{black}(percentage of corrupted  tokens)} in masked language modeling (MLM),

12:   that is, a proper corruption rate is achieved at the cost of excluding a large portion of output tokens from the prediction loss.

13:   %

14: To accelerate the convergence of VLP, we propose a new pretraining task, namely, free language modeling (FLM), that enables a 100\% prediction rate with arbitrary corruption rates.

15: FLM decouples the prediction rate from the corruption rate while allowing the corruption spans to be customized for each token to be predicted.

16: FLM-trained models are encouraged to learn better and faster given the same GPU time by exploiting bidirectional contexts more flexibly.

17: %

18: Extensive experiments show that FLM can achieve a $2.5\times$ reduction in pretraining time compared with MLM-based methods while maintaining competitive performance on both vision-language understanding and generation tasks. Code will be made public at~\url{https://github.com/TencentARC/FLM}.

19:

20: \end{abstract}

21: