
\begin{abstract}

Pretrained language models~(PLMs) are today the primary models for natural language processing.

Despite their impressive downstream performance, PLMs can be difficult to apply to new languages, which is a barrier to making their capabilities universally accessible.

While prior work has shown that this issue can be addressed by learning a new embedding layer for the new language, doing so is both data- and compute-inefficient.

We propose to use an \textit{active forgetting} mechanism during pretraining, as a simple way of creating PLMs that can quickly adapt to new languages.

Concretely, by resetting the embedding layer every $K$ updates during pretraining, we encourage the PLM to improve its ability to learn new embeddings within a limited number of updates, an effect similar to meta-learning.

Experiments with RoBERTa show that models pretrained with our forgetting mechanism not only demonstrate faster convergence during language adaptation, but also outperform standard ones in a low-data regime, particularly for languages that are distant from English. Code will be available at \url{https://github.com/facebookresearch/language-model-plasticity}.

\end{abstract}
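
One way to formalize the reset schedule described in the abstract is sketched below. The notation is assumed here purely for illustration: $\theta_{\mathrm{emb}}$ and $\theta_{\mathrm{body}}$ denote the embedding-layer and remaining parameters, $\mathrm{Opt}$ is a generic optimizer step, and $p_{\mathrm{init}}$ is the embedding initialization distribution. In particular, the sketch assumes that a reset means re-sampling the embeddings from $p_{\mathrm{init}}$, and it leaves out any handling of optimizer state.
\[
\bigl(\tilde{\theta}_{\mathrm{emb}}^{(t)},\, \theta_{\mathrm{body}}^{(t)}\bigr)
= \mathrm{Opt}\bigl(\theta_{\mathrm{emb}}^{(t-1)},\, \theta_{\mathrm{body}}^{(t-1)}\bigr),
\qquad
\theta_{\mathrm{emb}}^{(t)} =
\left\{
\begin{array}{ll}
\theta'_{\mathrm{emb}} \sim p_{\mathrm{init}} & \mbox{if } t \bmod K = 0, \\
\tilde{\theta}_{\mathrm{emb}}^{(t)} & \mbox{otherwise.}
\end{array}
\right.
\]
Under this reading, $\theta_{\mathrm{body}}$ is never reset, so it is repeatedly trained against freshly initialized embeddings. This mirrors the situation that arises when a new embedding layer is learned for a new language.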
