\begin{abstract}
Pretrained language models~(PLMs) are today the primary models for natural language processing.
Despite their impressive downstream performance, it can be difficult to apply PLMs to new languages, a barrier to making their capabilities universally accessible.
While prior work has shown that this issue can be addressed by learning a new embedding layer for the new language, doing so is both data- and compute-inefficient.
We propose to use an \textit{active forgetting} mechanism during pretraining, as a simple way of creating PLMs that can quickly adapt to new languages.
Concretely, by resetting the embedding layer every $K$ updates during pretraining, we encourage the PLM to improve its ability to learn new embeddings within a limited number of updates, similar to a meta-learning effect.
Experiments with RoBERTa show that models pretrained with our forgetting mechanism not only demonstrate faster convergence during language adaptation, but also outperform standard ones in a low-data regime, particularly for languages that are distant from English. Code will be available at \url{https://github.com/facebookresearch/language-model-plasticity}.
\end{abstract}
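
The resetting rule summarized in the abstract can be written out as a short update schedule. The following is only a sketch under assumed notation that does not appear in the original text: $\theta^{\mathrm{emb}}_t$ and $\theta^{\mathrm{body}}_t$ denote the embedding and non-embedding parameters after update $t$, $g(\cdot)$ stands for one generic optimizer step on the pretraining loss, $\tilde{\theta}^{\mathrm{emb}}_t$ is the embedding produced by that step before any reset, and $p_{\mathrm{init}}$ is the distribution used to initialize the embeddings. That a reset re-samples the embeddings from $p_{\mathrm{init}}$ is an assumption of this sketch.
\[
\bigl(\tilde{\theta}^{\mathrm{emb}}_{t},\, \theta^{\mathrm{body}}_{t}\bigr)
  = g\bigl(\theta^{\mathrm{emb}}_{t-1},\, \theta^{\mathrm{body}}_{t-1}\bigr),
\qquad
\theta^{\mathrm{emb}}_{t} =
\left\{
\begin{array}{ll}
\phi_t \sim p_{\mathrm{init}} & \mbox{if } t \bmod K = 0, \\
\tilde{\theta}^{\mathrm{emb}}_{t} & \mbox{otherwise.}
\end{array}
\right.
\]
As a purely illustrative case, with a hypothetical $K = 1000$ the embeddings would be re-sampled after updates $1000, 2000, 3000, \ldots$, while $\theta^{\mathrm{body}}$ is never reset. Under this reading, the body is repeatedly paired with freshly initialized embeddings, which is what would push it toward supporting fast relearning of an embedding layer.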