1c9eb1d164573691.tex
1: \begin{abstract}
2: 
3: \textit{Prompt-tuning} has emerged as an attractive paradigm for deploying large-scale language models due to its strong downstream task performance and efficient multitask serving ability.
4: Despite its wide adoption, we empirically show that prompt-tuning is vulnerable to downstream task-agnostic backdoors, which reside in the pretrained models and can affect arbitrary downstream tasks.
5: State-of-the-art backdoor detection approaches cannot defend against task-agnostic backdoors because their trigger inversion rarely converges.
6: To address this issue, we propose \method, a novel approach for detecting and removing task-agnostic backdoors on Transformer models.
7: Instead of directly inverting the triggers, \method inverts the \textit{predefined attack vectors} (the pretrained model's output when the input is embedded with a trigger) of the task-agnostic backdoors, which yields much better convergence and backdoor detection accuracy.
8: \method further leverages prompt-tuning's property of freezing the pretrained model to perform accurate and fast output monitoring and input purging during the inference phase.
9: Extensive experiments on multiple language models and NLP tasks illustrate the effectiveness of \method.
10: For instance, \method achieves 92.8\% backdoor detection accuracy on 960 models and decreases the attack success rate to less than 1\% in most scenarios.\footnote{Code is available at \url{https://github.com/meng-wenlong/LMSanitator}.}
11: 
12: \end{abstract}
13: