
\begin{abstract}


\textit{Prompt-tuning} has emerged as an attractive paradigm for deploying large-scale language models due to its strong downstream task performance and efficient multitask serving ability.

We empirically show that, despite its wide adoption, prompt-tuning is vulnerable to downstream task-agnostic backdoors, which reside in the pretrained model and can affect arbitrary downstream tasks.

State-of-the-art backdoor detection approaches cannot defend against task-agnostic backdoors because they struggle to converge when inverting the backdoor triggers.

To address this issue, we propose \method, a novel approach for detecting and removing task-agnostic backdoors in Transformer models.

Instead of directly inverting the triggers, \method inverts the \textit{predefined attack vectors} of the task-agnostic backdoors (the pretrained model's outputs when the input is embedded with triggers), which achieves much better convergence and backdoor detection accuracy.

\method further leverages prompt-tuning's property of freezing the pretrained model to perform accurate and fast output monitoring and input purging during the inference phase.

Extensive experiments on multiple language models and NLP tasks demonstrate the effectiveness of \method.

For instance, \method achieves 92.8\% backdoor detection accuracy on 960 models and decreases the attack success rate to less than 1\% in most scenarios.\footnote{Code is available at \url{https://github.com/meng-wenlong/LMSanitator}.}


\end{abstract}
