
\begin{abstract}

Fine-tuning and inference with large language models~({\lmabbr}s) are generally known to be expensive.

Parameter-efficient fine-tuning of pretrained {\lmabbr}s reduces training memory by updating only a small number of parameters, but it does not improve inference efficiency. Structured pruning improves inference efficiency by removing consistent blocks of parameters, yet it often increases training memory and time. To improve both training and inference efficiency, we introduce {\ourmethod}, which adaptively {\it prunes} and {\it tunes} the parameters of {\lmabbr}s.

In the early stage of fine-tuning, {\ourmethod} dynamically adds {\it salient} tuning parameters for fast and accurate convergence while discarding unimportant parameters for efficiency.

Our experiments show that, compared to baselines, {\ourmethod} maintains up to 98\% of task performance when pruning 60\% of the parameters in RoBERTa and T5 models. {\ourmethod} also preserves 86.4\% of LLaMA models' performance with 70\% of the parameters remaining. Furthermore, {\ourmethod} speeds up {\lmabbr} fine-tuning by up to 8$\times$ and reduces the training memory footprint of large {\lmabbr}s by up to 70\%. Our code and models are publicly available at \url{https://github.com/ROIM1998/APT}.

\end{abstract}
