\begin{abstract}

\looseness=-1
The high computational costs of large language models (LLMs) have led to a flurry of research on LLM compression, via methods such as quantization, sparsification, or structured pruning. A new frontier in this area is given by \emph{dynamic, non-uniform} compression methods, which adjust the compression levels (e.g., sparsity) per-block or even per-layer in order to minimize accuracy loss, while guaranteeing a global compression threshold.
Yet, current methods rely on heuristics for identifying the ``importance'' of a given layer towards the loss, based on assumptions such as \emph{error monotonicity}, i.e., that the end-to-end model compression error is proportional to the sum of layer-wise errors.
In this paper, we revisit this area and propose a new and general approach for dynamic compression that is provably optimal in a given input range.
We begin from the motivating observation that, in general, \emph{error monotonicity does not hold for LLMs}: compressed models with a lower sum of per-layer errors can perform \emph{worse} than models with higher error sums. To address this, we propose a new general evolutionary framework for dynamic LLM compression called \methodname{}, which has provable convergence and low sample and evaluation complexity.
We show that these theoretical guarantees lead to highly competitive practical performance for dynamic compression of Llama, Mistral and Phi models. Via \methodname{}, we set new state-of-the-art results across all compression approaches: structural pruning (block/layer dropping), unstructured sparsity, and quantization with dynamic bitwidths. Our code is available at \href{https://github.com/IST-DASLab/EvoPress}{https://github.com/IST-DASLab/EvoPress}.

\end{abstract}
