\begin{abstract}
  Model quantization represents the weight matrices of existing models with low bit-width values, which is a promising approach to reducing both the storage and computational overheads of deploying highly anticipated LLMs. However, current quantization methods suffer severe performance degradation when the bit-width is reduced to extremes, and thus focus on 4-bit or 8-bit quantization.
  This paper boldly quantizes the weight matrices of LLMs to 1-bit, paving the way for the extremely low bit-width deployment of LLMs.
  To this end, we introduce a 1-bit
  % quantization-aware training (QAT)
  model compression
  framework named OneBit, which includes a novel 1-bit parameter representation method to better quantize LLMs and an effective parameter initialization method based on matrix decomposition to improve the convergence speed of the
  % QAT
  quantization framework.
  Extensive experimental results indicate that OneBit achieves good performance (at least 81\% of the non-quantized performance on LLaMA models) with robust training processes when using only 1-bit weight matrices. Code and checkpoints are available at \url{https://github.com/xuyuzhuang11/OneBit}.
\end{abstract}