\begin{abstract}
  Model quantization represents the weight matrices of an existing model with low bit-width values, which is a promising approach to reducing both the storage and computational overheads of deploying highly anticipated LLMs. However, current quantization methods suffer severe performance degradation when the bit-width is reduced to extremes, and therefore focus on quantizing models to 4-bit or 8-bit values.
This paper boldly quantizes the weight matrices of LLMs to 1-bit, paving the way for the extremely low bit-width deployment of LLMs.
To this end, we introduce a 1-bit
% quantization-aware training (QAT)
model compression
framework named OneBit, which includes a novel 1-bit parameter representation method to better quantize LLMs and an effective parameter initialization method based on matrix decomposition to improve the convergence speed of the
% QAT
quantization framework.
Extensive experimental results indicate that OneBit achieves good performance (at least 81\% of the non-quantized performance on LLaMA models) with robust training processes when using only 1-bit weight matrices. Code and checkpoints are available at \url{https://github.com/xuyuzhuang11/OneBit}.
\end{abstract}