\begin{abstract}
Model quantization represents the weight matrices of an existing model with low bit-width values, which is a promising approach to reducing both the storage and computational overheads of deploying LLMs. However, current quantization methods suffer severe performance degradation when the bit-width is reduced to extremes, and thus focus on quantizing models to 4-bit or 8-bit values.
This paper boldly quantizes the weight matrices of LLMs to 1 bit, paving the way for extremely low bit-width deployment of LLMs.
To this end, we introduce a 1-bit
% quantization-aware training (QAT)
model compression
framework named OneBit, which comprises a novel 1-bit parameter representation method to better quantize LLMs and an effective parameter initialization method based on matrix decomposition to improve the convergence speed of the
% QAT
quantization framework.
Extensive experiments indicate that OneBit achieves good performance (at least 81\% of the non-quantized performance on LLaMA models) with robust training processes when using only 1-bit weight matrices. Code and checkpoints are available at \url{https://github.com/xuyuzhuang11/OneBit}.
\end{abstract}