\begin{abstract}
\begin{comment}
deployment challenge
-> quantization

-> 4 bit as a standard, sub4bits, huge accuracy loss compared to full precision.
( -> existing works PTQ + QAT, still a lot of potential for improvement)->

we propose xxx to unleash xxx, particularly in extreme low bit (3,2) settings
\end{comment}

The upscaling of Large Language Models (LLMs) has yielded impressive advances in natural language processing, yet it also poses significant deployment challenges.
Weight quantization has emerged as a widely embraced solution to reduce memory and computational demands.
This paper introduces BitDistiller, a framework that synergizes Quantization-Aware Training (QAT) with Knowledge Distillation (KD) to boost the performance of LLMs at ultra-low precisions (sub-4-bit).
Specifically, BitDistiller first incorporates a tailored asymmetric quantization and clipping technique to maximally preserve the fidelity of quantized weights, and then proposes a novel Confidence-Aware Kullback-Leibler Divergence (CAKLD) objective, which is employed in a self-distillation manner to enable faster convergence and superior model performance.
Empirical evaluations demonstrate that BitDistiller significantly surpasses existing methods in both 3-bit and 2-bit configurations on general language understanding and complex reasoning benchmarks.
Notably, BitDistiller is shown to be more cost-effective, requiring less data and fewer training resources. The code is available at \url{https://github.com/DD-DuDa/BitDistiller}.

%BitDistiller tackles two fundamental challenges in low-bit QAT with KD: preserving the fidelity of quantized weights and effectively transferring knowledge in distillation.


\end{abstract}