1: \begin{abstract}
2: \begin{comment}
3: deployment challenge
4: -> quantization
5:
6: -> 4 bit as a standard, sub4bits, huge accuracy loss compared to full precision.
7: ( -> existing works PTQ + QAT, still a lot of potential for improvement)->
8:
9: we propose xxx to unleash xxx, particularly in extreme low bit (3,2) settings
10: \end{comment}
11:
12: Scaling up Large Language Models (LLMs) has yielded impressive advances in natural language processing, but it also poses significant deployment challenges.
13: Weight quantization has emerged as a widely adopted solution for reducing memory and computational demands.
14: This paper introduces BitDistiller, a framework that synergizes Quantization-Aware Training (QAT) with Knowledge Distillation (KD) to boost the performance of LLMs at ultra-low precisions (sub-4-bit).
15: Specifically, BitDistiller first incorporates a tailored asymmetric quantization and clipping technique to preserve the fidelity of quantized weights as far as possible. It then proposes a novel Confidence-Aware Kullback-Leibler Divergence (CAKLD) objective, applied in a self-distillation manner, to enable faster convergence and superior model performance.
16: Empirical evaluations demonstrate that BitDistiller significantly surpasses existing methods in both 3-bit and 2-bit configurations on general language understanding and complex reasoning benchmarks.
17: Notably, BitDistiller is shown to be more cost-effective, requiring less data and fewer training resources. The code is available at \url{https://github.com/DD-DuDa/BitDistiller}.
18:
19: %BitDistiller tackles two fundamental challenges in low-bit QAT with KD: preserving the fidelity of quantized weights and effectively transferring knowledge in distillation.
20:
21:
22: \end{abstract}
23: