\begin{abstract}
Fully quantized training (FQT), which uses low-bitwidth hardware by quantizing the activations, weights, and gradients of a neural network model, is a promising approach to accelerate the training of deep neural networks.
One major challenge with FQT is the lack of theoretical understanding, in particular of how gradient quantization impacts convergence properties.
In this paper, we address this problem by presenting a statistical framework for analyzing FQT algorithms.
We view the quantized gradient of FQT as a stochastic estimator of the gradient of its full-precision counterpart, a procedure known as quantization-aware training (QAT).
We show that the FQT gradient is an unbiased estimator of the QAT gradient, and we discuss the impact of gradient quantization on its variance.
Inspired by these theoretical results, we develop two novel gradient quantizers, and we show that they have smaller variance than the existing per-tensor quantizer.
For training ResNet-50 on ImageNet, our 5-bit block Householder quantizer achieves only 0.5\% validation accuracy loss relative to QAT, comparable to the existing INT8 baseline. Our code is publicly available at \url{https://github.com/cjf00000/StatQuant}.
\end{abstract}