1: \begin{abstract}
2: We present a differentiable joint pruning and quantization (DJPQ) scheme. We frame neural network compression as a joint gradient-based optimization problem that automatically trades off model pruning against quantization for hardware efficiency. DJPQ incorporates structured pruning based on the variational information bottleneck and mixed-bit precision quantization into a single differentiable loss function. In contrast to previous work, which considers pruning and quantization separately, our method enables users to find the optimal trade-off between the two in a single training procedure. To make the method suitable for more efficient hardware inference, we extend DJPQ to integrate structured pruning with power-of-two bit-restricted quantization.
3: %The scheme has potential advantages in convergence rate and training stability over other approaches.
4: We show that DJPQ significantly reduces the number of bit operations (BOPs) for several networks while maintaining the top-1 accuracy of the original floating-point models (e.g.,
5: $53\times$ BOPs reduction for ResNet18 on ImageNet, $43\times$ for MobileNetV2).
6: Compared to the conventional two-stage approach, which optimizes pruning and quantization independently, our scheme achieves both higher accuracy and fewer BOPs. Even with bit-restricted quantization, DJPQ achieves larger compression ratios and better accuracy than the two-stage approach.
7:
8:
9: %To take into account of hardware constraints, we further extend DJPQ at little extra cost to the case where quantization bits are restricted to power of two integers. The proposed scheme is able to learn mixed-precision bit-width for restricted scenarios under a unified framework. We show that DJPQ is able to achieve a large compression ratio even with bit restricted quantization.
10:
11: \keywords{Joint optimization, model compression, mixed precision, bit-restriction, variational information bottleneck, quantization}
12: \end{abstract}
13: