\begin{abstract}

\vspace{-10pt}

Knowledge distillation transfers knowledge from a large model to a small one via task and distillation losses.
In this paper, we observe a \textbf{trade-off} between the task and distillation losses, \ie, introducing the distillation loss limits the convergence of the task loss.
We believe that this trade-off results from \textit{insufficient} optimization of the distillation loss. The reasoning is as follows: the teacher has a lower task loss than the student, and a lower distillation loss makes the student more similar to the teacher, so a better-converged task loss could be obtained.
To break the trade-off, we propose the Distillation-Oriented Trainer~(DOT).
DOT considers the gradients of the task and distillation losses separately, then applies a larger momentum to the distillation loss to accelerate its optimization. We empirically show that DOT breaks the trade-off, \ie, both losses are sufficiently optimized.
Extensive experiments validate the superiority of DOT. Notably, DOT achieves a \textbf{+2.59\%} accuracy improvement on ImageNet-1k for the ResNet50-MobileNetV1 pair. Overall, DOT greatly benefits the student's optimization in terms of loss convergence and model generalization. Code will be made publicly available.
\end{abstract}
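
% Illustrative sketch only: the notation below is assumed for exposition and is not taken from the abstract.
The update summarized in the abstract can be sketched as follows, under stated assumptions. Let $\theta$ denote the student parameters, $g_{\mathrm{task}}$ and $g_{\mathrm{dist}}$ the gradients of the task and distillation losses with respect to $\theta$ (any loss weights absorbed into the gradients), $\gamma$ the learning rate, and $\mu_{\mathrm{task}} < \mu_{\mathrm{dist}}$ two momentum coefficients. These symbols, and the choice of heavy-ball SGD with one momentum buffer per loss ($v_{\mathrm{task}}$, $v_{\mathrm{dist}}$), are our assumptions rather than details stated in the abstract. One optimization step could then read
\[
\begin{array}{rcl}
v_{\mathrm{task}} & \leftarrow & \mu_{\mathrm{task}}\, v_{\mathrm{task}} + g_{\mathrm{task}}, \\
v_{\mathrm{dist}} & \leftarrow & \mu_{\mathrm{dist}}\, v_{\mathrm{dist}} + g_{\mathrm{dist}}, \\
\theta & \leftarrow & \theta - \gamma \left( v_{\mathrm{task}} + v_{\mathrm{dist}} \right).
\end{array}
\]
When $\mu_{\mathrm{task}} = \mu_{\mathrm{dist}} = \mu$, the sum $v = v_{\mathrm{task}} + v_{\mathrm{dist}}$ satisfies $v \leftarrow \mu v + g_{\mathrm{task}} + g_{\mathrm{dist}}$, so the sketch reduces to ordinary momentum SGD on the combined loss. Choosing $\mu_{\mathrm{dist}} > \mu_{\mathrm{task}}$ lets past distillation gradients accumulate over a longer horizon, which is one way to realize the ``larger momentum to the distillation loss'' described above.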