\begin{abstract}
Knowledge distillation (KD) is a popular approach for enhancing the performance of ``student'' models
with lower representational capacity by taking advantage of more powerful ``teacher'' models.
Despite its apparent simplicity and widespread use, the underlying mechanics of KD are still not fully understood.
In this work, we shed new light on the inner workings of this method by examining it from an optimization perspective.
We show that, in the context of linear and deep linear models, KD can be interpreted as a novel type of stochastic variance reduction mechanism.
We provide a detailed convergence analysis of the resulting dynamics, which holds under standard assumptions for both strongly convex and non-convex losses,
showing that KD acts as a form of \emph{partial variance reduction}: it can reduce the stochastic gradient noise but may not eliminate it completely, depending on the properties of the ``teacher'' model.
Our analysis further emphasizes the need for careful parametrization of KD, in particular with respect to the weighting of the distillation loss,
and is validated empirically on both linear models and deep neural networks.
\end{abstract}