\begin{abstract}
Knowledge distillation (KD) is a popular approach for enhancing the performance of ``student'' models
with lower representational capacity by taking advantage of more powerful ``teacher'' models.
Despite its apparent simplicity and widespread use, the underlying mechanics of KD are still not fully understood.
In this work, we shed new light on the inner workings of this method by examining it from an optimization perspective.
We show that, in the context of linear and deep linear models, KD can be interpreted as a novel type of stochastic variance reduction mechanism.
We provide a detailed convergence analysis of the resulting dynamics, which holds under standard assumptions for both strongly-convex and non-convex losses,
showing that KD acts as a form of \emph{partial variance reduction}: it can reduce the stochastic gradient noise, but may not eliminate it completely, depending on the properties of the ``teacher'' model.
Our analysis further emphasizes the need for careful parametrization of KD, in particular with respect to the weighting of the distillation loss,
and is validated empirically on both linear models and deep neural networks.
\end{abstract}