\begin{abstract}
Knowledge distillation (KD) is a popular approach for enhancing the performance of ``student'' models
with lower representational capacity by taking advantage of more powerful ``teacher'' models.
Despite its apparent simplicity and widespread use, the mechanics underlying KD are still not fully understood.
In this work, we shed new light on the inner workings of this method by examining it from an optimization perspective.
We show that, in the context of linear and deep linear models, KD can be interpreted as a novel type of stochastic variance reduction mechanism.
We provide a detailed convergence analysis of the resulting dynamics, which holds under standard assumptions for both strongly convex and non-convex losses,
showing that KD acts as a form of \emph{partial variance reduction}: it can reduce the stochastic gradient noise but may not eliminate it completely, depending on the properties of the ``teacher'' model.
Our analysis further emphasizes the need for careful parametrization of KD, in particular with respect to the weighting of the distillation loss,
and is validated empirically on both linear models and deep neural networks.
\end{abstract}
