\begin{abstract}
Knowledge distillation, i.e., one classifier being trained on the
outputs of another classifier, is an empirically very successful
technique for knowledge transfer between classifiers. It has even
been observed that classifiers learn much faster and more reliably
when trained on the outputs of another classifier as soft labels
rather than on ground-truth data. So far, however, there is no
satisfactory theoretical explanation of this phenomenon. In this work,
we provide the first insights into the working mechanisms of
distillation by studying the special case of linear and deep
linear classifiers. Specifically, we prove a generalization
bound that establishes fast convergence of the expected risk of a
distillation-trained linear classifier. From the bound and its proof we
extract three key factors that determine the success of distillation:
\emph{data geometry} -- geometric properties of the data distribution,
in particular class separation, have an immediate influence on the
convergence speed of the risk;
\emph{optimization bias} -- gradient descent optimization finds a very
favorable minimum of the distillation objective;
and \emph{strong monotonicity} -- the expected risk of the student
classifier always decreases when the size of the training set grows.
\end{abstract}