\begin{abstract}
Knowledge distillation, i.e., training one classifier on the
outputs of another classifier, is an empirically very successful
technique for knowledge transfer between classifiers. It has even
been observed that classifiers learn much faster and more reliably
if trained with the outputs of another classifier as soft labels
rather than on ground-truth data. So far, however, there is no
satisfactory theoretical explanation of this phenomenon. In this work,
we provide the first insights into the working mechanisms of
distillation by studying the special case of linear and deep
linear classifiers. Specifically, we prove a generalization
bound that establishes fast convergence of the expected risk of a
distillation-trained linear classifier. From the bound and its proof, we
extract three key factors that determine the success of distillation:
\emph{data geometry} -- geometric properties of the data distribution,
in particular class separation, have an immediate influence on the
convergence speed of the risk;
\emph{optimization bias} -- gradient descent optimization finds a very
favorable minimum of the distillation objective;
and \emph{strong monotonicity} -- the expected risk of the student
classifier always decreases when the size of the training set grows.
\end{abstract}
