63ac2ff0825e7699.tex
1: \begin{abstract}
2: Knowledge distillation, i.e., one classifier being trained on the 
3: outputs of another classifier, is an empirically very successful 
4: technique for knowledge transfer between classifiers. It has even 
5: been observed that classifiers learn much faster and more reliably 
6: if trained with the outputs of another classifier as soft labels, 
7: instead of from ground truth data. So far, however, there is no 
8: satisfactory theoretical explanation of this phenomenon. In this work, 
9: we provide the first insights into the working mechanisms of 
10: distillation by studying the special case of linear and deep 
11: linear classifiers. Specifically, we prove a generalization 
12: bound that establishes fast convergence of the expected risk of a 
13: distillation-trained linear classifier. From the bound and its proof we 
14: extract three key factors that determine the success of distillation: 
15: \emph{data geometry} -- geometric properties of the data distribution, 
16: in particular class separation, have an immediate influence on the 
17: convergence speed of the risk;
18: \emph{optimization bias} -- gradient descent optimization finds a very 
19: favorable minimum of the distillation objective; 
20: and \emph{strong monotonicity} -- the expected risk of the student 
21: classifier always decreases when the size of the training set grows.
22: \end{abstract}
23: