\begin{abstract}
  This paper theoretically investigates the following empirical phenomenon:
  given a high-complexity network with poor generalization bounds, one can \emph{distill}
  it into a network with nearly identical predictions but low complexity and vastly smaller
  generalization bounds.
  The main contribution is an analysis showing that the original network
  inherits this good generalization bound from its distillation, assuming the use of well-behaved data augmentation.
  This bound is presented both in an abstract and in a concrete form, the latter complemented
  by a reduction technique to handle modern computation graphs featuring convolutional layers, fully-connected
  layers, and skip connections, to name a few.
  To round out the story, a (looser) classical uniform convergence analysis of compression is also presented,
  as well as a variety of experiments on \cifar and \mnist demonstrating
  similar generalization performance
  between the original network and its distillation.
\end{abstract}