\begin{abstract}
This paper theoretically investigates the following empirical phenomenon:
given a high-complexity network with poor generalization bounds, one can \emph{distill}
it into a network with nearly identical predictions but low complexity and vastly smaller
generalization bounds.
The main contribution is an analysis showing that the original network
inherits this good generalization bound from its distillation, assuming the use of well-behaved data augmentation.
This bound is presented both in an abstract and in a concrete form, the latter complemented
by a reduction technique to handle modern computation graphs featuring, among other components,
convolutional layers, fully-connected layers, and skip connections.
To round out the story, a (looser) classical uniform convergence analysis of compression is also presented,
as well as a variety of experiments on \cifar and \mnist demonstrating
similar generalization performance
between the original network and its distillation.
\end{abstract}
