\begin{abstract}
  This paper theoretically investigates the following empirical phenomenon:
  given a high-complexity network with poor generalization bounds, one can \emph{distill}
  it into a network with nearly identical predictions but low complexity and vastly smaller
  generalization bounds.
  The main contribution is an analysis showing that the original network
  inherits this good generalization bound from its distillation, assuming the use of well-behaved data augmentation.
  This bound is presented both in an abstract and in a concrete form, the latter complemented
  by a reduction technique to handle modern computation graphs featuring convolutional layers, fully-connected
  layers, and skip connections, to name a few.
  To round out the story, a (looser) classical uniform convergence analysis of compression is also presented,
  as well as a variety of experiments on \cifar and \mnist demonstrating
  similar generalization performance
  between the original network and its distillation.
\end{abstract}