abstract:39cb951dc7049255.tex

1: \begin{abstract}

2: The great success of deep learning relies heavily on ever-larger training sets, which come at the price of huge computational and infrastructural costs. This raises crucial questions: do all training data contribute to the model's performance? How much does each individual training sample, or each subset of the training data, affect the model's generalization? And how can we construct the \textit{smallest} subset of the entire training data to serve as a proxy training set without significantly sacrificing the model's performance? To answer these questions, we propose \textit{dataset pruning}, an optimization-based sample selection method that can (1) examine the influence of removing a particular set of training samples on the model's generalization ability with a theoretical guarantee, and (2) construct the \textit{smallest subset} of the training data that yields a strictly constrained generalization gap. The empirically observed generalization gap of dataset pruning is substantially consistent with our theoretical expectations. Furthermore, the proposed method prunes 40\% of the training examples on the CIFAR-10 dataset and halves the convergence time with only a 1.3\% decrease in test accuracy, which is superior to previous score-based sample selection methods.

3:

4:

5:

6: \end{abstract}

7: