\begin{abstract}
  We present a theoretical framework that recasts data augmentation as stochastic
  optimization for a sequence of time-varying proxy losses. This provides a unified approach
  to understanding techniques commonly thought of as data augmentation, including
  synthetic noise and label-preserving transformations, as well as more traditional
  ideas in stochastic optimization such as learning rate and batch size scheduling.
  We prove a time-varying Monro-Robbins theorem with rates of convergence, which gives
  conditions on the learning rate and augmentation schedule under which augmented
  gradient descent converges. Special cases give provably good joint schedules
  for augmentation with additive noise, minibatch SGD, and minibatch SGD with noise.
\end{abstract}