\begin{abstract}
We present a theoretical framework recasting data augmentation as stochastic
optimization for a sequence of time-varying proxy losses. This provides a unified approach
to understanding techniques commonly thought of as data augmentation, including
synthetic noise and label-preserving transformations, as well as more traditional
ideas in stochastic optimization such as learning rate and batch size scheduling.
We prove a time-varying Monro-Robbins theorem with rates of convergence, which gives
conditions on the learning rate and augmentation schedule under which augmented
gradient descent converges. Special cases give provably good joint schedules
for augmentation with additive noise, minibatch SGD, and minibatch SGD with noise.
\end{abstract}
