
1: \begin{abstract}

2: This paper presents a methodology for selecting the mini-batch size that minimizes Stochastic Gradient Descent (SGD) learning time for single- and multiple-learner problems. By decoupling algorithmic analysis from hardware and software implementation details, we reveal a robust empirical inverse law between mini-batch size and the average number of SGD updates required to converge to a specified error threshold.

3: Combining this empirical inverse law with measured system performance, we create an accurate, closed-form model of average training time and show how this model can be used to identify quantifiable implications for both algorithmic and hardware aspects of machine learning.

4: %Using this model, we explain that minimizing the time to compute an epoch, or any fixed number of updates, does not necessarily minimize the total training time because it neglects the dependence of convergence time on mini-batch size.

5: We demonstrate the inverse law empirically on both image recognition (MNIST,

6: CIFAR10 and CIFAR100) and machine translation (Europarl) tasks, and provide a theoretical

7: justification by proving a novel bound on mini-batch SGD training.

8: % by providing specific guidance

9: % (1) to system designers on how to best allocate limited system

10: % resources for optimal SGD convergence time; and (2) to learning

11: % algorithm designers on which global algorithmic parameters drive optimal

12: % SGD convergence time.

13: % Using this

14: % relationship, we define an optimal data-parallel scaling method which

15: % outperforms both strong scaling and the commonly used weak scaling seen

16: % in much of machine learning scaling literature.

17: \end{abstract}
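
\paragraph{Illustrative sketch.} To make concrete how an inverse law can be combined with measured system performance, the following is a minimal sketch under assumptions that the abstract does not state. The symbols $M$ (mini-batch size), $N_\infty$, $\alpha$, $a$, and $b$ are hypothetical, and the affine per-update cost below is assumed only for illustration as a stand-in for measured performance. Suppose the average number of updates needed to reach the error threshold has the form
\[
  N(M) \approx N_\infty + \frac{\alpha}{M},
\]
and the time per update is approximately $t(M) \approx a + bM$, with all four parameters positive. The average training time is then
\[
  T(M) = N(M)\,t(M) \approx N_\infty a + \alpha b + N_\infty b\,M + \frac{\alpha a}{M}.
\]
Setting $\mathrm{d}T/\mathrm{d}M = N_\infty b - \alpha a / M^{2} = 0$ gives
\[
  M^{*} = \sqrt{\frac{\alpha\,a}{N_\infty\,b}},
\]
which is a minimum because $\mathrm{d}^{2}T/\mathrm{d}M^{2} = 2\alpha a / M^{3} > 0$. In this sketch the minimizing mini-batch size depends on both the convergence parameters ($N_\infty$, $\alpha$) and the system parameters ($a$, $b$), so minimizing the time per update alone would not in general minimize the total training time.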

18: