\begin{abstract}
This paper presents a methodology for selecting the mini-batch size that minimizes Stochastic Gradient Descent (SGD) learning time for single- and multiple-learner problems. By decoupling algorithmic analysis from hardware and software implementation details, we reveal a robust empirical inverse law between mini-batch size and the average number of SGD updates required to converge to a specified error threshold.
Combining this empirical inverse law with measured system performance, we create an accurate, closed-form model of average training time and show how this model can be used to identify quantifiable implications for both algorithmic and hardware aspects of machine learning.
%Using this model, we explain that minimizing the time to compute an epoch, or any fixed number of updates, does not necessarily minimize the total training time because it neglects the dependence of convergence time on mini-batch size.
We demonstrate the inverse law empirically on both image recognition (MNIST,
CIFAR10, and CIFAR100) and machine translation (Europarl) tasks, and provide a theoretical
justification by proving a novel bound on mini-batch SGD training.
% by providing specific guidance
% (1) to system designers on how to best allocate limited system
% resources for optimal SGD convergence time; and (2) to learning
% algorithm designers on which global algorithmic parameters drive optimal
% SGD convergence time.
% Using this
% relationship, we define an optimal data-parallel scaling method which
% outperforms both strong scaling and the commonly used weak scaling seen
% in much of machine learning scaling literature.
\end{abstract}