\begin{abstract}
This paper presents a principled methodology for selecting the mini-batch size that minimizes Stochastic Gradient Descent (SGD) learning time for single and multiple learner problems. By decoupling algorithmic analysis from hardware and software implementation details, we discover a robust empirical inverse law between mini-batch size and the average number of SGD updates required to converge to a specified error threshold. In doing so, we introduce a new concept, algorithmic ``noise sensitivity'', which can be used to guide algorithmic exploration. Combining this empirical inverse law with measured system performance, we create an accurate model of average training time, which allows us to identify quantifiable implications for both algorithmic and implementation aspects of machine learning. Using this model, we explain that minimizing the time to compute an epoch, or any fixed number of updates, does not necessarily minimize the total training time, because doing so neglects the dependence of convergence time on mini-batch size. We demonstrate the inverse law empirically using the MNIST,
CIFAR10, and CIFAR100 data sets, and provide a theoretical justification
by proving a novel bound on mini-batch SGD training.
% by providing specific guidance
% (1) to system designers on how to best allocate limited system
% resources for optimal SGD convergence time; and (2) to learning
% algorithm designers on which global algorithmic parameters drive optimal
% SGD convergence time.
% Using this
% relationship, we define an optimal data-parallel scaling method which
% outperforms both strong scaling and the commonly used weak scaling seen
% in much of machine learning scaling literature.
\end{abstract}