abstract:beaf421b7ab83b08.tex

1: \begin{abstract}

2: Stochastic gradient descent (SGD) is

3: the cornerstone of modern machine learning (ML) systems.

4: Despite its computational

5: efficiency, SGD requires

6: random data access that is inherently inefficient when implemented

7: in systems that rely on \emph{block-addressable secondary storage} such as

8: HDD and SSD, e.g., TensorFlow/PyTorch and in-DB ML systems over large files.

9: To address this impedance mismatch,

10: various data shuffling strategies have been proposed

11: to balance the convergence rate of SGD (which favors randomness)

12: and its I/O performance (which favors sequential access).

13:

14: In this paper, we first conduct a systematic empirical study on

15: existing data shuffling strategies, which reveals that

16: all existing strategies have room for improvement---they all suffer in terms of I/O performance \emph{or} convergence rate.

17: With this in mind, we propose a simple but novel \emph{hierarchical} data shuffling strategy, \sys.

18: Compared with existing strategies, \sys \emph{avoids} a full data shuffle while maintaining \emph{comparable convergence rate} of SGD as if a full shuffle were performed.

19: We provide a non-trivial theoretical

20: analysis of \sys on its convergence behavior. We further integrate \sys into PyTorch by designing new parallel/distributed shuffle operators inside a new \texttt{CorgiPileDataSet} API. We also integrate \sys into PostgreSQL by introducing

21: three new \textit{physical} operators with optimizations.

22: Our experimental results show that \sys

23: can achieve comparable convergence rate with the full shuffle based SGD for both deep learning and generalized linear models. For deep learning models on ImageNet dataset, \sys is 1.5$\times$ faster than PyTorch with full data shuffle. For in-DB ML with linear models, \sys is 1.6$\times$-12.8$\times$ faster than two state-of-the-art in-DB ML systems, Apache MADlib and Bismarck, on both HDD and SSD.

24: \end{abstract}

25: