beaf421b7ab83b08.tex
1: \begin{abstract} 
2: Stochastic gradient descent (SGD) is
3: the cornerstone of modern machine learning (ML) systems.
4: Despite its computational 
5: efficiency, SGD requires 
6: random data access that is inherently inefficient when implemented 
7: in systems that rely on \emph{block-addressable secondary storage} such as 
8: HDD and SSD, e.g., TensorFlow/PyTorch and in-DB ML systems over large files.
9: To address this impedance mismatch, 
10: various data shuffling strategies have been proposed
11: to balance the convergence rate of SGD (which favors randomness) 
12: and its I/O performance (which favors sequential access).
13: 
14: In this paper, we first conduct a systematic empirical study on 
15: existing data shuffling strategies, which reveals that 
16: all existing strategies have room for improvement---they all suffer in terms of I/O performance \emph{or} convergence rate. 
17: With this in mind, we propose a simple but novel \emph{hierarchical} data shuffling strategy, \sys.
18: Compared with existing strategies, \sys \emph{avoids} a full data shuffle while maintaining \emph{comparable convergence rate} of SGD as if a full shuffle were performed. 
19: We provide a non-trivial theoretical 
20: analysis of \sys on its convergence behavior. We further integrate \sys into PyTorch by designing new parallel/distributed shuffle operators inside a new \texttt{CorgiPileDataSet} API. We also integrate \sys into PostgreSQL by introducing 
21: three new \textit{physical} operators with optimizations.
22: Our experimental results show that \sys
23: can achieve comparable convergence rate with the full shuffle based SGD for both deep learning and generalized linear models. For deep learning models on ImageNet dataset, \sys is 1.5$\times$ faster than PyTorch with full data shuffle. For in-DB ML with linear models, \sys is 1.6$\times$-12.8$\times$ faster than two state-of-the-art in-DB ML systems, Apache MADlib and Bismarck, on both HDD and SSD.
24: \end{abstract}
25: