abstract:07bfed9042a1ce3c.tex

1: \begin{abstract}

2: %-------------------------------------------------------------------------------

3: %\vspace{-0.1cm}

4: Deep Neural Network (DNN) models have been growing steadily in size to improve their accuracy and quality.

5: Moreover, when training large DNN models, the use of heterogeneous GPUs is inevitable because of the short release cycle of new GPU architectures.

6: In this paper, we investigate how to enable training of large DNN models on a heterogeneous GPU cluster that may include wimpy GPUs that, on their own, could not be used for training.

7: We present a DNN training system, {\it HetPipe}

8: ({\it Het}erogeneous {\it Pipe}line), that integrates pipelined model parallelism (PMP) with data parallelism (DP).

9: In HetPipe, a group of multiple GPUs, called a {\it \VW}, processes minibatches in a pipelined manner, and multiple such virtual workers employ data parallelism for higher performance.

10: We also propose a novel parameter synchronization model, which we refer to as Wave Synchronous Parallel (WSP), to accommodate both PMP and DP for {\MW}, and provide a convergence proof of WSP.

11: Our experimental results on a given heterogeneous setting show that

12: with HetPipe, DNN models converge up to 49\% faster than

13: the state-of-the-art DP technique.

14: %\vspace{-0.1cm}

15: %-------------------------------------------------------------------------------

16: \end{abstract}

17: