abstract:3420159aebaf4b08.tex

1: \begin{abstract}

2: %-------------------------------------------------------------------------------

3: Training large DNN models on sophisticated GPU platforms

4: with diverse interconnect capabilities is a challenging task.

5: Recently, pipelined training has been proposed

6: as an effective approach for improving device utilization.

7: However, several difficult issues remain:

8: improving computing efficiency while ensuring convergence, and

9: reducing memory usage without incurring additional computing costs.

10: We propose \emph{DAPPLE}, a synchronous training framework that combines

11: data parallelism and pipeline parallelism for large DNN models.

12: It features a novel parallelization strategy \emph{planner} %for synchronous training(friendly for model convergence)

13: that solves the partition and placement problems and explores optimal hybrid strategies of data and pipeline parallelism.

14: We also propose a new runtime scheduling algorithm to reduce device

15: memory usage, which is orthogonal to the re-computation approach and does not come

16: at the expense of training throughput.

17: Experiments show that \emph{DAPPLE planner} consistently outperforms strategies generated by PipeDream's planner, with speedups of up to $3.23\times$ under synchronous training scenarios, and that \emph{DAPPLE runtime} outperforms GPipe by $1.6\times$ in training throughput while reducing memory consumption by 12\%.

18: %given a fixed global batch size,

19: % \emph{DAPPLE} outperforms the best data parallelism baselines with 1.71X/1.37X/1.79X % (up to 2.32X for GNMT-16% on \cb{config $C$})

20: % training speedups on three typical cluster environments.

21: % Note: these numbers are calculated when **GBS=128** for all models

22:

23: \end{abstract}

24: