\begin{abstract}
Model aggregation -- the process that updates model parameters -- is an
important step for model convergence in distributed deep learning (DDL).
However, the parameter server (PS), a popular paradigm for performing
model aggregation, causes CPU underutilization in deep learning (DL)
clusters because aggregation is bursty while resource allocation is static.
To remedy this problem, we propose \emph{Parameter Service}, an elastic model
aggregation framework for DDL training, which decouples the function of model
aggregation from individual training jobs and provides a shared model
aggregation service to all jobs in the cluster.
In {\pservice}, model aggregations are efficiently packed and dynamically
migrated to fit into the available CPUs with negligible time overhead.
Furthermore, {\pservice} can elastically manage its CPU resources based on
its load to enhance resource efficiency.
We have implemented {\pservice} in a prototype system called {\autops} and evaluated
it via testbed experiments and trace-driven simulations.
{\autops} reduces CPU consumption by up to $75\%$ with little or no performance
impact on the training jobs.
The design of {\pservice} is transparent to users and can be incorporated
into popular DL frameworks.
\end{abstract}