\begin{abstract}
Model aggregation, the process that updates model parameters, is an
important step for model convergence in distributed deep learning (DDL).
However, the parameter server (PS), a popular paradigm for performing
model aggregation, causes CPU underutilization in deep learning (DL)
clusters because aggregation is bursty and resources are statically allocated.
To remedy this problem, we propose \emph{Parameter Service}, an elastic model
aggregation framework for DDL training that decouples the function of model
aggregation from individual training jobs and provides a shared model
aggregation service to all jobs in the cluster.
In {\pservice}, model aggregations are efficiently packed and dynamically
migrated to fit into the available CPUs with negligible time overhead.
Furthermore, {\pservice} elastically manages its CPU resources based on
its load to improve resource efficiency.
We have implemented {\pservice} in a prototype system called {\autops} and evaluated
it via testbed experiments and trace-driven simulations.
{\autops} reduces CPU consumption by up to $75\%$ with little or no performance
impact on the training jobs.
The design of {\pservice} is transparent to users and can be incorporated
into popular DL frameworks.
\end{abstract}
