\begin{abstract}
Model aggregation, the process that updates model parameters, is an
important step for model convergence in distributed deep learning (DDL).
However, the parameter server (PS), a popular paradigm for performing
model aggregation, causes CPU underutilization in deep learning (DL)
clusters because aggregation is bursty while resource allocation is static.
To remedy this problem, we propose \emph{Parameter Service}, an elastic model
aggregation framework for DDL training that decouples the function of model
aggregation from individual training jobs and provides a shared model
aggregation service to all jobs in the cluster.
In {\pservice}, model aggregations are efficiently packed and dynamically
migrated to fit into the available CPUs with negligible time overhead.
Furthermore, {\pservice} elastically manages its CPU resources based on
its load to improve resource efficiency.
We have implemented {\pservice} in a prototype system called {\autops} and evaluated
it through testbed experiments and trace-driven simulations.
{\autops} reduces CPU consumption by up to $75\%$ with little or no performance
impact on the training jobs.
The design of {\pservice} is transparent to users and can be incorporated
into popular DL frameworks.

\end{abstract}