\begin{abstract}
In recent years, advances in many applications
have been driven by the use of Machine Learning (ML).
Nowadays,
it is common to see industrial-strength machine learning jobs that involve millions of model parameters, terabytes of training data, and weeks of training.
Good efficiency, i.e., fast completion time of a specific ML job, is therefore a key feature of a successful ML system.
While the completion time of a long-running ML job is determined by the time required to reach model convergence,
in practice it is also largely influenced by the values of various system settings.
In this paper, we contribute techniques towards building \emph{self-tuning parameter servers}.
The Parameter Server (PS) is a popular system architecture for large-scale machine learning systems;
by self-tuning we mean that,
while a long-running ML job is iteratively training the expert-suggested model,
the system is also iteratively learning which system setting is more efficient for that job and
applying it online.
While our techniques are general enough to apply to various PS-style ML systems,
we have prototyped them
%, namely,
%(1) online job optimization,
%(2) online job progress estimation framework, and
%(3) online system reconfiguration,
on top of {\sf TensorFlow}.
Experiments show that our techniques can reduce the completion times of
a variety of long-running {\sf TensorFlow} jobs by factors ranging from 1.4$\times$ to 18$\times$.
\end{abstract}