\begin{abstract}
In recent years, advances in many applications
have been driven by the use of Machine Learning (ML).
Nowadays,
it is common to see industrial-strength machine learning jobs that involve millions of model parameters, terabytes of training data, and weeks of training.
Good efficiency, i.e., fast completion of a given ML job, is therefore a key feature of a successful ML system.
While the completion time of a long-running ML job is determined by the time required to reach model convergence,
in practice it is also largely influenced by the values of various system settings.
In this paper, we contribute techniques towards building \emph{self-tuning parameter servers}.
The Parameter Server (PS) is a popular system architecture for large-scale machine learning systems,
and by self-tuning we mean that,
while a long-running ML job is iteratively training the expert-suggested model,
the system is also iteratively learning which system setting is more efficient for that job and
applying it online.
While our techniques are general enough to apply to various PS-style ML systems,
we have prototyped our techniques
%, namely,
%(1) online job optimization,
%(2) online job progress estimation framework, and
%(3) online system reconfiguration,
on top of {\sf TensorFlow}.
Experiments show that our techniques can reduce the completion times of
a variety of long-running {\sf TensorFlow} jobs by 1.4$\times$ to 18$\times$.
\end{abstract}