\begin{abstract}
	Consider a number of workers running SGD independently on the same pool of data and averaging their models every once in a while, a common but poorly understood practice. We study model averaging as a variance-reducing mechanism and describe two ways in which the frequency of averaging affects convergence.
   For convex objectives,
   %we show that it is the gradient variance envelope that dictates whether frequent averaging is beneficial.
   we show that the benefit of frequent averaging depends on the gradient variance envelope.
   For non-convex objectives, we illustrate that this benefit depends on the presence of multiple optimal points.
   %has a desirable property when the goal is a low variance solution. 
   We complement our findings with multicore experiments on both synthetic and
   real data.
  \iflong
  { \color{green}
  At one extreme, one takes a single average at the end of execution,
  a method referred to as one-shot averaging. At the other extreme,
  models are averaged after every step, which is equivalent to
  mini-batching. Intuitively, the former is hardware efficient, while
  the latter can lead to convergence in fewer steps. More generally,
  one can choose to average the models after any number of steps, a
  parameter that lets us explore the full spectrum of this hardware
  efficiency vs.\ statistical efficiency trade-off.
  The question then becomes: how frequently should we average to
  optimize for wall-clock time? We offer some analytic insight based on
  the geometry of the objective function. If the variance of the
  evaluated gradients grows with the distance from the optimum,
  frequent averaging improves statistical efficiency. Otherwise, it is
  only as good as one-shot averaging, while incurring extra
  communication costs at the expense of hardware efficiency. We support
  these insights with a set of experiments.
  }
  \fi

\end{abstract}