1: \begin{abstract}
2: Consider a number of workers running SGD independently on the same pool of data and periodically averaging their models, a common but not well-understood practice. We study model averaging as a variance-reducing mechanism and describe two ways in which the frequency of averaging affects convergence.
3: For convex objectives,
4: %we show that it is the gradient variance envelope that dictates whether frequent averaging is beneficial.
5: we show that the benefit of frequent averaging depends on the gradient variance envelope.
6: For non-convex objectives, we illustrate that this benefit depends on the presence of multiple optimal points.
7: %has a desirable property when the goal is a low variance solution.
8: We complement our findings with multicore experiments on both synthetic and
9: real data.
10: \iflong
11: { \color{green}
12: At one extreme,
13: one takes a single average at the end of execution, a method referred to
14: as one-shot averaging. At the other extreme, models are averaged after
15: every step. This is equivalent to mini-batching. Intuitively, the former
16: is hardware efficient, while the latter can lead to convergence in fewer steps.
17: More generally, one can choose to average the models after any number
18: of steps -- a parameter that lets us explore the full spectrum of this
19: hardware efficiency vs. statistical efficiency trade-off.
20: The question then
21: becomes: how frequently should we average to optimize for wall-clock
22: time? We share some analytic insight on the geometry of the objective
23: function. If the variance of evaluated gradients grows far
24: from the optimum, frequent averaging improves statistical efficiency.
25: Otherwise, it is as good as one-shot averaging, while incurring extra
26: communication costs at the expense of hardware efficiency. We support these
27: insights in a set of experiments.
28: }
29: \fi
30:
31: \end{abstract}
32: