
\begin{abstract} In this paper we study the problem of minimizing the average of a large number ($n$) of smooth convex loss functions. We propose a new method, S2GD (Semi-Stochastic Gradient Descent), which runs for one or several epochs; in each epoch, a single full gradient and a random number of stochastic gradients, following a geometric law, are computed. The total work needed for the method to output an $\varepsilon$-accurate solution in expectation, measured in the number of passes over data, or equivalently, in units equivalent to the computation of a single gradient of the empirical loss, is $O((\kappa/n)\log(1/\varepsilon))$, where $\kappa$ is the condition number. This is achieved by running the method for $O(\log(1/\varepsilon))$ epochs, with a single full gradient evaluation and $O(\kappa)$ stochastic gradient evaluations in each. The SVRG method of Johnson and Zhang \cite{svrg} arises as a special case. If our method is limited to a single epoch only, it needs to evaluate at most $O((\kappa/\varepsilon)\log(1/\varepsilon))$ stochastic gradients. In contrast, SVRG requires $O(\kappa/\varepsilon^2)$ stochastic gradients. To illustrate our theoretical results, S2GD only needs the workload equivalent to about 2.1 full gradient evaluations to find a $10^{-6}$-accurate solution for a problem with $n=10^9$ and $\kappa=10^3$.


%an algorithm which needs $O(\log (1/\varepsilon))$ evaluations of the gradient and $\kappa \log(1/\varepsilon)$ evaluations of the stochastic gradient to output an $\varepsilon$-optimal solution of an unconstrained smooth strongly convex  problem

%

%. In each epoch we first compute the full gradient, and then a random number (following a geometric law) of stochastic gradients. We obtain the best known linear rate of convergence in expectation.

\end{abstract}
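
% Illustrative work accounting (not part of the original abstract).
The per-epoch costs quoted in the abstract can be turned into a rough work count. This is only a sketch: the symbols $j$ (number of epochs) and $m$ (number of stochastic gradient evaluations per epoch) are introduced here for illustration, and all constants hidden in the $O(\cdot)$ notation are ignored. One epoch evaluates $n$ component gradients for the full gradient plus $m$ stochastic gradients, so, measured in units of one full gradient,
\[
  \text{work}
  \;=\; j \cdot \frac{n + m}{n}
  \;=\; j \left( 1 + \frac{m}{n} \right),
  \qquad j = O(\log(1/\varepsilon)), \quad m = O(\kappa).
\]
For the example with $n = 10^9$ and $\kappa = 10^3$ we have $\kappa / n = 10^{-6}$, so the per-epoch overhead $m/n$ of the stochastic steps stays small as long as the constant hidden in $m = O(\kappa)$ is much smaller than $n/\kappa = 10^6$. Under that assumption, the total work is only slightly more than $j$ full gradient evaluations.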
