abstract:acd97fe0064db68f.tex

1: \begin{abstract}

2: Stochastic gradient descent (SGD) is a simple and popular method to solve stochastic optimization problems which arise in machine learning. For strongly convex problems, its convergence rate was known to be $\Ocal(\log(T)/T)$, by running SGD for $T$ iterations and returning the average point. However, recent results showed that using a different algorithm, one can get an optimal $\Ocal(1/T)$ rate. This might lead one to believe that standard SGD is suboptimal, and maybe should even be replaced as a method of choice. In this paper, we investigate the optimality of SGD in a stochastic setting. We show that for smooth problems, the algorithm attains the optimal $\Ocal(1/T)$ rate. However, for non-smooth problems, the convergence rate with averaging might really be $\Omega(\log(T)/T)$, and this is not just an artifact of the analysis. On the flip side, we show that a simple modification of the averaging step suffices to recover the $\Ocal(1/T)$ rate, and no other change of the algorithm is necessary. We also present experimental results which support our findings, and point out open problems.

3: \end{abstract}

4: