\begin{abstract}
In this paper, we propose a simple variant of the original stochastic \emph{variance reduced} gradient (SVRG) method~\cite{johnson:svrg}, which we hereafter refer to as \emph{variance reduced stochastic gradient descent} (VR-SGD). In contrast to the choices of the \emph{snapshot point} and \emph{starting point} in SVRG and its proximal variant, Prox-SVRG~\cite{xiao:prox-svrg}, VR-SGD sets these two vectors in each epoch to the \emph{average} and the \emph{last iterate} of the previous epoch, respectively. This setting allows us to use much larger learning rates (step sizes) than SVRG, e.g., $3/(7L)$ for VR-SGD vs.\ $1/(10L)$ for SVRG, but it also makes our convergence analysis more challenging. In fact, the larger learning rate enjoyed by VR-SGD means that the variance of its stochastic gradient estimator asymptotically approaches zero more rapidly. Unlike common stochastic methods such as SVRG and proximal stochastic methods such as Prox-SVRG, VR-SGD uses two different update rules for \emph{smooth} and \emph{non-smooth} objective functions, respectively. In other words, VR-SGD can tackle non-smooth and/or non-strongly convex problems directly, without resorting to reduction techniques such as adding quadratic regularizers. Moreover, we analyze the \emph{convergence properties} of VR-SGD for \emph{strongly convex} problems and show that it attains a \emph{linear} convergence rate. We also provide \emph{convergence guarantees} for VR-SGD on \emph{non-strongly convex} problems. Experimental results show that VR-SGD performs significantly better than its counterparts, SVRG and Prox-SVRG, and also much better than the \emph{best known} stochastic method, Katyusha~\cite{zhu:Katyusha}.
\end{abstract}
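
As a rough illustration of the epoch structure summarized in the abstract, the two choices can be written out explicitly. The notation here is assumed for illustration only and may differ from the algorithm listings in the paper: $x_k^{s}$ denotes the $k$-th inner iterate of epoch $s$, $m$ the number of inner iterations per epoch, $\widetilde{x}^{s}$ the snapshot point, and $f = \frac{1}{n}\sum_{i=1}^{n} f_i$ the smooth part of the objective. Assuming a uniform average over the $m$ inner iterates, the snapshot point and the starting point of the next epoch would be
\[
\widetilde{x}^{s} = \frac{1}{m}\sum_{k=1}^{m} x_k^{s},
\qquad
x_0^{s+1} = x_m^{s},
\]
and the snapshot enters the usual SVRG-type gradient estimator at a point $x$ with sampled index $i$,
\[
\widetilde{\nabla} f_{i}(x) = \nabla f_{i}(x) - \nabla f_{i}(\widetilde{x}^{s}) + \nabla f(\widetilde{x}^{s}).
\]
For the step sizes quoted in the abstract, the ratio is
\[
\frac{3/(7L)}{1/(10L)} = \frac{30}{7} \approx 4.3,
\]
so the learning rate admitted by VR-SGD in that example is roughly four times the one quoted for SVRG.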