abstract:76744099fd7a497e.tex

1: \begin{abstract}

2: The classical analysis of Stochastic Gradient Descent (SGD) with polynomially decaying stepsize $\eta_t = \eta/\sqrt{t}$ relies on a well-tuned $\eta$ that depends on problem parameters such as the Lipschitz smoothness constant, which is often unknown in practice.

3: In this work, we prove that SGD with arbitrary $\eta > 0$, referred to as \textit{untuned SGD}, still attains an order-optimal convergence rate $\widetilde{\cO}(T^{-1/4})$ in terms of gradient norm for minimizing smooth objectives.

4: Unfortunately, this guarantee comes at the expense of a catastrophic exponential dependence on the smoothness constant, which we show is unavoidable for this scheme even in the noiseless setting. We then examine three families of adaptive methods, namely Normalized SGD (NSGD), AMSGrad, and AdaGrad, and show that they prevent such exponential dependence in the absence of information about the smoothness parameter and boundedness of stochastic gradients.

5: Our results provide theoretical justification for the advantage of adaptive methods over untuned SGD in alleviating the issue with large gradients.

6:

7: \end{abstract}

8: