\begin{abstract}
The classical analysis of Stochastic Gradient Descent (SGD) with polynomially decaying stepsize $\eta_t = \eta/\sqrt{t}$ relies on a well-tuned $\eta$ that depends on problem parameters such as the Lipschitz smoothness constant, which is often unknown in practice.
In this work, we prove that SGD with an arbitrary $\eta > 0$, referred to as \textit{untuned SGD}, still attains an order-optimal convergence rate of $\widetilde{\cO}(T^{-1/4})$ in terms of the gradient norm when minimizing smooth objectives.
Unfortunately, this rate comes at the expense of a catastrophic exponential dependence on the smoothness constant, which we show is unavoidable for this scheme even in the noiseless setting. We then examine three families of adaptive methods, namely Normalized SGD (NSGD), AMSGrad, and AdaGrad, and show that they prevent such exponential dependence in the absence of information about the smoothness parameter and the boundedness of stochastic gradients.
Our results provide a theoretical justification for the advantage of adaptive methods over untuned SGD in alleviating the issue of large gradients.
\end{abstract}