\begin{abstract}
The classical analysis of Stochastic Gradient Descent (SGD) with polynomially decaying stepsize $\eta_t = \eta/\sqrt{t}$ relies on a well-tuned $\eta$ that depends on problem parameters such as the Lipschitz smoothness constant, which is often unknown in practice.
In this work, we prove that SGD with arbitrary $\eta > 0$, referred to as \textit{untuned SGD}, still attains an order-optimal convergence rate of $\widetilde{\cO}(T^{-1/4})$ in terms of the gradient norm for minimizing smooth objectives.
Unfortunately, this comes at the expense of a catastrophic exponential dependence on the smoothness constant, which we show is unavoidable for this scheme even in the noiseless setting. We then examine three families of adaptive methods, namely Normalized SGD (NSGD), AMSGrad, and AdaGrad, unveiling their power in preventing such exponential dependency in the absence of information about the smoothness parameter and boundedness of stochastic gradients.
Our results provide theoretical justification for the advantage of adaptive methods over untuned SGD in alleviating the issue with large gradients.
\end{abstract}