
\begin{abstract}

Recently, the stochastic Polyak step size (\texttt{SPS}) has emerged as a competitive adaptive step size scheme for stochastic gradient descent. Here we develop \texttt{ProxSPS}, a \textit{proximal} variant of \texttt{SPS} that can handle regularization terms. Developing a proximal variant of \texttt{SPS} is particularly important, since \texttt{SPS} requires a lower bound of the objective function to work well. When the objective function is the sum of a loss and a regularizer, available estimates of a lower bound of the sum can be loose. In contrast, \texttt{ProxSPS} only requires a lower bound for the loss, which is often readily available.

% easier to estimate.

As a consequence, we show that \texttt{ProxSPS} is easier to tune and more stable in the presence of regularization. Furthermore, for image classification tasks, \texttt{ProxSPS} performs as well as \texttt{AdamW} with little to no tuning, and results in a network with smaller weight parameters.

%while being less sensitive to the regularization strength and resulting in a network with smaller weight parameters.

We also provide an extensive convergence analysis for \texttt{ProxSPS} that covers the non-smooth, smooth, weakly convex, and strongly convex settings.

%In particular, we focus on squared $\ell_2$-regularization which is popular in deep learning and for which \texttt{ProxSPS} has closed-form updates. We show empirically that a naive handling of regularization can lead to suboptimal results for adaptive step size schemes.

\end{abstract}
