1: \begin{abstract}
2: Recently, the stochastic Polyak step size (\texttt{SPS}) has emerged as a competitive adaptive step size scheme for stochastic gradient descent. Here we develop \texttt{ProxSPS}, a \textit{proximal} variant of \texttt{SPS} that can handle regularization terms. Developing a proximal variant of \texttt{SPS} is particularly important, since \texttt{SPS} requires a lower bound of the objective function to work well. When the objective function is the sum of a loss and a regularizer, available estimates of a lower bound of the sum can be loose. In contrast, \texttt{ProxSPS} only requires a lower bound for the loss, which is often readily available.
3: % easier to estimate.
4: As a consequence, we show that \texttt{ProxSPS} is easier to tune and more stable in the presence of regularization. Furthermore, for image classification tasks, \texttt{ProxSPS} performs as well as \texttt{AdamW} with little to no tuning, and results in a network with smaller weight parameters.
5: %while being less sensitive to the regularization strength and resulting in a network with smaller weight parameters.
6: We also provide an extensive convergence analysis for \texttt{ProxSPS} that covers the non-smooth, smooth, weakly convex, and strongly convex settings.
7: %In particular, we focus on squared $\ell_2$-regularization which is popular in deep learning and for which \texttt{ProxSPS} has closed-form updates. We show empirically that a naive handling of regularization can lead to suboptimal results for adaptive step size schemes.
8: \end{abstract}
9: