
\begin{abstract}

Here we develop variants of SGD (stochastic gradient descent) with an adaptive step size that make use of the sampled loss values. In particular, we focus on solving a finite sum-of-terms problem, also known as empirical risk minimization. We first detail an idealized adaptive method called \texttt{SPS}$_+$ that makes use of the sampled loss values and assumes knowledge of the sampled loss at optimality. This \texttt{SPS}$_+$ is a minor modification of the SPS (Stochastic Polyak Stepsize) method, where the step size is enforced to be positive. We then show that \texttt{SPS}$_+$ achieves the best known rates of convergence for SGD in the Lipschitz non-smooth setting. We then move on to develop \FUVAL{}, a variant of \texttt{SPS}$_+$ where the loss values at optimality are gradually learned, as opposed to being given. We give three viewpoints of \FUVAL{}: as a projection-based method, as a variant of the prox-linear method, and as a particular online SGD method. We then present a convergence analysis of \FUVAL{} and experimental results. One shortcoming of our work is that the convergence analysis of \FUVAL{} shows no advantage over SGD. Another shortcoming is that currently only the full batch version of \FUVAL{} shows a minor advantage over GD (Gradient Descent) in terms of sensitivity to the step size. The stochastic version shows no clear advantage over SGD. We conjecture that large mini-batches are required to make \FUVAL{} competitive.



Currently the new \FUVAL{} method studied in this paper does not offer any clear theoretical or practical advantage.

We have chosen to make this draft available online nonetheless because of some of the analysis techniques we use, such as the non-smooth analysis of \texttt{SPS}$_+$, and also to show an apparently interesting approach that currently does not work.

\end{abstract}
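
For concreteness, the \texttt{SPS}$_+$ step described in the abstract can be sketched as follows. The notation is an assumption made here for illustration and may differ from the notation used in the body of the paper: $f_i$ denotes the loss sampled at iteration $t$, $w^t$ the current iterate, $w^*$ a minimizer of the full objective, and $(a)_+ := \max\{a, 0\}$ the positive part.
\[
  w^{t+1} \;=\; w^t \;-\; \frac{\bigl(f_i(w^t) - f_i(w^*)\bigr)_+}{\|\nabla f_i(w^t)\|^2}\, \nabla f_i(w^t).
\]
% Assumption: when \nabla f_i(w^t) = 0 the step is taken to be zero, i.e. w^{t+1} = w^t.
In this sketch the positive part is what keeps the step size from becoming negative when $f_i(w^t) < f_i(w^*)$, which can happen because $w^*$ minimizes the average of the $f_i$ and not necessarily each $f_i$ individually. In the non-smooth setting, $\nabla f_i(w^t)$ should be read as a subgradient of $f_i$ at $w^t$.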

8: