\begin{abstract}
We develop variants of SGD (stochastic gradient descent) with an adaptive step size that make use of the sampled loss values. In particular, we focus on solving a finite sum-of-terms problem, also known as empirical risk minimization. We first detail an idealized adaptive method called \texttt{SPS}$_+$ that makes use of the sampled loss values and assumes knowledge of the sampled loss at optimality. \texttt{SPS}$_+$ is a minor modification of the SPS (Stochastic Polyak Stepsize) method in which the step size is enforced to be positive. We show that \texttt{SPS}$_+$ achieves the best known rates of convergence for SGD in the Lipschitz non-smooth setting. We then develop \FUVAL{}, a variant of \texttt{SPS}$_+$ in which the loss values at optimality are gradually learned, as opposed to being given. We give three viewpoints of \FUVAL{}: as a projection-based method, as a variant of the prox-linear method, and as a particular online SGD method. We then present a convergence analysis of \FUVAL{} and experimental results. One shortcoming of our work is that the convergence analysis of \FUVAL{} shows no advantage over SGD. Another shortcoming is that currently only the full-batch version of \FUVAL{} shows a minor advantage over GD (Gradient Descent) in terms of sensitivity to the step size. The stochastic version shows no clear advantage over SGD. We conjecture that large mini-batches are required to make \FUVAL{} competitive.

Currently, the new \FUVAL{} method studied in this paper does not offer any clear theoretical or practical advantage.
We have chosen to make this draft available online nonetheless because of some of the analysis techniques we use, such as the non-smooth analysis of \texttt{SPS}$_+$, and also to show an apparently interesting approach that currently does not work.
\end{abstract}