\begin{abstract}
    We focus on the classification problem with a separable dataset, one of the most important and classical problems in machine learning. The standard approach to this task is \emph{logistic regression with gradient descent} (LR+GD). Recent studies have observed that LR+GD can find a solution with arbitrarily large step sizes, defying conventional optimization theory. Our work investigates this phenomenon and makes three interconnected key observations about LR+GD with large step sizes.
    First, we find a remarkably simple explanation of why LR+GD with large step sizes solves the classification problem: LR+GD reduces to a batch version of the celebrated perceptron algorithm as the step size $\gamma \to \infty$.
    Second, we observe that larger step sizes lead LR+GD to \emph{higher} logistic losses as it approaches the perceptron algorithm, yet they also lead to \emph{faster} convergence to a solution of the classification problem. The logistic loss is therefore an unreliable measure of proximity to a solution; surprisingly, high loss values can actually indicate faster convergence. Third, since the convergence rate of LR+GD in terms of loss function values is unreliable, we examine the iteration complexity required by LR+GD with large step sizes to solve the classification problem and prove that this complexity is suboptimal. To address this, we propose a new method, Normalized LR+GD, based on the connection between LR+GD and the perceptron algorithm, with much better theoretical guarantees.
\end{abstract}
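
The reduction stated in the abstract can be sketched informally as follows. The notation here is assumed for illustration only and need not match the notation used in the rest of the paper: feature vectors $a_i$, labels $y_i \in \{-1, 1\}$, iterates $\theta_t$, and the sigmoid $\sigma(z) = 1 / (1 + e^{-z})$. Gradient descent with step size $\gamma$ on the logistic loss $f(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + \exp(-y_i a_i^\top \theta)\right)$ takes the step
\[
    \theta_{t+1} = \theta_t + \frac{\gamma}{n} \sum_{i=1}^{n} \sigma\left(-y_i a_i^\top \theta_t\right) y_i a_i .
\]
Dividing by $\gamma$ and writing $\hat{\theta}_t = \theta_t / \gamma$ gives
\[
    \hat{\theta}_{t+1} = \hat{\theta}_t + \frac{1}{n} \sum_{i=1}^{n} \sigma\left(-\gamma \, y_i a_i^\top \hat{\theta}_t\right) y_i a_i .
\]
For any fixed $z \neq 0$ we have $\sigma(-\gamma z) \to \mathbf{1}[z < 0]$ as $\gamma \to \infty$, so, for a fixed $\hat{\theta}_t$ with no zero margins, the right-hand side tends to
\[
    \hat{\theta}_t + \frac{1}{n} \sum_{i \,:\, y_i a_i^\top \hat{\theta}_t < 0} y_i a_i ,
\]
which is a perceptron-style update applied to all currently misclassified points at once. Points with exactly zero margin, which this sketch sets aside, contribute with weight $\sigma(0) = 1/2$.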