1: \begin{abstract}%
2: We consider feature selection for applications in machine learning where the dimensionality of the data is so large that it exceeds the working memory of the (local) computing machine. Unfortunately, current large-scale sketching algorithms show a poor memory-accuracy trade-off when selecting features in high dimensions, due to the irreversible collision and accumulation of stochastic gradient noise in the sketched domain. Here, we develop a second-order feature selection algorithm, called \bears{}, which avoids the extra collisions by efficiently storing the \emph{second-order stochastic gradients} of the celebrated Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm in a Count Sketch, at a memory cost that grows sublinearly with the size of the feature vector. \bears{} reveals an unexplored advantage of second-order optimization for memory-constrained high-dimensional gradient sketching. Our extensive experiments on several real-world data sets, from genomics to language processing, demonstrate that \bears{} requires \emph{up to three orders of magnitude} less memory than first-order sketching algorithms to achieve the same classification accuracy, with a comparable run time. Our theoretical analysis further proves the global convergence of \bears{} at an $\mathcal{O}(1/t)$ rate in $t$ iterations of the sketched algorithm.
3: \end{abstract}
4: