
\begin{abstract}

The excessive computational cost of learning from large-scale and streaming data can be alleviated by stochastic algorithms, such as stochastic gradient descent and its variants. Recent advances have improved the convergence speed, adaptivity, and structural awareness of stochastic algorithms. However, the distributional properties of these new algorithms are poorly understood, especially for structured parameters. To develop statistical inference in this setting, we propose a class of {\em generalized} regularized dual averaging (gRDA) algorithms with constant step size, which improves on RDA \citep{X10,FB17}. Weak convergence of gRDA trajectories is studied and, as a consequence, the asymptotic distributions for online $\ell_1$ penalized problems become available for the first time in the literature. These general results apply to both convex and non-convex differentiable loss functions and, in particular, recover the existing regret bound for convex losses \citep{NJLS09}. As important applications, statistical inferential theory for online sparse linear regression and online sparse principal component analysis is developed and supported by extensive numerical analysis. Interestingly, when gRDA is properly tuned, support recovery and a central limiting distribution (with mean zero) hold simultaneously in the online setting, in contrast with the biased central limiting distribution of the batch Lasso \citep{KF00}. Technical devices, including weak convergence of stochastic mirror descent, are developed as by-products of independent interest. A preliminary empirical analysis of modern image data shows that learning very sparse deep neural networks with gRDA does not necessarily sacrifice testing accuracy.% \Cheng{mention that our theory for SMD is also new, d-SMD can be applied to deep neural networks to promote sparsity in network training and select optimal step sizes by combining with a dynamic programming approach.}

\end{abstract}
