\begin{abstract}
We propose a Regularized Adaptive Momentum Dual Averaging
(\ramda) algorithm for training structured neural networks.
As in existing regularized adaptive methods, the subproblem
for computing the update direction of \ramda involves a nonsmooth
regularizer and a diagonal preconditioner, and therefore
does not possess a closed-form solution in general.
We therefore devise an implementable inexactness
condition that retains convergence guarantees similar to those of the exact
versions, and propose a companion efficient solver for the
subproblems of both \ramda and existing methods to make them
practically feasible.
We leverage the theory of manifold identification in variational
analysis to show that, even in the presence of such inexactness,
the iterates of \ramda attain the ideal structure induced by the
regularizer at the stationary point of asymptotic convergence.
This structure is locally optimal near the point of convergence,
so \ramda is guaranteed to obtain the best structure possible
among all methods converging to the same point,
making it the first regularized adaptive method that outputs models
possessing outstanding predictive performance while being
(locally) optimally structured.
Extensive numerical experiments on large-scale modern computer
vision, language modeling, and speech tasks show that the proposed
\ramda is efficient and consistently outperforms the state of the art
for training structured neural networks.
An implementation of our algorithm is available at
\ifdefined\arxiv
\url{https://www.github.com/ismoptgroup/RAMDA/}.
\else
(removed for anonymous review).
\fi
\end{abstract}
