abstract:81bfc9e1af07b15f.tex

1: \begin{abstract}

2: We propose a Regularized Adaptive Momentum Dual Averaging

3: (\ramda) algorithm for training structured neural networks.

4: Similar to existing regularized adaptive methods, the subproblem

5: for computing the update direction of \ramda involves a nonsmooth

6: regularizer and a diagonal preconditioner, and therefore

7: does not possess a closed-form solution in general.

8: We thus also carefully devise an implementable inexactness

9: condition that retains convergence guarantees similar to those of the exact

10: versions, and propose a companion efficient solver for the

11: subproblems of both \ramda and existing methods to make them

12: practically feasible.

13: We leverage the theory of manifold identification in variational

14: analysis to show that, even in the presence of such inexactness,

15: the iterates of \ramda attain the ideal structure induced by the

16: regularizer at the stationary point to which they asymptotically converge.

17: This structure is locally optimal near the point of convergence,

18: so \ramda is guaranteed to obtain the best structure possible

19: among all methods converging to the same point,

20: making it the first regularized adaptive method outputting models

21: that possess outstanding predictive performance while being

22: (locally) optimally structured.

23: Extensive numerical experiments in large-scale modern computer

24: vision, language modeling, and speech tasks show that the proposed

25: \ramda is efficient and consistently outperforms the state of the art

26: for training structured neural networks.

27: Implementation of our algorithm is available at

28: \ifdefined\arxiv

29: \url{https://www.github.com/ismoptgroup/RAMDA/}.

30: \else

31: (removed for anonymous review).

32: \fi

33: \end{abstract}

34: