\begin{abstract}
%
Adaptive gradient-based optimizers such as Adagrad and Adam are crucial for
achieving state-of-the-art performance in machine translation and language
modeling. However, these methods maintain second-order statistics for each
parameter, thus introducing significant memory overheads that restrict the
size of the model being used as well as the number of examples in a
mini-batch. We describe an effective and flexible adaptive optimization method
with greatly reduced memory overhead. Our method retains the benefits of
per-parameter adaptivity while allowing significantly larger models and batch
sizes. We give convergence guarantees for our method, and demonstrate its
effectiveness in training very large translation and language models with up
to 2-fold speedups compared to the state-of-the-art.
%
\end{abstract}