\begin{abstract}
%
Adaptive gradient-based optimizers such as Adagrad and Adam are crucial for
achieving state-of-the-art performance in machine translation and language
modeling. However, these methods maintain second-order statistics for each
parameter, thus introducing significant memory overhead that restricts both
the size of the model being used and the number of examples in a
mini-batch. We describe an effective and flexible adaptive optimization method
with greatly reduced memory overhead. Our method retains the benefits of
per-parameter adaptivity while allowing significantly larger models and batch
sizes. We give convergence guarantees for our method and demonstrate its
effectiveness in training very large translation and language models with up
to 2-fold speedups compared to the state of the art.
%
\end{abstract}