\begin{abstract}
    Leveraging second-order information about the loss at the scale of deep networks is one of the main lines of approach for improving the performance of current optimizers for deep learning.
    Yet, existing approaches for accurate full-matrix preconditioning, such as Full-Matrix Adagrad (GGT) or Matrix-Free Approximate Curvature (M-FAC), suffer from massive storage costs even when applied to small-scale models, as they must store a sliding window of gradients, whose memory requirements are multiplicative in the model dimension.
    In this paper, we address this issue via a novel and efficient error-feedback technique that can be applied to compress preconditioners by up to two orders of magnitude in practice, without loss of convergence.
    Specifically, our approach compresses the gradient information via sparsification or low-rank compression \emph{before} it is fed into the preconditioner, feeding the compression error back into future iterations.
    Experiments on deep neural networks show that this approach can compress full-matrix preconditioners to up to 99\% sparsity without accuracy loss, effectively removing the memory overhead of full-matrix preconditioners such as GGT and M-FAC.
    Our code is available at \url{https://github.com/IST-DASLab/EFCP}.
\end{abstract}
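
For illustration only, the error-feedback step described in the abstract can be sketched as follows. The notation is an assumption introduced here and is not taken from the paper: $g_t$ denotes the gradient at step $t$, $e_t$ the accumulated compression error (with $e_0 = 0$), $a_t$ the error-corrected gradient, and $\mathcal{C}$ a generic compressor such as Top-$k$ sparsification or a low-rank projection.
\[
  a_t = g_t + e_{t-1}, \qquad c_t = \mathcal{C}(a_t), \qquad e_t = a_t - c_t .
\]
Under this sketch, only the compressed vector $c_t$ enters the preconditioner's sliding window, while the discarded part $e_t$ is added back to the next gradient rather than lost.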